by Tim Standing, Vice President of Software Engineering, OWC Holdings Inc.
December 1, 2017
Although we are hard at work on SoftRAID version 6 (which will support APFS), we’re still creating maintenance releases of our currently shipping version, SoftRAID version 5. And one feature we really wanted to add to the next SoftRAID version 5 release was the reporting of the TBW (Total Bytes Written) value for SSDs.
What is the TBW and why does it matter?
The TBW measures how many bytes have been written to an SSD since it left the factory. You can think of the TBW as a car’s odometer, logging SSD usage instead of miles. A car’s odometer grows with every mile driven, while the TBW steadily increases with every byte written to the SSD.
The reason this value matters is that some SSD manufacturers now include a maximum TBW value in their warranty terms. For instance, Samsung now warrants all their SSDs either for a certain number of years, or for a maximum number of terabytes written, whichever comes first.
This is just like the warranty for your car, which might be 3 years or 36,000 miles, whichever comes first. If you’ve only owned your car for 25 months but have driven 45,000 miles, you can no longer get repairs under warranty. Likewise, if your SSD stops working after you’ve had it for only a couple of months, you won’t be able to replace it under warranty if you’ve exceeded the permissible number of bytes written. How would you even know how many bytes that is without having access to the TBW value?
We wanted to make it easy for users to see the TBW value for the SSDs they were using, so they could better manage SSD usage and keep below the manufacturer’s threshold for warranty coverage. So, for the latest release version of SoftRAID 5 (5.6.4), we’ve introduced code to retrieve and display the TBW for SSDs, making it easy to track your SSD usage.
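To give a sense of where a TBW number can come from (this is an illustration, not SoftRAID’s actual code): NVMe SSDs expose a “Data Units Written” counter through SMART, which tools such as smartctl can print, and the NVMe specification defines each unit as 1,000 sectors of 512 bytes. A minimal Python sketch that converts that field into bytes:

```python
import re

def tbw_from_smartctl(report: str) -> int:
    """Estimate total bytes written from the text output of `smartctl -A`.

    NVMe drives report "Data Units Written"; per the NVMe spec, each
    unit is 1,000 sectors of 512 bytes, i.e. 512,000 bytes.
    """
    match = re.search(r"Data Units Written:\s*([\d,]+)", report)
    if match is None:
        raise ValueError("no Data Units Written field found")
    units = int(match.group(1).replace(",", ""))
    return units * 512_000

# A sample line in the format smartctl prints for an NVMe drive
sample = "Data Units Written:                 31,234,567 [15.9 TB]"
print(tbw_from_smartctl(sample))  # 15992098304000 bytes, roughly 16 TB
```

SATA SSDs report the same idea through vendor-specific SMART attributes instead (often a total-LBAs-written count multiplied by the sector size), which is exactly why the reporting varies between brands and models.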
We need your help!
The way an SSD reports this data differs between manufacturers; sometimes it even varies between different models from the same manufacturer! Since it’s hard for us to know how every single model of SSD from every manufacturer does its reporting, we’re asking SoftRAID users to help us get the data we need to calculate the TBW numbers accurately for different models of SSD. We want to make this feature as reliable as possible for everyone, no matter what SSD they are using.
Help us calculate TBW numbers accurately for different SSDs
If you expand the disk tile of an SSD while running SoftRAID version 5.6.4, you’ll see the TBW for your SSD with the label “total bytes written”.
If the SoftRAID application does not show this information it means we’re not able to accurately report TBW data for the particular SSD you are using. If this happens, please let us know and we’ll work with you to get the data necessary to add support for your brand of SSD.
by Tim Standing, Vice President of Software Engineering, OWC Holdings Inc.
November 17, 2017
After 16 months of using and testing APFS—Apple’s new file system—I’ve come to the conclusion that you probably don’t want to use it on HDDs (disks with rotating platters).
Why? Well, to understand why APFS and HDDs are not well suited, I first need to explain one of the key features of APFS: ‘copy on write’. ‘Copy on write’ is the magic behind the snapshot feature in APFS and also allows you to copy really large files in only a couple of seconds. However, to fully understand the ‘copy on write’ process, and the implications of using APFS with HDDs, it helps first to know how copying works with HFS Extended volumes…
Copying a file on an HFS Extended volume
HFS Extended is the file system Apple has been using for almost 30 years, the one which all Macs running macOS 10.12 or earlier use for their startup volumes.
For my example, I am using a 10 GB movie file, “Nina’s Birthday.mp4”, which is stored in two separate blocks of data on the volume. When I play this movie file on my computer, my Mac will first read the first block and then go straight on to read the second block; it seamlessly moves from one block to the next so that, to the viewer, the movie appears as if it was a single block of data. Files on your Mac can exist in one or many blocks. Small files usually exist in one block whereas larger files are often broken up into 2 or more blocks so they can fit into the available free space in a volume.
Unlike SSDs, HDDs are mechanical devices with spinning disks (aka platters) containing your volume’s data, and heads that move over the disk in order to read that data. When an HDD has to go to a new part of a disk, there is a delay while the head moves to the new location and waits for the correct part of the disk platter to be under the head so it can start reading. This delay is usually 4–10 msec (1/250–1/100 of a second). You probably won’t notice a delay when reading a file which is in 2 or 3 blocks, but reading a file which is made up of 1,000 or 10,000 blocks could be painfully slow.
Each of the one or more blocks that make up a single file is called an extent. The file system maintains a table of these extents (one per file) called an extents table. The extents table records the location of every block in the file (the offset) and the length of that block (length). In this way, the computer knows where to go on the disk and how much data to read from that location. For every block of data in a file there is an offset and a length, which together make up a single extent in the extents table. This is the important thing to remember when you go on to read about how APFS deals with files. The “Nina’s Birthday.mp4” file in my example has two extents, the first of which is 2 GB in length and the second of which is 8 GB.
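In code terms, an extents table is just an ordered list of (offset, length) pairs. A tiny sketch of the example file (the disk offsets here are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Extent:
    offset: int  # where the block starts on the volume, in bytes
    length: int  # how many contiguous bytes live at that offset

GB = 1024**3

# "Nina's Birthday.mp4": two extents, 2 GB and 8 GB long
extents_table = [
    Extent(offset=5 * GB, length=2 * GB),
    Extent(offset=40 * GB, length=8 * GB),
]

# The file's size is simply the sum of its extent lengths
print(sum(e.length for e in extents_table) == 10 * GB)  # True
```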
So let’s say I need to make a copy of this file. When I copy the file on my HFS Extended volume, my Mac reads the file’s data, locates a free space in the volume for the copy, and then writes the copied data out. If it can, the Mac will write the new file out as a single block. However, in my example, the volume doesn’t contain a single block of space that is 10 GB in size so it has to write out the file as 2 blocks: the first 4 GB in length and the second 6 GB. Both the original file and the copy can be read relatively quickly because each has only 2 blocks, thus 2 extents.
If I now edit the original movie and add four edits (say transitions between different scenes), when I save the changes, they will be written out over the existing data for this file. Even after the edits, my movie file will still contain only 2 extents and can be read relatively quickly.
Copying a file on an APFS volume
For my example with an APFS volume, I will start with the same movie file, “Nina’s Birthday.mp4,” which is made up of 2 extents, the first 2 GB in length and the second 8 GB.
When I copy this file on an APFS volume, the file data doesn’t actually get copied to a new location on the disk. Instead, the file system knows that both the original and the copy contain the exact same data, so both the original file and its copy point to (reference) the same data. They may look like separate files in the Finder but, under the hood, both filenames point to the same place on the disk. And although the original and the copy each has its own extents table, the extents tables are identical.
This is the magic of copy on write and the reason copying a 100 GB file only takes a few seconds: no data is actually being copied! The file system is just creating a new extents table for the copy (there may be other information it needs to keep track of for the new file, but that’s not important in this example).
I mentioned above that with APFS, an original file and its copy will have identical extents tables. However, this is true only until you make a change to one of them. When I go to create the same 4 transitions in my movie that I created when using my HFS Extended volume, APFS has to find new, unused, space on the disk to store these edits. It can’t write the edits over the original file, like the HFS Extended volume does, because then the changes would exist in both the original file and its copy—remember that the extents table for the file and its copy point to the same location on the disk. So that would be really bad.
Instead, APFS creates a new extent for each of the edits. It also has to create a new extent for the remaining data after the transition, the part of the movie which comes after the transition and which is still the same in both the original movie and its copy. Therefore, for each non-contiguous write, the file system has to create 2 new extents, one for the changed data and one for the original data (common to the original file and its copy) which follows the new data. If this sounds complicated it’s because it is—requiring multiple back-and-forths between the locations of the original file and the files with all the changes. Each back-and-forth is recorded as a new extent.
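A toy simulation of this extent splitting (a simplified model, not APFS’s actual on-disk format, with invented offsets) shows how quickly the count grows. Each copy-on-write edit redirects the changed bytes to fresh space on the disk and splits the extent it lands in:

```python
def apply_cow_edit(extents, pos, length, new_offset):
    """extents: ordered list of (disk_offset, length) covering the file.
    A copy-on-write edit of `length` bytes at file position `pos` is
    written to fresh disk space at `new_offset`. Assumes the edit falls
    entirely within one extent, as the small edits below do."""
    out, file_pos = [], 0
    for disk_off, ext_len in extents:
        if file_pos + ext_len <= pos or file_pos >= pos + length:
            out.append((disk_off, ext_len))           # untouched extent
        else:
            head = pos - file_pos
            tail = file_pos + ext_len - (pos + length)
            if head > 0:
                out.append((disk_off, head))          # unchanged data before the edit
            out.append((new_offset, length))          # edited data, written elsewhere
            if tail > 0:
                out.append((disk_off + head + length, tail))  # unchanged tail
        file_pos += ext_len
    return out

GB = 1024**3
extents = [(5 * GB, 2 * GB), (40 * GB, 8 * GB)]      # the two-extent movie
for i in range(4):                                    # four small transition edits
    extents = apply_cow_edit(extents, pos=(3 + i) * GB, length=1024,
                             new_offset=(100 + i) * GB)
print(len(extents))  # 10
```

Starting from the two-extent movie, the four transition edits leave the file with ten extents.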
After writing out my 4 transitions, my original movie file now has 10 extents. This might not seem like a lot of extents, but that’s for only 4 edits! Editing an entire movie, or even just retouching a photo, could result in thousands of extents. Imagine what happens with the file used to store your mail messages when you are getting hundreds or thousands of messages a week. And if you are someone who uses Parallels or VMware Fusion, each time you start up your virtual machine it probably results in 100,000 writes. You can see that any of these types of files could easily end up with many thousands of extents.
Now imagine what will happen when your Mac goes to read a file with a thousand or more extents on an HDD. As the file system reaches the end of one extent and starts reading from the next one, it has to wait the 4–10 msec for the disk’s platter and head to get aligned correctly to begin reading the next extent. Multiply this delay by 1,000 or more and the time taken to read these files could become unbearably long.
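The arithmetic is easy to check with a back-of-the-envelope sketch (assuming a 7 msec average seek, the middle of the 4–10 msec range quoted above):

```python
def extra_read_time_ms(extent_count: int, seek_ms: float = 7.0) -> float:
    # One head repositioning at each extent boundary
    return (extent_count - 1) * seek_ms

print(extra_read_time_ms(2))     # 7.0 msec: unnoticeable
print(extra_read_time_ms(1000))  # 6993.0 msec: about 7 seconds of pure seeking
```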
This long delay when reading large files is the reason I don’t recommend using APFS on HDDs. The delay only occurs with files which have been written to a lot, and only if the file has been copied or the volume has a snapshot. But who wants to use a volume where you have to remember not to copy files or use Time Machine?
I think Apple is aware of this problem as they tell you not to automatically convert startup volumes on HDDs to APFS when upgrading to High Sierra. In addition, when erasing a disk, the Disk Utility application only chooses APFS as the default file system if it can confirm that the disk is an SSD.
The proof: I knew from the start that this was how copy on write was supposed to work, but just to be sure, I wanted to see what was actually going on at the disk level. Since I am the developer of SoftRAID, I can use the SoftRAID driver to watch exactly what the file system is doing.
I created a special version of the SoftRAID driver which allowed me to record where on a disk the file system was reading and writing data and how much data was transferred each time. I then edited a file on both HFS Extended and APFS volumes.
With a file on an HFS Extended volume, I could see the original data being overwritten in the same location. I saw this same behavior with a file on an APFS volume as long as the file had not been copied or a snapshot did not exist for this volume. As soon as I copied the file or created a snapshot of the volume, all writes were made to new locations on the volume, locations which were not part of the original file.
by Tim Standing, Vice President of Software Engineering, OWC Holdings Inc.
November 15, 2017
From previous testing, I knew that an HFS Extended volume with 2 ThunderBlades striped together could go faster than 4 gigabytes per second on both reads and writes.
Speed of HFS Extended volume on 2 OWC ThunderBlades striped together with SoftRAID
So just how fast is the release version of APFS which comes with High Sierra? Can it keep up with HFS? To find out, I created an APFS volume using a beta copy of SoftRAID version 6 and two ThunderBlades.
As you can see, the APFS volume is almost exactly the same speed as HFS Extended.
Speed of APFS volume on 2 OWC ThunderBlades striped together with SoftRAID
APFS & HFS Extended Finder copy speeds with ThunderBlades, SSDs and HDDs
note: the taller columns represent slower copying speeds, so shorter is better!
3 years after setup, SoftRAID 4 saves FutureProof Records’ files
In January 2014, Nigel Hobden configured a Mac Pro 5.1, running Mac OS 10.8.5 (Mountain Lion), for Phil Legg and the team at FutureProof Records. FutureProof Records and Promotions—a UK music company dealing in music promotion and distribution—needed a system to store and protect a large amount of crucially important files and applications.
Nigel’s configuration included a pair of 2TB drives, mirrored with SoftRAID 4.5.4, plus a Time Machine backup disk in one of the bays, as an extra precaution.
Critical applications running on the Mac Pro required them to stick with Mountain Lion, so they just let the system keep running, which it has done continuously since setup. The machine has crashed two or three times since January 2014; each time, SoftRAID 4 alerted them that the volume was out of sync, then successfully rebuilt the mirror volume. Other than these occasional glitches, the configuration has been totally stable. The OS was never updated, and neither was the version of SoftRAID.
Fast forward to July 2017: Three and a half years after setup, on July 9, 2017, SoftRAID 4 popped up a dialog box: “Disk predicted to fail”
Even though the system appeared to be running fine, Phil at FutureProof was concerned, so, as advised in the online help, he decided to generate a technical support file and send it to SoftRAID support. When he launched SoftRAID to create the file, he discovered that one of the disks had a “Failure Predicted” alert – a much more critical issue. Sectors were being reallocated at an alarming rate.
Phil immediately emailed SoftRAID, writing “This is obviously most concerning in terms of data loss, and [we] would appreciate if you could let me know how to resolve this issue in the best possible way as quickly as possible.”
Remember, this was a much older version of SoftRAID (the current version is 5.6.1); SoftRAID 4.5.4 is no longer shipped. Not many companies would rush to help someone using an older, unsupported version of their software, but, fortunately for Phil, SoftRAID stands by its software, and its customers.
Mark James, at SoftRAID technical support, got back to Phil right away, telling him that the disk urgently needed replacing. Mark explained that when a disk starts reallocating sectors that quickly, the disk is in a death spiral and may not last even a few days. Phil ordered new disks on the spot.
“No matter how old the version, we stand by our software and its ability to protect the user against the disaster of disk failure”—SoftRAID Team
Mark also noted that the Mirror disks used in the system had over 30,000 hours of use each, so were at the end of their expected life. In addition, the system’s older Hitachi boot disk had over 50,000 hours of use—well past the expected lifetime of that drive.
Although the disks have been replaced, the Mac Pro is still in service, still running Mountain Lion, but is now benefitting from the even greater protection offered by SoftRAID 5.6.1, which displays reallocation sector counts directly in the disk tile.
It’s a story with a happy ending and a satisfied customer, thanks to the SoftRAID team’s belief that no matter how old the version, we stand by our software and its ability to protect the user against the disaster of disk failure.
We also believe in providing amazing technical support, with pretty much round-the-clock service, to ensure that our customers get the help they need—whenever they need it—to keep their data safe!
Phil Legg and FutureProof certainly think so!
by Jarrod Rice, Content Marketing Manager, OWC
February 28 2017
For industry-leading data recovery specialists DriveSavers, the recovery process is a numbers game of zeros and ones.
When a customer sends in a drive that is physically damaged, the company’s recovery engineers make an image of the binary code—the fundamental zeros and ones—that makes up the drive’s data. From there, DriveSavers Director of Engineering Mike Cobb says the team turns these raw numbers back into customer files. “It’s kind of like a photocopy,” Cobb explains. “That way, whatever happens down the line, we don’t have to go back to the original damaged drive and try to retrieve it again.”
“RAID 5 is incredibly vital to us because hard drives can and will fail. We want to have the best chance of getting around a failure if there is one. SoftRAID gives us that necessary assurance.”—Mike Cobb, DriveSavers
Those “photocopies” are crucial to the process of safely recovering lost data. Alongside those fundamental zeros and ones, a third number features heavily in the workflow at DriveSavers too—5. As in RAID 5. Like countless other professionals, DriveSavers relies on SoftRAID’s unmatched RAID 5 capabilities.
RAID 5 for Peace of Mind, Performance, and Protection
DriveSavers works with the digital “photocopies” and recovered data on their active servers. It’s here where extreme data safety without sacrificing performance is paramount, so DriveSavers turns to SoftRAID. Because even the masters of data recovery have a backup plan.
“Right now we have over a petabyte of active storage, but that space is not unlimited for customer data. With SoftRAID, we can offload…that data into a fairly large data set that’s redundant and put it into a cold storage, …freeing up our servers, which saves costs.”—Mike Cobb, DriveSavers
“Our servers are temporary storage, so if there are circumstances where the data is held here internally for an extended period of time, that’s where SoftRAID comes in,” Cobb says. “We use SoftRAID because of its implementation of RAID 5 and because redundancy is built in.”
DriveSavers, along with countless other pros, relies on RAID 5 because it provides professional-level performance along with data protection through parity. For each stripe of data in a RAID 5 configuration, a parity block is calculated and stored on one of the drives, with parity distributed in turn across all the drives. If one drive fails, it can be replaced and its contents rebuilt from the data and parity on the remaining drives.
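The parity math itself is a simple XOR across the data blocks in a stripe; because XOR is its own inverse, any one missing block can be recomputed from the survivors. A minimal sketch (not SoftRAID’s implementation):

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR same-length byte blocks together, as RAID 5 parity does."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# One stripe across three data drives, plus its parity block
d0, d1, d2 = b"\x0f\xf0", b"\xaa\x55", b"\x12\x34"
parity = xor_blocks(d0, d1, d2)

# Drive 1 fails: rebuild its block from the surviving data and parity
rebuilt = xor_blocks(d0, d2, parity)
print(rebuilt == d1)  # True
```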
SoftRAID delivers the most advanced and configurable software RAID 5, ideal for the kind of mission-critical data protection that DriveSavers needs on a day-to-day basis. “RAID 5 is incredibly vital to us because hard drives can and will fail. We want to have the best chance of getting around a failure if there is one. SoftRAID gives us that necessary assurance,” Cobb added. Offloading data to a long-term, cold storage solution powered by SoftRAID also saves precious server space, which makes fiscal sense for DriveSavers.
“DriveSavers relies on SoftRAID’s Drive Certification feature to do the testing manufacturers don’t. Quickly and efficiently certifying every sector on every drive they use gives DriveSavers added confidence and prevents potential disaster.”
“Right now we have over a petabyte of active storage, but that space is not unlimited for customer data,” Cobb said. “With SoftRAID, we can offload and consolidate that data into a fairly large data set that’s redundant and put it into a cold storage. It’s really about freeing up our servers, which saves costs.”
Certification Means Even More Data Safety
Because HDD manufacturers don’t test every sector on a disk before shipping, DriveSavers relies on SoftRAID’s Drive Certification feature to do the testing manufacturers don’t.
SoftRAID writes a random pattern out to every sector and then verifies the pattern to make sure every sector on the disk is working reliably. Quickly and efficiently certifying every sector on every drive they use gives DriveSavers added confidence and prevents potential disaster.
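The write-then-verify idea can be sketched in a few lines of Python (this demo uses a scratch file and a hash-derived pattern; SoftRAID’s actual pattern and raw-device access are different):

```python
import hashlib
import os
import tempfile

SECTOR = 512  # bytes per sector

def certify(path: str, sectors: int, seed: bytes = b"cert") -> bool:
    """Write a deterministic pseudo-random pattern to every sector,
    then read each sector back and verify it matches."""
    def pattern(n: int) -> bytes:
        # Derive a repeatable 512-byte pattern from the sector number
        digest = hashlib.sha256(seed + n.to_bytes(8, "big")).digest()
        return (digest * (SECTOR // len(digest) + 1))[:SECTOR]

    with open(path, "wb") as f:
        for n in range(sectors):
            f.write(pattern(n))
    with open(path, "rb") as f:
        return all(f.read(SECTOR) == pattern(n) for n in range(sectors))

# Demo against a scratch file rather than a raw disk device
tmp = os.path.join(tempfile.gettempdir(), "softraid_cert_demo.bin")
print(certify(tmp, sectors=64))  # True
```

Real certification writes to the raw device and can take many hours on a large drive, since every sector must be touched twice.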
“For drive certification, we’re verifying that the drives we’re using in SoftRAID are essentially free of sector flaws. We touch every sector and write patterns to all of them. We then read from every one of the sectors and verify that a drive is sound for a RAID environment and able to last a long period of time.”
If drives are flawed, SoftRAID helps DriveSavers find out why. And, SoftRAID Specialist Tech Support is there to help with any issue the DriveSavers team comes across. Detailed private reports are generated easily, so the SoftRAID Tech Support team sees exactly what the user is seeing.
“During drive certification, it’s common to find multiple failures,” Cobb said. “And we’ve worked with the SoftRAID tech support team to find out why these drives fail. SoftRAID makes it very easy to send the actual crash logs to Tech Support so you can diagnose problems very quickly. They’ve always done a great job.”
Second to None Data Loss Prevention
When it comes to data recovery, DriveSavers is second to none. The company’s motto is “We can save it!” and that’s proven every day by the highest data recovery success rate in the industry. And while it’s a comforting thought that DriveSavers is there for the worst case scenario, preventing loss in the first place with the right RAID is the best way to keep your data safe. Because even the top data recovery experts have a backup plan.
by Tim Standing, VP of Engineering at SoftRAID
October 6 2016
A kernel panic is a crash, the kind where your computer just freezes—the mouse stops moving and everything on the screen is still. A few seconds later, your computer reboots and you may have lost all the work you had just been doing.
We pride ourselves on fixing 100% of the bugs in SoftRAID which cause kernel panics. We know how disruptive they can be to your everyday life and don’t want to be responsible for you losing work.
So I was really troubled 6 months ago when I started getting kernel panics on my development Mac, the one I use to create the SoftRAID product. I had just started using new virtualization software allowing me to run several different versions of Mac OS X on the same Mac at exactly the same time. Since the kernel panics started when I started using the new software, it seemed natural that the software was causing the kernel panics.
At first, I encountered the kernel panics every 2–3 weeks, always after I had run the virtualization software. Then they started happening every week, finally every 2–3 days. It was getting to be a real nuisance.
During this time, Mark James, one of our support engineers, was helping a customer who was also experiencing kernel panics. The customer said they were happening on his 2013 Mac Pro, while using a RAID 4 volume, created using SoftRAID, and made of 4 SSDs. The customer naturally assumed the kernel panics were caused by SoftRAID because they started occurring after he created his SoftRAID volume.
The interesting point was that this customer only encountered the bug when 64 GB of RAM was installed in the Mac Pro.
We tried to reproduce the problem, using the same amount of RAM (64 GB). However, we could not get the kernel panics to happen. A few weeks later, I asked Mark if he had resolved the issue with the customer. “Yes,” he said, “it was bad RAM. Once the customer replaced the RAM, the kernel panics disappeared entirely.”
The next day, I purchased replacement RAM for my development Mac. Since I installed it, 6 weeks ago, I haven’t had a single kernel panic.
So even though we all want to believe that kernel panics are caused by inferior software, sometimes it is actually a hardware problem—like bad RAM!
At August’s 2016 Flash Media Summit (in Santa Clara, CA) SoftRAID’s VP of Engineering, Tim Standing, talked about the challenges around SSD failure prediction.
September 9 2016
Tim started off talking about SoftRAID’s efforts to make storage more reliable: “In 2010, we added a feature for predicting disk failure, which used the results from a Google study on 100,000 rotating media disk drives. This feature can warn users weeks or months before a disk fails. The feature predicts about 75% of disk drive failures, the other 25% of the failures happen without any warning.”
SoftRAID’s success in predicting disk failure in rotating media spurred Tim and his team to develop a similar system for SSDs: “After we saw the power of failure prediction, we wanted to develop the same feature for SSDs.”
For those of us who don’t know why SSDs can’t use the same process as rotating media for failure prediction, Tim explains: “When disks with rotating media are about to fail, they start reallocating sectors. We can use the reallocated sector count as an indicator for impending disk failure; the more sectors reallocated, the nearer the disk is to failure. Unfortunately, this technique doesn’t work with SSDs because SSDs reallocate sectors during everyday use—every time a flash memory block stops working, the controller reallocates another block of flash memory to replace it. It’s not unusual for a healthy SSD to have thousands of reallocated sectors.”
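In code terms, the rotating-media heuristic Tim describes boils down to watching whether the reallocated sector count keeps climbing. A toy sketch (the rules here are invented for illustration and are not SoftRAID’s actual thresholds):

```python
def hdd_failure_risk(reallocated_now: int, reallocated_last_week: int) -> str:
    """Toy heuristic for rotating drives: a reallocated sector count
    that keeps growing suggests the drive is on its way out."""
    if reallocated_now == 0:
        return "healthy"
    if reallocated_now > reallocated_last_week:
        return "failure predicted"   # count is still climbing
    return "watch"                   # stable count: keep monitoring

print(hdd_failure_risk(120, 40))  # failure predicted
print(hdd_failure_risk(8, 8))     # watch
print(hdd_failure_risk(0, 0))     # healthy
```

As the rest of the talk explains, this kind of check works for HDDs precisely because a healthy HDD reallocates almost nothing, while a healthy SSD reallocates sectors all the time.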
So another technique needed to be used for failure prediction in SSDs, and Tim thought his team had found it: “We were excited to discover that SSDs contain a Media Wearout Indicator as one of their SMART parameters.”
Tim then described how the Media Wearout Indicator works: “Remember that SSDs have 10–20% extra flash memory (a 100 GB SSD actually contains 110–120 GB of flash memory). This extra flash memory is used to replace flash memory blocks that wear out as the SSD is used. The Media Wearout Indicator displays the amount of extra flash memory still available in an SSD. It goes from 100% when the SSD is new down to 0% when all this extra flash memory has been used up.”
However, as Tim went on to explain, the Media Wearout Indicator didn’t turn out to be quite as useful as expected: “We had high hopes that this indicator would provide us with a predictive indicator for impending failure. Two years ago, we incorporated a mechanism for monitoring it into SoftRAID. Since then, we have seen no SSDs which have failed because all their extra flash memory has been consumed. All the SSDs we have seen fail have failed with the Media Wearout Indicator well above 80%. We are still trying to develop a reliable mechanism for predicting when SSDs will fail.”
After his talk, Tim spoke to Chris Bross of DriveSavers Data Recovery, Inc., who said that their experience was exactly the same. SSDs fail catastrophically and without warning, and the Media Wearout Indicator is not useful in predicting when they will fail.