Why Flash Changes Everything, Part 3
I have been justly accused of belaboring the top ten list format in the past, but I've reached new lows by starting one in March (Why Flash Changes Everything, Part 1), continuing it in May (Part 2), and only now finishing it in August. In my defense, Part 3 of Why Flash Changes Everything was too juicy to share in advance of our company launch today.
Without further ado, the #1 reason Why Flash Changes Everything is
(1) High-performance data reduction.
Let me explain. Data reduction techniques like deduplication and compression have been around in disk storage for many years, but have so far failed to gain traction for performance-intensive workloads. Why?
Deduplication effectively entails replacing a contiguous block of data with a series of pointers to pre-existing segments stored elsewhere. On read, each of those pointers represents a random I/O, requiring many disks to spin to fetch the randomly located pieces of data and reconstruct the original. Disk is hugely inefficient on random I/O (wasting >95% of time and energy on seeks and rotations rather than data transfers). Better to keep the data more contiguous, so that disk can stream larger blocks on and off once the head is in the right place.
There is also the challenge of validating duplicates on write. If hashing alone is used to verify duplicate segments, then a hash collision, albeit very low probability, could cause data corruption. For backup and archive datasets, such risk is more acceptable than it is for primary storage. In our view, primary storage should never rely on probabilistic correctness. This is why Pure Storage always compares candidate dedupe segments byte for byte before reducing them, but this too leads to random I/O.
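The two ideas above can be sketched together. Below is a hypothetical in-memory model of a dedupe write path: blocks are split into fixed-size segments, each segment is hashed, and a hash match is confirmed byte for byte before the segment is reduced to a pointer. The segment size, structures, and names are illustrative assumptions, not any vendor's actual design.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed segment granularity for this sketch

store = {}      # hash digest -> stored segment bytes
block_map = []  # logical block -> list of digests (the "series of pointers")

def write_block(data: bytes) -> None:
    pointers = []
    for i in range(0, len(data), SEGMENT_SIZE):
        seg = data[i:i + SEGMENT_SIZE]
        digest = hashlib.sha256(seg).digest()
        if digest in store:
            # A hash match alone is probabilistic; verify byte for byte
            # before treating the segment as a duplicate.
            if store[digest] != seg:
                # Genuine collision (astronomically unlikely); a real system
                # would disambiguate rather than abort.
                raise RuntimeError("hash collision detected")
            # True duplicate: store only the pointer, not the data.
        else:
            store[digest] = seg
        pointers.append(digest)
    block_map.append(pointers)

def read_block(index: int) -> bytes:
    # Each pointer dereference here is a random I/O on disk;
    # on flash it carries no seek penalty.
    return b"".join(store[d] for d in block_map[index])
```

Note how the read path chases one pointer per segment: on spinning disk each dereference can mean a seek, which is exactly why dedupe historically stayed confined to backup workloads.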
Since deduplication depends upon random I/O, with disk it inevitably leads to spindle contention, driving down throughput and driving up latency. No wonder dedupe success in disk-centric arrays has been limited to non-performance-intensive workloads like backup and archiving.
With flash there is no random access penalty. In fact, random I/O may be even faster as it enlists more parallel I/O paths. And with flash writes more expensive than reads, dedupe can actually accelerate performance and extend flash life by eliminating writes.
Compression presents different challenges. For a disk array that does not use an append-only data layout (e.g., the majority of SANs rely on update in place), compression complicates updates: read, decompress, modify, recompress—but now the result may no longer fit back into the original block. In addition to accommodating compression, append-only data layouts (in which data is always written to a new place) are generally much friendlier to flash, as they help avoid flash cell burn-out by amortizing I/O more evenly across all of the flash. (Placing this burden instead on an individual SSD controller leads to reduced flash life for those SSDs holding hot data.)
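A minimal sketch shows why append-only layouts sidestep the recompression problem: an updated record can be any size, because it is simply appended to the end of the log and the index repointed. This is an illustrative toy, not any vendor's on-media format.

```python
import zlib

log = bytearray()  # the append-only medium
index = {}         # logical block id -> (offset, length) of the latest version

def write(block_id: int, data: bytes) -> None:
    compressed = zlib.compress(data)
    # An update never rewrites in place: the new compressed record,
    # whatever its size, goes at the end of the log.
    offset = len(log)
    log.extend(compressed)
    index[block_id] = (offset, len(compressed))

def read(block_id: int) -> bytes:
    offset, length = index[block_id]
    return zlib.decompress(bytes(log[offset:offset + length]))
```

The stale copy left behind in the log is reclaimed later by garbage collection—a cost an update-in-place array avoids, but at the price of the read-modify-recompress-refit problem described above.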
Finally, flash is both substantially faster and more expensive than disk. For backup, data reduction led to a media change—swapping disk for tape—by making disk cost-effective. Over the next decade the same thing is going to happen in primary storage. Pure Storage is routinely achieving 5-20X data reduction on performance workloads like virtualization and databases (all without compromising submillisecond latency). At 5X data reduction, the data center MLC flash we use hits price parity with performance disk (think 15K Fibre Channel or SAS drives). At 10X, which we routinely approach for our customers' database workloads, we are about half the price of disk. 15X, a third. And at 20X, which we typically approach for our customers' virtualization workloads, we are roughly one quarter the cost!
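The arithmetic behind those ratios is simple: effective cost per usable gigabyte is raw cost divided by the reduction ratio. If we normalize performance disk to 1.0 and take the parity-at-5X claim as implying flash raw cost is about 5x disk (both figures are placeholders derived from the article, not real prices), the rest of the ratios fall out:

```python
disk_cost = 1.0       # performance disk, normalized (placeholder)
flash_raw_cost = 5.0  # implied by parity at 5X reduction (placeholder)

for reduction in (5, 10, 15, 20):
    effective = flash_raw_cost / reduction
    print(f"{reduction}X reduction -> {effective / disk_cost:.2f}x the cost of disk")
```

At 5X the ratio is 1.00 (parity), at 10X it is 0.50, at 15X roughly 0.33, and at 20X 0.25—matching the half, third, and quarter figures above.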
In retrospect, then, high-performance data reduction is an obvious #1 reason Why Flash Changes Everything: Flash provides the additional IOPS capacity necessary to make data reduction feasible for performance workloads. And in turn, data reduction is going to enable solid-state flash to replace mechanical disk for literally all of the random I/O performance workloads in the data center by making it cost effective to do so.
With flash faster, more space and power efficient, more reliable, and cheaper than disk, why buy disk?