Dedupe Retrofitting Fables and Foibles
We’ve long argued at Pure Storage that deduplication for primary storage is quite different from backup deduplication, and that it is very difficult to retrofit deduplication to legacy storage arrays that weren’t designed for it. For the latest chapter in this series, I invite you to read one of the more entertaining news articles of the week, from Carol Sliwa at TechTarget: Roadmap for Dell Fluid File System, Data Reduction Revealed. You really should read the article for yourself (it’s worth it), but you can get the gist from this quote:
“We were supposed to have Ocarina on the Dell Fluid File System. Didn’t happen. Some problems cropped up, and they had to go back to the drawing board and come up with a different way of doing integration. That’s at least a year behind schedule. It was supposed to come out at the very, very end of 2011, and now it’ll come out in early 2013.” –Dell
So why is retrofitting dedupe to existing storage arrays so hard? There are really three key problems:
- Metadata management – legacy controllers weren’t designed for it. Dedupe and compression require large amounts of metadata, and the smaller and more flexible the chunk size you use to detect and store similarities, the more metadata you produce. Because disk is so slow, legacy arrays have to cache this metadata in their controllers’ DRAM if they are to have any hope of reasonable performance, and those controllers are generally running pretty busy as it is…so adding the metadata for dedupe is non-trivial, and harder still if you want any reasonable level of performance.
- Storing dedupe data requires highly-virtualized back-end data structures. Legacy arrays tended to have a fairly fixed data layout: volumes were mapped to RAID structures with fairly rigid mappings in fixed-size chunks or pages. But dedupe and compression break all that. Dedupe requires the ability to flexibly store pointers and indirections, and compression means that a given block of data will take a variable amount of space – both on its first write and on any subsequent modification.
- Dedupe with performance on disk is hard. Most disk arrays rely on serializing data and keeping it nicely organized so that disks spend their time spinning, not seeking…but dedupe breaks that, inserting pesky pointers and breaking up that well-ordered write stream. Simply put: dedupe turns all IO into random IO, the most toxic type of IO for a traditional disk array.
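To make the first two problems concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the names `index_bytes` and `DedupeStore`, the 32-byte index entry, and the use of SHA-256 fingerprints are all assumptions chosen for illustration. It shows how index size grows as the chunk size shrinks, and why a deduped volume needs a layer of pointers and reference counts between logical addresses and stored chunks.

```python
import hashlib

def index_bytes(capacity_bytes, chunk_bytes, entry_bytes=32):
    """Rough index size: one entry per chunk (entry_bytes is an assumed figure)."""
    return (capacity_bytes // chunk_bytes) * entry_bytes

class DedupeStore:
    """Toy content-addressed store: logical addresses point at shared chunks."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk data (stored once)
        self.refs = {}     # fingerprint -> number of logical blocks using it
        self.volume = {}   # logical block address -> fingerprint (the pointer)

    def write(self, lba, data):
        fp = hashlib.sha256(data).digest()
        old = self.volume.get(lba)
        if old == fp:
            return                      # same content already at this address
        if fp in self.chunks:
            self.refs[fp] += 1          # duplicate: add a pointer, store nothing
        else:
            self.chunks[fp] = data      # new content: store it once
            self.refs[fp] = 1
        self.volume[lba] = fp
        if old is not None:             # release the overwritten chunk
            self.refs[old] -= 1
            if self.refs[old] == 0:
                del self.chunks[old]
                del self.refs[old]

    def read(self, lba):
        return self.chunks[self.volume[lba]]

TIB, GIB = 1 << 40, 1 << 30
for chunk in (4096, 1 << 20):
    # Index size in GiB for a hypothetical 100 TiB array
    print(chunk, index_bytes(100 * TIB, chunk) / GIB)
```

Under these assumed numbers, a 100 TiB array indexed at 4 KiB needs about 800 GiB of index, versus roughly 3 GiB at 1 MiB chunks, which is far more than a legacy controller can hold in DRAM. Note also that every read now goes through `volume` to reach a chunk that may live anywhere, which is the indirection that turns sequential IO into random IO.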
This second issue is further articulated by Dell in the article:
“Both [EqualLogic and Compellent] have some fundamental work to do to be able to put dedupe and compression in, and it has to do with how they manage pages. When you’re writing blocks to an array, you’re writing 4K chunks. But arrays don’t usually store those separately as 4K chunks. They put ‘em in pages. A page is just an internal construct. Every vendor has different ways they do this. The first thing that has to happen is you have to have a way to make your fundamental unit of storage variable size instead of fixed. That’s a big job, so they’re actually plumbing in…Nothing’s coming out in 2012, but in 2013, you’ll see Ocarina for Compellent first and then EqualLogic.” –Dell
So you see, it’s just not that easy.
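The "variable size instead of fixed" point in the quote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how EqualLogic, Compellent, or Ocarina work: `CompressedLog` is a hypothetical name, zlib stands in for whatever compressor a real array would use, and the append-only layout is just the simplest structure that tolerates variable-size units.

```python
import os
import zlib

BLOCK = 4096  # logical block size, matching the 4K chunks in the quote

class CompressedLog:
    """Toy append-only store where each 4K block occupies a variable-size extent."""

    def __init__(self):
        self.log = bytearray()   # back-end space
        self.extents = {}        # logical block address -> (offset, length)

    def write(self, lba, block):
        payload = zlib.compress(block)
        # An overwrite may compress to a different size than the old copy,
        # so it cannot simply go back into the old slot; append it and
        # repoint the extent map instead.
        self.extents[lba] = (len(self.log), len(payload))
        self.log += payload

    def read(self, lba):
        offset, length = self.extents[lba]
        return zlib.decompress(bytes(self.log[offset:offset + length]))

store = CompressedLog()
zeros = bytes(BLOCK)            # highly compressible
noise = os.urandom(BLOCK)       # essentially incompressible
store.write(0, zeros)
store.write(1, noise)
store.write(0, noise)           # overwrite: same address, very different size
print(store.extents)
```

Two logical blocks of the same size end up as extents of very different lengths, and rewriting block 0 moves it to a new offset. A layout built on fixed-size pages has no place to record either fact, which is why the mapping layer has to be rebuilt first.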
Let’s examine a quick history of dedupe retrofitting in the storage industry to see how all are faring…
- NetApp: NetApp was the first vendor to introduce primary storage deduplication, in their FAS systems in 2007, and this somewhat “started the clock” for the rest of the industry to catch them. NetApp had a big advantage in WAFL: it was a highly-virtualized storage operating system that already had many of the primitives around pointers and indirection required for adding dedupe. (Note that despite this early lead, NetApp’s dedupe remains post-process, relatively low performance, and at the 4K block size, but that’s another post for another day.)
- EMC: EMC acquired Data Domain in 2009 (and let’s not forget Avamar in 2006), giving it the leading backup deduplication technology in the market. Yet three years and many failed internal projects later, we have yet to see deduplication integrated into either of the flagship Symmetrix/VMAX or VNX/CLARiiON product lines (to be fair, Celerra/VNX does have file-level dedupe, but that’s a different ball of wax). One only has to look at the large chunk size (multi-MB for Symmetrix and GB for VNX) that EMC’s FAST tiering technology uses to move data between flash and disk to see the limitations of their metadata management.
- Dell: Acquired Ocarina in 2010 and has announced, on multiple occasions, an aggressive roadmap to get it integrated across the Dell portfolio. As you can see above, 2012 is out and 2013 is the “plan,” requiring a complete back-end rearchitecture.
- IBM: Acquired Storwize in 2010 (compression not deduplication), and recently announced availability on the V7000 platform. Compared to their competitors this is pretty fast, but IBM had two advantages: the V7000 is based on SVC, which is a highly-virtualized architecture, and compression is much easier to retrofit than dedupe, as it requires much less metadata overhead.
- 3PAR/HP: One would imagine that given the highly-virtualized architecture of 3PAR with their thin provisioning, adding deduplication/compression would be relatively easy. There have been rumors over the years, but nothing shipping on either the compression or deduplication front, and no announced plans.
- Violin Memory: Not a traditional storage player, but a newer flash vendor that recently announced it is also taking on the task of retrofitting dedupe to its flash arrays, through an OEM arrangement. While flash has some inherent speed advantages when it comes to dedupe, the hard part is still layering on the virtualization and metadata management, which is quite difficult if your flash array wasn’t designed for it. In particular, this has led Violin to adopt a post-process dedupe strategy where writes are committed to flash, re-read, deduped, and re-written…which misses one of the key advantages of dedupe in a flash array: write avoidance to extend the life of MLC (more on this in a coming blog post).
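The write-avoidance point can be illustrated with a simple count. The Python sketch below is a deliberately crude model, not a description of Violin's or anyone else's product: the function names are hypothetical, and it assumes the post-process pass re-writes each unique chunk once, following the "committed to flash, re-read, deduped, and re-written" sequence described above.

```python
import hashlib

def fingerprint(block):
    return hashlib.sha256(block).digest()

def inline_flash_writes(blocks):
    """Inline dedupe: a block reaches flash only if its content is new."""
    seen = set()
    writes = 0
    for block in blocks:
        fp = fingerprint(block)
        if fp not in seen:
            seen.add(fp)
            writes += 1
    return writes

def post_process_flash_writes(blocks):
    """Post-process dedupe: every block lands on flash first, then a
    background pass re-reads it and re-writes the unique chunks
    (assumed: one re-write per unique chunk)."""
    landed = len(blocks)
    unique = len({fingerprint(block) for block in blocks})
    return landed + unique

a, b = b"a" * 4096, b"b" * 4096
workload = [a, a, b, a]
print(inline_flash_writes(workload), post_process_flash_writes(workload))
```

For this four-block workload with two unique chunks, the inline path performs two flash writes while the post-process path performs six under the model's assumptions. The exact ratio depends on the workload and the implementation, but the direction is the point: post-process dedupe saves capacity without saving any program/erase cycles.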