Easily Visualize Your Dedupe and Compression Results with PureSize
[Hi, I'm Chris Golden, one of the Software Engineers here at Pure Storage. Among other things, I'm one of the engineers responsible for PureSize. Prior to joining Pure Storage, I was a Software Engineer at IronPort Systems / Cisco.]
Evaluating new technologies can be difficult, but not with Pure Storage. I’m going to show you how our tool, PureSize, makes it super simple to size your very own FlashArray.
Before we begin, you should read @PureLu’s excellent post on why everyone should run PureSize. It also includes sample output from PureSize, which I won’t reproduce here. Instead, I’m going to dive into Pure Storage’s distinct approach to data reduction and show how PureSize reveals the hidden reduction potential in your data.
The Dataset
PureSize scans your dataset and simulates all of the data reduction steps performed by Purity, the software powering the Pure Storage FlashArray. Once finished, PureSize produces a detailed report that shows you the data reduction you can expect.
To visualize the results of PureSize, we’re going to need a sample dataset. I’ve put together a very small (1024 sectors / 512KB) dataset to illustrate how the PureSize tool interprets and reduces data.
The original dataset is illustrated at right, with each sector colorized based on its data composition.
- Zeros (blue sectors labeled ‘0’) – 512-byte blocks filled with zeros.
- Patterns (orange sectors) – 512-byte blocks filled with a single repeating byte, e.g. ‘\xff’.
- Duplicate blocks (green sectors labeled ‘D’) – 512-byte blocks that contain identical data to another sector in the dataset.
- Unique blocks (red sectors labeled ‘U’) – 512-byte blocks that contain unique sequences of bytes.
If you’ve already run PureSize, this is the ‘Data + Zeros’ dataset.
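To make the four categories concrete, here is a minimal Python sketch of a sector classifier. It is not PureSize's code: the function names and the `make_sector` helper are hypothetical, and it follows the legend above literally, so every copy of a repeated sector is labeled a duplicate.

```python
SECTOR_SIZE = 512  # bytes, the block size used throughout this post


def classify_sectors(data: bytes) -> list[str]:
    """Label each 512-byte sector as 'zero', 'pattern', 'duplicate', or 'unique'."""
    sectors = [data[i:i + SECTOR_SIZE] for i in range(0, len(data), SECTOR_SIZE)]

    # Count how often each sector's contents occur anywhere in the dataset.
    counts: dict[bytes, int] = {}
    for sector in sectors:
        counts[sector] = counts.get(sector, 0) + 1

    labels = []
    for sector in sectors:
        if sector == bytes(len(sector)):
            labels.append("zero")        # filled with zeros
        elif len(set(sector)) == 1:
            labels.append("pattern")     # one repeating non-zero byte
        elif counts[sector] > 1:
            labels.append("duplicate")   # same contents as another sector
        else:
            labels.append("unique")
    return labels


def make_sector(seed: int) -> bytes:
    """Hypothetical helper: build a non-repeating 512-byte sector for the demo."""
    return bytes((seed + i) % 256 for i in range(SECTOR_SIZE))


sample = (bytes(SECTOR_SIZE) + b"\xff" * SECTOR_SIZE
          + make_sector(1) + make_sector(1) + make_sector(2))
labels = classify_sectors(sample)
print(labels)  # ['zero', 'pattern', 'duplicate', 'duplicate', 'unique']
```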
Why 512 Byte Block Size?
To achieve its amazing results, Purity uses a 512-byte block size. This gives Purity the ability to reduce data more effectively than other architectures. To the left, I’ve colorized the same dataset using a 4KB block size. As you can see, our sample dataset turned completely opaque.
It’s the difference between so-so data reduction and great data reduction.
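A small experiment shows why granularity matters. The sketch below is illustrative only, with hypothetical function names and a made-up dataset in which every 4KB block holds one unique sector followed by seven zero sectors. It counts the fraction of blocks that are either all zeros or repeats of an earlier block, first at 512 bytes and then at 4KB.

```python
def reducible_fraction(data: bytes, block_size: int) -> float:
    """Fraction of blocks that are all zeros or repeat an earlier block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    seen = set()
    reducible = 0
    for block in blocks:
        if block == bytes(len(block)) or block in seen:
            reducible += 1
        seen.add(block)
    return reducible / len(blocks)


def unique_sector(seed: int) -> bytes:
    """Hypothetical helper: a 512-byte sector whose contents depend on the seed."""
    return seed.to_bytes(4, "big") * 128


# Eight 4KB blocks, each made of 1 unique sector + 7 zero sectors.
dataset = b"".join(unique_sector(n) + bytes(7 * 512) for n in range(1, 9))

fine = reducible_fraction(dataset, 512)     # 56 of 64 sectors are zeros
coarse = reducible_fraction(dataset, 4096)  # every 4KB block looks unique
print(fine, coarse)  # 0.875 0.0
```

At 4KB, the single unique sector in each block hides the seven reducible sectors next to it.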
Thin Provisioning
Modern storage arrays, Pure Storage included, are thin provisioned, as not all client applications consume all storage allocated to them. Reclaiming the unused space for real data leads to significant savings.
In Purity, 1MB blocks filled with zeros are treated as unmapped data and therefore require zero storage overhead; PureSize applies 0 cost for thin provisioned blocks. For the purposes of illustration, I’m using a 16KB thin provisioned block size.
With Thin Provisioning, PureSize can remove 192 sectors from the physical storage cost of this dataset.
PureSize calls this the ‘Data-Only’ dataset.
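Here is a rough sketch of the thin provisioning step, using the 16KB illustrative block size from this post. Scanning aligned blocks is an assumption on my part, and the function name is hypothetical; PureSize's actual logic may differ.

```python
SECTOR_SIZE = 512
THIN_BLOCK_SIZE = 16 * 1024  # 16KB, the illustrative size used in this post


def thin_provisioned_sectors(data: bytes) -> int:
    """Count sectors inside fully zero-filled, aligned 16KB blocks."""
    zero_block = bytes(THIN_BLOCK_SIZE)
    removed = 0
    # Assumption: only whole, aligned blocks qualify for thin provisioning.
    for offset in range(0, len(data) - THIN_BLOCK_SIZE + 1, THIN_BLOCK_SIZE):
        if data[offset:offset + THIN_BLOCK_SIZE] == zero_block:
            removed += THIN_BLOCK_SIZE // SECTOR_SIZE
    return removed


# Three 16KB blocks: all zeros, zeros with one non-zero sector, all zeros.
dataset = (bytes(THIN_BLOCK_SIZE)
           + b"\x01" * SECTOR_SIZE + bytes(THIN_BLOCK_SIZE - SECTOR_SIZE)
           + bytes(THIN_BLOCK_SIZE))
removed = thin_provisioned_sectors(dataset)
print(removed)  # 64
```

Each 16KB block covers 32 sectors, so the 192 sectors removed from the sample dataset would correspond to six zero-filled blocks at this block size.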
Most storage arrays stop here, but we’re only just getting started.
Pattern Removal
Repeating patterns are common in many datasets. Databases and filesystems use repeating patterns to pad out data structures and reserve space. Patterns are also common in VM Memory images, where large ranges of memory are initialized with known patterns for correctness checks.
Purity reduces each pattern sector to a single metadata mapping, so PureSize counts the cost of each pattern sector as one metadata mapping.
Pattern detection allows PureSize to remove 179 sectors from the physical cost.
However, because pattern sectors require metadata mappings, PureSize adds a very small overhead cost for metadata. During this step, PureSize added 1 sector for overhead (bottom right of the image, you can click to zoom in).
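The accounting for this step can be sketched as follows. The size of a metadata mapping is not stated in this post, so `mapping_bytes=8` is purely a placeholder, and the function name is hypothetical. Treating any sector made of one repeating byte as a pattern is also an assumption.

```python
SECTOR_SIZE = 512


def pattern_cost(data: bytes, mapping_bytes: int = 8) -> tuple[int, float]:
    """Return (pattern sectors removed, metadata overhead in sectors).

    mapping_bytes is a placeholder value, not a documented Purity figure.
    """
    removed = 0
    for offset in range(0, len(data), SECTOR_SIZE):
        sector = data[offset:offset + SECTOR_SIZE]
        if len(set(sector)) == 1:  # a single repeating byte value
            removed += 1
    overhead = removed * mapping_bytes / SECTOR_SIZE
    return removed, overhead


# 64 sectors of 0xff followed by one sector with varied contents.
dataset = b"\xff" * (SECTOR_SIZE * 64) + bytes(range(256)) * 2
removed, overhead = pattern_cost(dataset)
print(removed, overhead)  # 64 1.0
```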
Duplicate Sector Removal
Datacenter consolidation results in duplicated data, particularly when storing VMs on the same array. In fact, our best data reduction success stories come from VDI deployments, where there are tens or hundreds of nearly identical machine images. Databases also dedupe well, particularly when storing multiple copies on the same array, as is common in test environments.
Purity uses inline deduplication to detect and reduce duplicated sectors to a single metadata mapping. When PureSize detects a sector that contains identical data to a previously seen sector, it reduces the cost of the newer sector to a single metadata mapping.
During this step, PureSize can remove 367 sectors from the physical cost. Our metadata has grown by 2.5 sectors to track the new mappings.
If you’re planning on putting multiple copies of your database or VM images on the same FlashArray, make sure to scan all copies in the same PureSize run. Your results will be more accurate.
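The sketch below shows the general idea of counting sectors that match a previously seen sector, and why scanning all copies together matters. Using SHA-256 as the fingerprint is my choice for the example, not a statement about what Purity or PureSize use, and the names are hypothetical.

```python
import hashlib

SECTOR_SIZE = 512


def count_duplicates(data: bytes) -> int:
    """Count sectors whose contents match a previously seen sector."""
    seen = set()
    duplicates = 0
    for offset in range(0, len(data), SECTOR_SIZE):
        # SHA-256 is an illustrative fingerprint, chosen for this sketch only.
        digest = hashlib.sha256(data[offset:offset + SECTOR_SIZE]).digest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates


# A made-up "image" of 10 distinct sectors.
image = b"".join(n.to_bytes(4, "big") * 128 for n in range(1, 11))

alone = count_duplicates(image)             # scanned by itself
together = count_duplicates(image + image)  # two copies in one scan
print(alone, together)  # 0 10
```

Scanned separately, neither copy shows any duplicates; scanned together, the entire second copy dedupes.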
Compression
Many applications, databases in particular, write data that’s highly compressible. However, users are loath to enable compression in their application, whether because of complexity, performance, or even licensing costs.
For sectors that cannot be removed via other data reduction techniques, Purity still has a trick up its sleeve: dynamic compression. At runtime, Purity has several compression algorithms to choose from and chooses the best algorithm based on a number of factors.
By default, PureSize samples 10% (configurable) of the unique sectors and compresses them to produce an average compression ratio, expressed as compressed size divided by original size. PureSize then estimates the cost of each unique sector to be (512 bytes * average compression ratio) + metadata overhead.
Here, PureSize subtracts 214.5 sectors from the physical cost and adds 4 sectors for metadata.
If you’re thinking of putting multiple applications onto the same FlashArray, make sure to scan all applications’ data in the same PureSize run. Compression ratios can vary greatly from application to application.
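Here is a minimal sketch of the sampling estimate described in this section. It uses zlib as a stand-in compressor, since Purity's algorithms are not named here, and `mapping_bytes=8` is a placeholder for the metadata overhead. All names are hypothetical.

```python
import random
import zlib

SECTOR_SIZE = 512


def estimate_compressed_cost(unique_sectors, sample_fraction=0.10,
                             mapping_bytes=8, seed=0):
    """Return (average compression ratio, estimated total bytes).

    zlib and mapping_bytes are stand-ins chosen for this sketch.
    """
    if not unique_sectors:
        return 1.0, 0.0
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sample_size = max(1, round(len(unique_sectors) * sample_fraction))
    sample = rng.sample(unique_sectors, sample_size)
    # Ratio = compressed size / original size, capped at 1.0 because a
    # sector that does not compress would be stored as-is.
    ratios = [min(1.0, len(zlib.compress(s)) / len(s)) for s in sample]
    avg_ratio = sum(ratios) / len(ratios)
    per_sector = SECTOR_SIZE * avg_ratio + mapping_bytes
    return avg_ratio, per_sector * len(unique_sectors)


# 100 made-up, text-like sectors that are distinct but compress well.
sectors = [(f"row {n:06d} status=OK;".encode() * 32)[:SECTOR_SIZE]
           for n in range(100)]
avg_ratio, total_bytes = estimate_compressed_cost(sectors)
print(f"ratio={avg_ratio:.2f}, estimated bytes={total_bytes:.0f}")
```

Because the ratio comes from a sample, mixing applications with very different data in one scan changes the average, which is why scanning everything together gives a more realistic number.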
RAID-3D, FlashCare
The final step, simulating Pure Storage’s resilience techniques, adds overhead for RAID-3D. Like compression, Purity’s RAID-3D is dynamic, so PureSize uses a conservative 28% overhead on the remaining unique data and metadata.
FlashCare is Pure Storage’s proprietary method for getting the best performance and lifetime out of SSDs. FlashCare decides, based on performance, age, and several other factors, where to write each piece of data. PureSize does not have any output related to FlashCare, as FlashCare does not actually alter the physical size of the dataset. However, I’ve shown how FlashCare might rearrange our sample dataset for flash; this is how Purity would persist our sample dataset.
Including RAID-3D, PureSize determines the expected cost of our dataset to be 99 sectors. Written to a Pure Storage FlashArray, our example data would achieve an 11.8 to 1 “Data-Only” reduction ratio and a 12.8 to 1 “Data+Zeros” reduction ratio.
What’s Your Number?
Armed with your data reduction number, sizing your own FlashArray couldn’t be easier. Simply multiply your data reduction number by the raw capacity of the array and you have your expected capacity. For example, if your data reduction number is 5 to 1 and you bought an 11TB FlashArray, then you’ll get 55TB of usable space.
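The arithmetic is a single multiplication, shown here with the numbers from the example above (the function name is just for illustration):

```python
def expected_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Expected capacity = raw capacity x data reduction number."""
    return raw_tb * reduction_ratio


capacity = expected_capacity_tb(raw_tb=11, reduction_ratio=5)
print(capacity)  # 55
```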
Now go, download PureSize and learn your number!
Questions? Proud of your results? Leave a comment or find me on Twitter.




