The analyst industry is telling us that growth in unstructured data is going to outpace that of transactional data. "While transactional data is still projected to grow at a compound annual growth rate of 21.8%, it’s far outpaced by a 61.7% CAGR predicted for unstructured data in traditional data centers." You don't have to look far past your own exploding data consumption to realise this is becoming a large problem for IT departments. Combined with this growth is our desire to keep more aged data online, in order to provide much faster retrieval.
What is one to do? Well, a company called Ocarina Networks says they "make free space on storage you already have" through some very clever content-aware compression and de-duplication. The key element here is that it works on your online storage, so the savings are multiplied: there are flow-on effects to the transmission of your data over networks and to the amount of data that you need to back up. So even though companies like NetApp (which Ocarina say they are 57x better than) and Data Domain do de-dupe, it's only at the underlying storage, without these possible secondary benefits.
A quick look at just three of the people involved in Ocarina gives you a good impression that they have the pedigree to achieve great things here. Their CEO, Murli Thirumale, and CTO, Goutham Rao, hail from the same roles in the Citrix Advanced Solutions Group, where they led the SSL-VPN division (acquired via Net6). In those roles they took their technology to number one in unit market share in eighteen months. The Chief Scientist, Dr Matt Mahoney, is a thought leader in next-generation data compression. As a company they have also been very busy creating some interesting patents.
Last week at the Gestalt IT Field Day I got a deep dive into the Ocarina technology. Here is a video I took of Goutham and Murli.
The insights from these guys on the science of de-dupe and compression were very informative, so let's look at what they had to say in more detail.
There are two approaches to compressing data, either a dictionary or a statistical approach. A dictionary encoder approach, such as the LZ algorithm, "operate[s] by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary') maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure."
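To make the dictionary idea concrete, here is a minimal LZ78-style encoder in Python. It is a teaching sketch under my own assumptions (the function names are made up, and it works on text rather than raw bytes), not what Ocarina or any shipping product actually runs.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next character) pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current match
        else:
            # Emit a reference to the longest known phrase plus the new character.
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a trailing match: its prefix is always already in the dictionary.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


sample = "abababababab"
encoded = lz78_encode(sample)
print(len(sample), "characters ->", len(encoded), "pairs")
```

The decoder never receives the dictionary; it rebuilds it from the pairs, which is why only the references need to be stored.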
The statistical approach is much more interesting. If you can predict what is coming next in a data series, you don't need to record it; you only need to record the things you did not expect (this is what takes up the space). As long as you run the same prediction algorithm when expanding and apply the recorded exceptions, you get exactly the same data (or file) back whilst only saving a very small part of it. You can also have a feedback loop from the errors back into the input to improve the prediction. For example, if you look at a photo of the room you are sitting in now, there are probably lots of borders, edge-framed objects, walls and so on. If you turned all of these edges into axes and followed an axis of colour moving down the edge of the wall, you can expect that the next element moving down will be more of that same edge; you only need to record something when it's not. Complex, but you can do some clever things with the right algorithms [more on that shortly].
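The "only record the surprises" idea can be sketched in a few lines of Python. The predictor here is deliberately naive (it assumes the next value repeats the previous one) and all the names are hypothetical; real statistical compressors use far richer models and feed the errors back in.

```python
def predict_next(history):
    """Hypothetical predictor: guess that the next value repeats the last one."""
    return history[-1] if history else 0


def record_exceptions(values):
    """Keep only the positions where the predictor guessed wrong."""
    exceptions = {}
    history = []
    for position, actual in enumerate(values):
        if predict_next(history) != actual:
            exceptions[position] = actual
        history.append(actual)
    return len(values), exceptions


def rebuild(length, exceptions):
    """Replay the same predictor, overriding it wherever an exception was stored."""
    history = []
    for position in range(length):
        history.append(exceptions.get(position, predict_next(history)))
    return history


# One column of pixels running down a wall (value 200) and then an edge (value 35).
column = [200] * 50 + [35] * 50
length, exceptions = record_exceptions(column)
print(exceptions)  # {0: 200, 50: 35}
```

A hundred values collapse to two exceptions, and `rebuild(length, exceptions)` returns the original column because the expander runs the identical predictor.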
Compression is something you can only do on a single file. As mentioned, the key to compression is predicting what the next value is going to be in an incoming stream of data. The more data you have available in the incoming data stream, the better you may be able to predict the next value. Also note that a lot of the file types being generated today are already compressed internally, such as JPEG images, either by themselves or embedded inside other documents.
De-dupe is all about finding similar chunks of data by comparing hash values or fingerprints. The smaller the chunks you are comparing the better, because it increases the likelihood of a match between the two. Dividing the data into fixed chunks will get you so far, but unless you have really small chunks you can miss a match that might occur across the boundary of two chunks. NetApp de-dupe does it this way. To get maximum effect you need what is called a sliding chunk window, looking for a matching bit of data anywhere, yet this is computationally expensive as you have to calculate a lot more hash values. There is a risk that two different chunks may produce the same hash or fingerprint, a false positive. Typical hashing algorithms are MD5, which is very weak, or SHA-256, which is strong, but Rabin [http://en.wikipedia.org/wiki/Rabin_fingerprint] is most liked [it's fast to implement in software and works well on sliding windows]. How does all this comparing of chunks save you data? When you find a duplicate chunk you don't need to save a second copy; you can just save a small reference to the original piece of data you already have. Some technologies, such as Microsoft's Windows Storage Server 2008, do single instance storage (de-dupe) by only comparing whole files, which is a bit of a joke really. It's not going to get you much saving, because these days we create so many copies of the same files which are only slightly different (we add a few words to a document but save it under a new file name), or there are a lot of repetitive elements across files (images and templates). Yet this technique is really easy to do. Lastly, not all data can be de-duped; some just has very little repetition, if any.
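Here is a small Python sketch of why fixed chunks miss matches and a sliding window catches them. It is an illustration under my own assumptions, not NetApp's or Ocarina's implementation: the rolling hash is a simple polynomial hash standing in for a true Rabin fingerprint, and the chunk sizes, `BASE`, `MOD` and function names are all made up.

```python
import hashlib

BASE = 257
MOD = (1 << 31) - 1


def fixed_chunks(data, size=64):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sliding_chunks(data, window=16, mask=0x3F):
    """Cut wherever a rolling hash of the last `window` bytes ends in zero bits."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte about to leave the window
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD  # take in the newest byte
        if (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_bytes(old_chunks, new_chunks):
    """Bytes in new_chunks that de-dupe against a chunk that is already stored."""
    stored = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in stored)


# 4 KB of pseudo-random "file" data, then the same file with one byte added in front.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
edited = b"X" + original

print("fixed:  ", shared_bytes(fixed_chunks(original), fixed_chunks(edited)))
print("sliding:", shared_bytes(sliding_chunks(original), sliding_chunks(edited)))
```

The single inserted byte shifts every fixed chunk boundary, so nothing lines up any more. The sliding version chooses its boundaries from the content itself, so after the first cut the chunks fall back into step with the original.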
Now it also matters what you are de-duping: is it data moving over a network, a backup or your storage? Each of these has a different "window" of time that it is looking at. On a network transfer you don't have much of a window and the data in that short window may not be very repetitive, whereas a backup has a very long window with repeated cycles of data coming in that is probably very repetitive. These different characteristics of the data stream require different algorithms to achieve the greatest efficiencies.
Compression does not preclude de-dupe but they do pull against one another. For example, as mentioned earlier, a lot of data is already compressed, and compressed data removes just about any chance of finding duplicate chunks of data. If you are a photo storing site you probably want to turn de-dupe off and not waste all the effort. Likewise in a corporate environment you may have millions of occurrences of your company logo image, but they are all compressed and embedded inside Word and PowerPoint files that are then also compressed. All that repetitive data has been obfuscated! Remember, all that growth in storage is in this unstructured data area.
Yet you want both de-dupe and compression, because even after de-dupe there is always data you need to save, so compress it.
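The obfuscation effect is easy to see with a toy experiment in Python. Everything here is invented for illustration: the "logo" is just 2 KB of pseudo-random four-colour pixels, the two "documents" are plain byte strings rather than real Office files, and zlib stands in for whatever compression the file format applies.

```python
import hashlib
import zlib

WINDOW = 32  # compare every 32-byte run, a crude stand-in for sliding-window de-dupe


def windows(data):
    """All distinct WINDOW-byte runs found anywhere in data."""
    return {data[i:i + WINDOW] for i in range(len(data) - WINDOW + 1)}


def shared_windows(a, b):
    """How many distinct runs appear in both a and b."""
    return len(windows(a) & windows(b))


# A made-up company logo: 2048 pixels drawn from a four-colour palette.
noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
logo = bytes(value % 4 for value in noise)

# Two different documents that both embed the same logo.
doc_a = b"Quarterly report. " * 20 + logo
doc_b = b"Minutes of the storage team meeting. " * 20 + logo

packed_a = zlib.compress(doc_a)
packed_b = zlib.compress(doc_b)

print("raw documents share       ", shared_windows(doc_a, doc_b), "runs")
print("compressed documents share", shared_windows(packed_a, packed_b), "runs")
```

In the raw documents the logo shows up as thousands of identical runs. Once each document is compressed as a whole, the logo's bits depend on what came before it, so I would expect few if any runs to survive for a de-dupe engine to find.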
So given this primer, what do Ocarina do? Well, Ocarina find the optimal chunk size for everything, compression and de-dupe, by performing object chunking. They take all of the data and break it into objects, so a zip file is broken down into its multiple files and a Word document may be broken down into images and text. The actions then occur at the object level. Hence a JPEG would not be broken down into smaller chunks, as the best window size to compress or de-dupe a JPEG is the whole image.
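Object chunking can be approximated with nothing more than Python's standard library. This is only my sketch of the idea, with hypothetical function and file names; it handles a single level of zip container (modern Office documents such as .docx are zip containers too) and says nothing about how Ocarina actually parses formats.

```python
import hashlib
import io
import zipfile


def objects_in(blob):
    """Yield (name, bytes) for each object in a zip container, else the blob itself."""
    if zipfile.is_zipfile(io.BytesIO(blob)):
        with zipfile.ZipFile(io.BytesIO(blob)) as archive:
            for name in archive.namelist():
                yield name, archive.read(name)  # read() returns the expanded object
    else:
        yield "<whole file>", blob


def fingerprints(blob):
    """Fingerprint every object so duplicates can be spotted across containers."""
    return {hashlib.sha256(data).hexdigest() for _, data in objects_in(blob)}


def make_zip(members):
    """Helper for the demo: build a compressed zip in memory from {name: bytes}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


logo = b"pretend these bytes are the company logo " * 10
report = make_zip({"text.xml": b"Q3 results", "logo.png": logo})
slides = make_zip({"slide1.xml": b"Roadmap", "logo.png": logo})

duplicates = fingerprints(report) & fingerprints(slides)
print(len(duplicates), "object(s) shared between the two files")
```

Whole-file comparison sees two unrelated files, but at the object level the embedded logo has the same fingerprint in both and only needs to be stored once.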
Going beyond the object-based chunking, Ocarina then use a neural network to determine the best compression algorithm for this particular type of chunk; in fact they have over 120 different algorithms. There are even different algorithms for variations of the same object, such as for a small versus a large JPEG. Their algorithms range from plain text to gene sequences. For images they have some very smart algorithms that perform spatial optimization based on what your eye can see, using chrominance and luminance. A typical scenario helps to show the power of this. If you have the same photo at different sizes, or if you slightly adjust a photo (such as removing the red eye), the data on the disk is all very different and there is probably no repetition across them. However, because Ocarina can "look" at the image, it is able to determine that they are all in fact the same photo.
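To give a feel for per-object algorithm selection, here is a much cruder stand-in in Python. To be clear about the assumptions: it simply tries every codec and keeps the smallest result rather than using a neural network to predict the winner, and the three standard-library codecs plus a "store" option are substitutes for Ocarina's own algorithms, which I have not seen.

```python
import bz2
import lzma
import zlib

# name -> (compress, expand); "store" keeps objects that are already compressed as-is
CODECS = {
    "store": (lambda data: data, lambda data: data),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def shrink(obj):
    """Try every codec on this object and keep whichever output is smallest."""
    results = {name: pack(obj) for name, (pack, _) in CODECS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


def expand(name, packed):
    """Reverse shrink() using the codec name that was recorded alongside the data."""
    return CODECS[name][1](packed)


text_object = b"unstructured data keeps growing, " * 100
codec, packed = shrink(text_object)
print(codec, len(text_object), "->", len(packed))
```

Trying everything is affordable with four codecs but not with 120, which is presumably why a classifier that picks the likely winner up front is attractive.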
How does all of this work? Well, an appliance accesses your storage and processes the data. It breaks files down into their objects, weaves its magic and puts the smaller shrunk version back. This all occurs in RAM. To be safe, before it replaces the file it compares the original file with an expansion of the shrunk file to ensure they match exactly, so there are no errors. Of course the files on the storage are now different, so you need to use the ECOreader (a file system filter driver), which expands the files in real time as they are read so you get them back in their original format. Sometimes you may want to read the shrunk file and not expand it, for example if you want to transmit it over a network (replication) or for backup. The software can be integrated into storage to make it all transparent to the user. Performance when reading and expanding is on par for de-dupe; for compression it's dependent on the method, but it usually takes about as long to uncompress as it did to compress. Essentially you are making an economic tradeoff, consuming compute cycles for disk capacity gains.
Having reviewed all of this, organisations that have to store, transmit and back up large amounts of unstructured data could benefit a lot from the Ocarina technologies, especially those whose data the Ocarina algorithms work well on. From speaking to them, they are working hard on new and improved algorithms but, just as importantly, on how to make the technology solution work well.
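The shrink, verify, replace and expand-on-read cycle described above can be sketched in Python as follows. This is purely illustrative: the `MAGIC` marker, the file naming and the use of zlib are my own assumptions, and a real reader such as the ECOreader sits in the file system as a filter driver rather than being a function you call.

```python
import os
import zlib

MAGIC = b"SHRUNK1\n"  # hypothetical marker, not Ocarina's on-disk format


def shrink_file(path):
    """Shrink a file in place, but only after proving it expands back exactly."""
    with open(path, "rb") as f:
        original = f.read()  # the work happens in memory
    packed = MAGIC + zlib.compress(original, 9)
    # Safety check: expand the shrunk copy and compare it with the original.
    if zlib.decompress(packed[len(MAGIC):]) != original:
        raise RuntimeError("round trip mismatch, leaving the original untouched")
    if len(packed) >= len(original):
        return False  # no saving to be had, keep the original
    temp_path = path + ".shrinking"
    with open(temp_path, "wb") as f:
        f.write(packed)
    os.replace(temp_path, path)  # swap the shrunk version into place
    return True


def read_file(path, expand=True):
    """Return the original bytes, or the shrunk bytes if expand is False."""
    with open(path, "rb") as f:
        data = f.read()
    if expand and data.startswith(MAGIC):
        return zlib.decompress(data[len(MAGIC):])
    return data
```

Calling `read_file(path, expand=False)` corresponds to the replication and backup case, where you want to move the shrunk bytes around without paying to expand them first.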
You can find more details about the products on the web site http://www.ocarinanetworks.com/