Location, time based or magical storage?

Storage comes in many forms, and over the years multiple strategies to store and retrieve data on disk have been implemented. For Open vStorage, an I/O is a write or read operation on an LBA (Logical Block Address) of a Virtual Machine. Let’s first take a theoretical look at the three most important strategies to store data, with their benefits and drawbacks:

  • Location-based storage:
    Location-based storage stores the exact location where the data is placed. This means that, for each address, the metadata holds the exact location in the storage system where the actual value is stored. The advantage of this strategy is that it is very fast for read operations: you know the exact location of the data, even if the data changes frequently. The drawback is that you don’t have a history: when an address gets overwritten, the location of the old value is lost, as the address now points to the location of the new data. You can find this strategy in most storage solutions, such as SANs.
  • Time-based storage:
    Time-based storage uses time to identify when data was written. The easiest way to achieve this is a log-type approach, in which all data writes are appended to the log as a sequence. Whenever you have new data to write, instead of finding a suitable location, you simply append it to the end of the log. The advantage is that you always have the complete history of the volume (all writes are appended), and snapshots are very easy to implement by ending the current log and starting a new log file. The drawback is that after a while data gets spread across different log files (snapshots), so reads become slower. To find the latest value written for an address, you need to go from the last log file back to the first to identify the last time that address was written. This can be a very time-consuming process if the data was written a long time ago. A second problem is that the always-append strategy can’t be followed indefinitely. A garbage collection process must be available to reclaim the space of data which is no longer needed, for example because it is outside the retention period.
  • Content-addressable storage (CAS):
    With CAS each write gets labeled with an identifier, in most cases a hash. The hash value is calculated from the content of the stored information. The hash and the data are stored as a key/value pair in the storage system. To find data back in a CAS system, you look up the hash in the metadata and go through the hash table. When the hash matches, the data can be found behind that hash key. One of the reasons why hashing is used is to make sure objects are only stored once. Hence this strategy is often used in storage solutions which offer deduplication. However, CAS can’t be used to efficiently store writes or large amounts of data, and it is only usable when data doesn’t change frequently, as keeping the hashes sorted adds overhead. That is why it is mostly used in caching strategies, where data doesn’t change very often.
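
To make the trade-offs above concrete, here is a minimal in-memory sketch of the three strategies. It is a toy model, not how any real storage system (or Open vStorage) is implemented: the class names are hypothetical, and SHA-256 is an assumption standing in for the unspecified hash.

```python
import hashlib


class LocationStore:
    """Location-based: metadata maps each address to one physical slot."""

    def __init__(self):
        self.slots = []   # "physical" storage
        self.meta = {}    # address -> slot index

    def write(self, addr, data):
        if addr in self.meta:
            # Overwrite in place: the old value is gone, so no history.
            self.slots[self.meta[addr]] = data
        else:
            self.meta[addr] = len(self.slots)
            self.slots.append(data)

    def read(self, addr):
        return self.slots[self.meta[addr]]   # one direct lookup


class LogStore:
    """Time-based: every write is appended; a snapshot closes the current log."""

    def __init__(self):
        self.logs = [[]]

    def write(self, addr, data):
        self.logs[-1].append((addr, data))

    def snapshot(self):
        self.logs.append([])          # end the current log, start a new one
        return len(self.logs) - 2     # index of the log that was just closed

    def read(self, addr, upto=None):
        # Walk from the newest log back to the oldest: slow for old data.
        logs = self.logs if upto is None else self.logs[: upto + 1]
        for log in reversed(logs):
            for a, d in reversed(log):
                if a == addr:
                    return d
        raise KeyError(addr)


class ContentStore:
    """CAS: data is keyed by a hash of its content, so duplicates are stored once."""

    def __init__(self):
        self.blobs = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)
        return key

    def get(self, key):
        return self.blobs[key]
```

Writing the same address twice leaves a single slot in `LocationStore`, while `LogStore` keeps both values and can still serve the older one through a snapshot index. Putting the same bytes twice into `ContentStore` returns the same key and stores one copy.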

So far the theory, but which strategy does Open vStorage use? When we designed Open vStorage, we wanted storage that has great read and write performance and keeps a history of the volumes, so that easy snapshots, zero-copy cloning and other features come out of the box. Picking a single one of the above strategies was not an option, as all of them have benefits but, more importantly, drawbacks. That is why Open vStorage combines all of them: the benefits of one strategy are used to counterbalance the drawbacks of another. To achieve great performance, Open vStorage uses the SSDs or PCIe flash cards inside the host for caching. The read cache is implemented as a CAS, as this offers deduplication and great performance for frequently consulted data. The write cache is implemented using a location-based approach. The storage backend is implemented using a time-based approach, aggregating writes which occurred together in time. This approach gives us features like unlimited zero-copy snapshots, cloning, and easy replication.


Before we can start with a deep dive, we need to explain how the basic write transaction is implemented. Open vStorage uses a time-aggregated, log-based approach for all writes. When a write is received, the 4k block is appended to a file, a Storage Container Object (SCO). As soon as the SCO reaches 4MB, a new file is created. Meanwhile, for every write, the address, the location (a combination of the SCO name and the offset within the SCO) and a hash of the data are saved in a transaction log (TLOG). These SCOs and TLOGs are stored on the storage backend when they are no longer required on the SSD inside the host.
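
The write transaction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Open vStorage code: `Volume`, `TlogEntry`, the `sco_` naming scheme and the use of SHA-256 are all hypothetical, and SCOs are kept in memory instead of in files on SSD.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4 * 1024          # 4k blocks, as in the text
SCO_SIZE = 4 * 1024 * 1024     # a new SCO is started once this size is reached


@dataclass
class TlogEntry:
    address: int     # logical block address that was written
    sco_name: str    # which SCO holds the data
    offset: int      # byte offset of the block inside that SCO
    digest: str      # hash of the block content (SHA-256 is an assumption)


@dataclass
class Volume:
    scos: dict = field(default_factory=dict)   # SCO name -> bytearray
    tlog: list = field(default_factory=list)   # ordered list of TlogEntry
    current: int = 0                           # number of the SCO being filled

    def write(self, address, block):
        if len(block) != BLOCK_SIZE:
            raise ValueError("expected a 4k block")
        name = f"sco_{self.current:08d}"        # hypothetical naming scheme
        sco = self.scos.setdefault(name, bytearray())
        offset = len(sco)
        sco.extend(block)                       # always append, never overwrite
        entry = TlogEntry(address, name, offset,
                          hashlib.sha256(block).hexdigest())
        self.tlog.append(entry)
        if len(sco) >= SCO_SIZE:
            self.current += 1                   # next write opens a new SCO
        return entry
```

With these sizes one SCO holds 1024 blocks, so the 1025th write lands at offset 0 of a second SCO, whatever addresses were written. Random writes from the Virtual Machine therefore become sequential appends.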


Let’s now put everything together …

  • Location-based storage
    The write cache works as a transaction-log-based cache on fast, redundant flash or SSD storage. In this transaction log we store the address, the location and a hash. The actual write cache is built by filling up Storage Container Objects (SCOs), files containing a sequence of 4k blocks, which turns any random write I/O behavior into a sequential write operation. During each write, the address of the 4k block, the hash, the SCO number and the offset are stored as metadata in the metadata lookup database. As the metadata contains the exact location of the data for an address (the SCO and the offset within that SCO), it is evident that this is a location-based approach. But why do we also store a hash?
  • Content-addressable storage
    When a read request comes in, the Storage Router looks up the hash in the metadata, which contains the latest state of the volume for each address, and checks whether that hash is available in the read cache. This read cache is a CAS (content-addressable storage) and stores hash/value combinations on SSD or flash storage. If the hash exists, the read request is served directly from the SSD or flash storage, resulting in very fast read I/O operations. Since most of the reads will be served from the cache, the content of this cache doesn’t change very often, so we don’t pay a large penalty to maintain the hash table. Moreover, hashing lets us make even better use of the SSD, as it allows content-based deduplication. In case the data is not in the read cache but in the write cache, because it was only recently written, we can still retrieve it quickly, as the metadata also stores the exact SCO it is in and the offset within that SCO.
    In case the requested address is in neither the read cache nor the write cache, we need to go to the storage backend, which is time-based.
  • Time-based storage
    The Storage Router writes or reads the data using SCOs and transaction logs when it communicates with the backend. Because data writes are added to the SCOs in a log-structured, append-only way, data which needs to be evacuated from the write cache is pushed as an object (a SCO) to the storage backend. Next to the SCOs, the transaction logs, which contain the sequence of the writes with the address, the location (SCO and offset) and the hash, are also stored on the backend. The combination of the always-append strategy and the address means we have a complete history of all writes done to the volume.
    The benefit of this time-based approach is that it gives us enterprise features like zero-copy snapshots and cloning. Time-based storage also requires maintenance to compact older SCOs or clean up deleted snapshots. Because all transaction logs and SCOs are stored on the backend, maintenance tasks can be offloaded completely from the Storage Router on the host. The Scrubber, a process that does the maintenance of the time-based storage, can work totally independently of the Storage Router, as it has access to all transaction logs and SCOs stored on the backend storage. Once the Scrubber has finished, it creates an updated set of transaction logs, which the Storage Router uses to update the local metadata and to delete the obsolete SCOs on the backend. Thanks to the caching in the Storage Router, the maintenance work does not impact performance, because most read and write I/O requests are served by the read and write cache.
    In the event of a disaster where the complete volume is lost, the volume can be rebuilt on another host from the storage backend alone. The only thing needed to get the volume back to its latest state is to fetch all transaction logs from the backend and replay them, so that the metadata contains the latest location of the data for each address. When a read request comes in, you only need to fetch the correct SCO from the backend and put the data in the read cache for quick access.
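
Putting the three layers together, the read path and the disaster-recovery replay can be sketched as below. This is a simplified model under assumptions: transaction log entries are plain `(address, sco, offset, digest)` tuples, the caches and the backend are dictionaries, and the function names `replay` and `read` are hypothetical. Populating the read cache only after a backend fetch is also an assumption of this sketch.

```python
def replay(tlogs):
    """Rebuild address -> (sco, offset, digest) by replaying TLOGs oldest first."""
    metadata = {}
    for tlog in tlogs:
        for address, sco, offset, digest in tlog:
            # Later entries overwrite earlier ones, so the metadata ends up
            # holding the latest location for every address.
            metadata[address] = (sco, offset, digest)
    return metadata


def read(address, metadata, read_cache, write_cache, backend, block_size=4096):
    """Return (block, source), trying read cache, write cache, then backend."""
    sco, offset, digest = metadata[address]

    # 1. Content-addressable read cache: keyed by the hash of the block.
    if digest in read_cache:
        return read_cache[digest], "read cache"

    # 2. Location-based write cache: the metadata gives SCO and offset.
    if sco in write_cache:
        blob = write_cache[sco]
        return bytes(blob[offset:offset + block_size]), "write cache"

    # 3. Time-based backend: fetch the SCO, then keep the block in the
    #    read cache for quick access next time.
    blob = backend[sco]
    block = bytes(blob[offset:offset + block_size])
    read_cache[digest] = block
    return block, "backend"
```

After a host is lost, calling `replay` on the transaction logs fetched from the backend is enough to answer reads again: the first read of an address goes to the backend, and repeated reads of the same content are then served from the read cache.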

Open vStorage uses different approaches when data is stored and read. On the frontend, on the SSDs inside the host, we use a content-based read cache which offers performance and deduplication across volumes. The write cache makes sure that data is written quickly in an always-append mode. A location-based approach is used for this cache, so a miss in the read cache can be quickly covered by the write cache if the data is recent. When data is no longer needed in the write cache, it gets pushed in a time-aggregated fashion (as a SCO) to the backend. When this happens, the transaction logs are also pushed to the backend. As the backend is implemented using a time-based approach, snapshots, zero-copy cloning and easy replication come out of the box.