Today Open vStorage marks its first birthday. Exactly 1 year ago the first commit was made for a fun, exciting and very hot Virtual Machine storage project. In the past year we have accomplished a lot. Just have a look at the released features: an OpenStack Cinder plugin, support for VMware and KVM, support for various storage backends, SNMP monitoring, support for Virtual Machine HA, …
But there is more: according to Open HUB, Open vStorage has a young but established codebase, maintained by a large development team with stable year-over-year commits.
Some statistics of the past year:
- Open vStorage consists of more than 70,000 lines of code
- In 1 year more than 1,185 commits were made
- It took an estimated 13 years of effort so far (COCOMO model)
We would also like to thank everyone who supported us: the many (brave) people who downloaded Open vStorage to give it a go, the people who gave feedback and raised bugs, and the people who reached out to see how we could work together. We would like to thank each of you individually, but there are too many of you!
Wim Provoost, Product Manager of Open vStorage, was asked to host a live webinar for OpenStack Online Meetup. OpenStack Online Meetup is the #1 online community for OpenStack contributors and users. During their weekly Google Hangout sessions they feature various OpenStack speakers. The sessions are intended for technical OpenStackers and tend to deep-dive into technology.
The Open vStorage talk discusses how you can use OpenStack Swift, the Object Storage project within OpenStack, as primary storage for a Virtual Machine environment. In case you missed the live session, you can watch the recorded version below.
Storage comes in many forms, and over the years multiple strategies to store and retrieve data on disk have been implemented. For Open vStorage, I/O means a write or read operation on an LBA (Logical Block Address) of a Virtual Machine. Let’s first take a theoretical look at the 3 most important strategies to store data, their benefits and their drawbacks:
- Location-based storage:
Location-based storage stores the exact location where the data is placed. This means that for each address the metadata stores the exact location in the storage system where the actual value is stored. The advantage of this strategy is that it is very fast for read operations, as you know the exact location of the data even if the data changes frequently. The drawback is that you don’t have a history: when an address gets overwritten, the location of the old value is lost, as the address now points at the location of the new data. You can find this strategy in most storage solutions, such as SANs.
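The idea can be sketched in a few lines of Python. This is a toy model; the names and structure are our own, not Open vStorage internals:

```python
class LocationBasedStore:
    """Toy location-based store: metadata maps each logical block
    address (LBA) to the physical location of the newest data."""

    def __init__(self):
        self.metadata = {}  # LBA -> physical slot
        self.disk = {}      # physical slot -> data
        self.next_slot = 0

    def write(self, lba, data):
        slot = self.next_slot
        self.next_slot += 1
        self.disk[slot] = data
        # Overwriting an LBA replaces the pointer; the old value's
        # location is no longer reachable, so there is no history.
        self.metadata[lba] = slot

    def read(self, lba):
        # A read is a single lookup: metadata gives the exact location.
        return self.disk[self.metadata[lba]]
```

Note how a second write to the same LBA silently discards the pointer to the old value, which is exactly the missing-history drawback described above.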
- Time-based storage:
Time-based storage uses time to identify when data was written. The easiest way to achieve this is a log-structured approach: every write is appended to the end of a log as part of a sequence. Whenever you have new data to write, instead of finding a suitable location, you simply append it to the end of the log. The advantage is that you always have the complete history of the volume (all writes are appended), and snapshots are very easy to implement by closing the log and starting a new log file. The drawback is that after a while data gets spread across different log files (snapshots), so reads become slower. To find the latest value written for an address, you need to walk from the last log file back to the first to identify the last time the address was written. This can be a very time-consuming process if the data was written a long time ago. A second problem is that the always-append strategy can’t be followed indefinitely. A garbage collection process must be available to reclaim the space of data which is no longer needed, for example because it is out of the retention period.
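The same toy-model treatment works for a time-based store (again an illustrative sketch, not production code): writes are always appended, a snapshot closes the current log, and a read scans the logs from newest to oldest.

```python
class TimeBasedStore:
    """Toy log-structured store: a list of logs, each log a list of
    (lba, data) entries in write order."""

    def __init__(self):
        self.logs = [[]]

    def write(self, lba, data):
        # Always append; never overwrite in place.
        self.logs[-1].append((lba, data))

    def snapshot(self):
        # A snapshot simply closes the current log and starts a new one.
        self.logs.append([])

    def read(self, lba):
        # Walk logs newest-first, and entries newest-first within a log,
        # to find the last value written for this address.
        for log in reversed(self.logs):
            for entry_lba, data in reversed(log):
                if entry_lba == lba:
                    return data
        raise KeyError(lba)
```

The `read` loop makes the drawback concrete: the older the data, the more log files must be scanned before the address is found.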
- Content-addressable storage (CAS):
With CAS each write gets labeled with an identifier, in most cases a hash. The hash value is calculated from the content of the stored information. The hash and the data are stored as a key/value pair in the storage system. To find data back in a CAS system, you look up the hash for the address in the metadata and consult the hash table. When the hash matches, the data can be found behind that hash key. One of the reasons why hashing is used is to make sure objects are only stored once. Hence this strategy is often used in storage solutions which offer deduplication. But CAS can’t efficiently store writes or large amounts of data, and is only usable when data doesn’t change frequently, as keeping the hashes sorted requires some overhead. That is why it is mostly used in caching strategies, where the cached data doesn’t change very often.
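A minimal CAS sketch (hypothetical names, with SHA-1 standing in for whatever hash a real system would use) shows why deduplication comes for free: identical content always produces the same key.

```python
import hashlib

class ContentAddressableStore:
    """Toy CAS: blocks are keyed by a hash of their content, so each
    unique block is stored only once (deduplication)."""

    def __init__(self):
        self.blocks = {}    # hash -> data
        self.metadata = {}  # LBA -> hash of the block last written there

    def write(self, lba, data):
        key = hashlib.sha1(data).hexdigest()
        # Duplicate content maps to the same key, so this is a no-op
        # for the block store when the data already exists.
        self.blocks[key] = data
        self.metadata[lba] = key

    def read(self, lba):
        # Look up the hash for the address, then fetch by hash.
        return self.blocks[self.metadata[lba]]
```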
So far the theory, but which strategy does Open vStorage use? When we designed Open vStorage, we wanted storage that has great read and write performance and gives us a history of the volumes, so easy snapshots, zero-copy cloning and other features come out of the box. Taking a single one of the above strategies was not an option, as all of them have benefits but, more importantly, all of them have drawbacks. That is why Open vStorage combines all of them: the benefits of one strategy are used to counterbalance the drawbacks of another. To achieve great performance Open vStorage uses the SSDs or PCIe flash cards inside the host for caching. The read cache is implemented as a CAS, as this offers us deduplication and great performance for frequently consulted data. The write cache is implemented using a location-based approach. The storage backend is implemented using a time-based approach by aggregating writes which occurred together in time. This approach gives us features like unlimited zero-copy snapshots, cloning and easy replication.
Before we can start with a deep dive we need to explain how the basic write transaction is implemented. Open vStorage uses a time-aggregated, log-based approach for all writes. When a write is received, the 4k block is appended to a file, a Storage Container Object (SCO). As soon as the SCO reaches 4MB, a new file is created. Meanwhile, for every write, a transaction log (TLOG) records the address, the location (a combination of the SCO name and the offset within the SCO) and a hash of the data. These SCOs and TLOGs are stored on the storage backend once they are no longer required on the SSD inside the host.
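The write path just described can be sketched as follows. This is a simplified model of the mechanism, not Open vStorage code; the class and field names are ours:

```python
import hashlib

BLOCK_SIZE = 4096          # one 4k block per write
SCO_SIZE = 4 * 1024 * 1024 # a new SCO is started at 4MB

class WritePath:
    """Toy model of the SCO/TLOG write path: blocks are appended to the
    current SCO, and every write adds an entry to the transaction log."""

    def __init__(self):
        self.scos = {"sco_0": bytearray()}
        self.current = "sco_0"
        self.counter = 0
        self.tlog = []  # (lba, sco_name, offset, hash)

    def write(self, lba, block):
        assert len(block) == BLOCK_SIZE
        if len(self.scos[self.current]) >= SCO_SIZE:
            # The current SCO is full: start a new one.
            self.counter += 1
            self.current = "sco_%d" % self.counter
            self.scos[self.current] = bytearray()
        sco = self.scos[self.current]
        offset = len(sco)
        sco.extend(block)  # random writes become a sequential append
        self.tlog.append((lba, self.current, offset,
                          hashlib.sha1(block).hexdigest()))
```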
Let’s now put everything together …
- Location-based storage
The write cache works as a transaction-log-based cache on fast, redundant flash or SSD storage. In this transaction log we store the address, the location and a hash. The actual write cache is accomplished by filling up Storage Container Objects (SCOs), files containing a sequence of 4k blocks, which turns any random write I/O pattern into a sequential write operation. During each write, the address of the 4k block, the hash, the SCO number and the offset are stored as metadata in the metadata lookup database. As the metadata contains the exact location of the data for an address, the SCO and the offset within that SCO, it is evident that this is a location-based approach. But why do we also store a hash?
- Content-addressable storage
When a read request is done, the Storage Router will look up the hash in the metadata, which contains the latest state of the volume for each address, and check whether that hash is available in the read cache. This read cache is CAS (content-addressable storage) and stores hash/value combinations on SSD or flash storage. If the hash is present, the read request is served directly from the SSD or flash storage, resulting in very fast read I/O operations. Since most reads are served from the cache, the content of this cache doesn’t change very often, so we don’t pay a large penalty to maintain the hash table. Moreover, by using hashing we can make even better use of the SSD, as it allows us to do content-based deduplication. In case the data is not in the read cache but still in the write cache, because it was only recently written, we can still quickly retrieve it, as the metadata also stores the exact SCO it is in and the offset within that SCO.
In case the requested address is not in the read or write cache we need to go to the storage backend which is time-based.
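The lookup order of the read path can be summarized in one function. Again a hypothetical sketch with our own names; the caches are modeled as plain dictionaries:

```python
BLOCK_SIZE = 4096

def read_block(lba, metadata, read_cache, write_cache_scos, backend_scos):
    """Toy read path: read cache (by hash) first, then the write cache
    (by SCO and offset), then the storage backend."""
    entry = metadata[lba]  # {'hash': ..., 'sco': ..., 'offset': ...}
    # 1. The read cache is content-addressed: keyed by the block's hash.
    if entry["hash"] in read_cache:
        return read_cache[entry["hash"]]
    # 2. The write cache is location-based: the metadata holds the
    #    exact SCO and the offset within it.
    sco = write_cache_scos.get(entry["sco"])
    if sco is None:
        # 3. Fall back to the backend; a real system would also promote
        #    the fetched SCO into the read cache for later accesses.
        sco = backend_scos[entry["sco"]]
    return sco[entry["offset"]:entry["offset"] + BLOCK_SIZE]
```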
- Time-based storage
The Storage Router writes and reads data using SCOs and transaction logs when communicating with the backend. By adding writes to the SCOs in a log-structured, append-only way, data which needs to be evacuated from the write cache is pushed as an object (a SCO) to the storage backend. Next to the SCOs, the transaction logs, containing the sequence of the writes, the address, the location (SCO and offset) and the hash, are also stored on the backend. The combination of the always-append strategy and the address means we have a complete history of all writes done to the volume. The benefit of this approach is that it gives us enterprise features like zero-copy snapshots and cloning. Time-based storage also requires maintenance to compact older SCOs or clean up deleted snapshots. Because all transaction logs and SCOs are stored on the backend, these maintenance tasks can be offloaded entirely from the Storage Router on the host. The Scrubber, the process that does the maintenance of the time-based storage, can work completely independently from the Storage Router, as it has access to all transaction logs and SCOs stored on the backend storage. Once the Scrubber has finished, it creates an updated set of transaction logs that is used by the Storage Router to update the local metadata and to delete the obsolete SCOs on the backend. Because of the caching in the Storage Router, the maintenance work does not impact performance, as most read and write I/O requests are served from the read and write cache.
In the event of a disaster where the complete volume is lost, the volume can be rebuilt on another host from the storage backend alone. The only thing needed to get the volume back to its latest state is to fetch all transaction logs from the backend and replay them, so the metadata contains for each address the latest location of the data. When a read request comes in, you only need to fetch the correct SCO from the backend and put it in the read cache for quick access.
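The replay step is conceptually simple, as this hypothetical sketch shows: walking the transaction logs from oldest to newest leaves the metadata pointing at the most recent location for every address.

```python
def rebuild_metadata(tlogs):
    """Toy TLOG replay: tlogs is a list of transaction logs, oldest
    first; each log is a list of (lba, sco, offset, hash) entries."""
    metadata = {}
    for tlog in tlogs:
        for lba, sco, offset, digest in tlog:
            # Later entries overwrite earlier ones, so after the replay
            # each address points at its latest location.
            metadata[lba] = {"sco": sco, "offset": offset, "hash": digest}
    return metadata
```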
Open vStorage uses different approaches for storing and reading data. On the frontend, on the SSDs inside the host, we use a content-based read cache which offers performance and deduplication across volumes. The write cache makes sure that data is quickly written in an always-append mode. A location-based approach is used for this cache, so a miss in the read cache can be quickly covered by the write cache if the data is recent. When data is no longer needed in the write cache, it gets pushed in a time-aggregated fashion (as SCOs) to the backend, together with the transaction logs. As the backend is implemented using a time-based approach, snapshots, zero-copy cloning and easy replication come out of the box.
In our last blog post we discussed the Open vStorage Cinder Plugin. We had some people ask whether this means that you can now use Swift directly as primary storage for Virtual Machines. The short answer is: YES!
A traditional OpenStack setup looks like the below reference architecture.
You have Nova, which provisions the VM. Cinder provides block storage and Glance provides the image to deploy the VM. Swift is also used, but only as a repository for images and backups. You can use Cinder natively, but this isn’t highly available or enterprise-grade, so you need ‘something distributed’ to actually store the blocks. This can be a SAN (Dell, HP, EMC, …) or a Ceph distributed storage platform. This means you are maintaining 2 storage platforms: 1 for object storage (Swift) and 1 for block storage (SAN, Ceph, …).
Maintaining one storage platform is already hard enough, so why would you maintain two? This is where Open vStorage comes to the rescue. It allows you to turn OpenStack Swift into block storage. This means you can now use Swift both for object and for block storage. The only thing you need to do is install Open vStorage and configure its Cinder Plugin. When you create a volume in OpenStack, the Cinder API will call the Open vStorage API to create a disk. The same happens when a snapshot is created. On top of that, Open vStorage also brings VM-centric storage management to OpenStack.
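Wiring the plugin into Cinder follows the standard multi-backend pattern. The fragment below is a hypothetical cinder.conf sketch: `enabled_backends`, `volume_driver` and `volume_backend_name` are standard Cinder options, but the driver path and the `vpool_name` option are assumptions on our part; check the plugin’s documentation for the exact names.

```ini
[DEFAULT]
# Register the Open vStorage backend with Cinder (standard option).
enabled_backends = open-vstorage

[open-vstorage]
# Hypothetical driver path and vPool option; see the plugin docs.
volume_driver = cinder.volume.drivers.openvstorage.OVSVolumeDriver
volume_backend_name = open-vstorage
vpool_name = myvpool
```

After restarting the cinder-volume service, volumes created against this backend end up on the Open vStorage vPool, and thus on Swift.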
Interested in seeing how the Cinder Plugin works? Check the demo video below:
The source code for the plugin can be found here.
The steps to set up an environment can be found here.