The Open vStorage High Performance Read Mesh (HPRM)

When you are developing a storage solution, your biggest worry is data loss. As an Open vStorage platform can lose a server or even a complete data center without actual data loss, we are pretty confident we have that base covered. The next challenge is to make sure that safely stored data can also be accessed quickly when needed. In this blog section we have already discussed many of the performance improvements we made over the past releases: we introduced the Edge component for guaranteed performance, the accelerated ALBA read cache, multiple proxies per Volume Driver and various performance tuning options.

Today it is time to introduce the latest performance improvement: the High Performance Read Mesh (HPRM). HPRM is an optimization of the read path which allows the compute host to fetch data directly from the drives on which it is stored. Previously the read path always had to pass through the Volume Driver before the data was fetched from the ASDs. This new short read path can only be taken when the Edge has the necessary metadata describing where (SCO, fragment, disk) each LBA’s data is stored. When the Edge doesn’t have the needed metadata, for example because the cached metadata is outdated, the slow path through the Volume Driver is taken. Nothing changes for the write path: all writes still go through the Volume Driver.

The short read path which bypasses the Volume Driver has 2 direct advantages: lower read latency and less network traffic, as data crosses the network only once. The introduction of HPRM also allows for a cost reduction on the hardware front. Since the hosts running the Volume Driver are in many cases no longer part of the read path, they are freed up to focus on processing incoming writes. This means the ratio between compute hosts running the Edge and hosts running the Volume Driver can be increased. Since the Volume Driver hosts are typically beefy servers with expensive NVMe devices for the write buffer and the distributed databases, a significant change in the compute/Volume Driver ratio means a significant reduction of the hardware cost.

HPRM, the technical details

Let’s have a look under the hood at how HPRM works. First we will have a look at the write path. The application, e.g. the hypervisor, writes to the block device exposed by the Edge client. The Edge client connects to its server part, which in turn writes the data to the write buffer of the Volume Driver. Once enough writes have accumulated in the buffer, a SCO (Storage Container Object) is created and dispatched to the ALBA backend through the proxy. The proxy makes sure the data is spread across different ASDs according to the specified ALBA preset. Which ASDs contain the fragments of the SCO is stored in a manifest.
When a read comes in for an LBA, the Edge client checks its local metadata cache for the SCO info and the manifest of that SCO. If the info is available, the Edge fetches the LBA data through the PRACC (Partial Read ACCelerator) client, which reads directly from the ASDs. If the info isn’t available in the cache, or if it is outdated, the Edge client retrieves the manifest and SCO info from the Volume Driver and stores them in its metadata cache.
The Edge also pushes the IO statistics to the Volume Driver so they can be queried by the Framework or the monitoring components. Gathering IO statistics is done by the Edge as it is the only component with a view on both the fast path, through the PRACC, and the slow path, through the Volume Driver.
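In pseudo-code the read path looks roughly as follows. This is a conceptual sketch only; the names (metadata_cache, pracc_client, volume_driver) are illustrative and do not correspond to the actual Edge internals, which are implemented in C++.

def read(lba, metadata_cache, pracc_client, volume_driver):
    # Conceptual sketch of the HPRM read path; names are illustrative.
    entry = metadata_cache.lookup(lba)                # SCO info + manifest, if cached
    if entry is not None and not entry.outdated:
        # Fast path: fetch the fragments directly from the ASDs via PRACC.
        return pracc_client.read(entry.sco, entry.offset, entry.manifest)
    # Slow path: ask the Volume Driver, which also returns fresh metadata.
    data, fresh_entry = volume_driver.read_with_metadata(lba)
    metadata_cache.store(lba, fresh_entry)            # the next read can take the fast path
    return data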


Note that the High Performance Read Mesh is part of the Open vStorage Enterprise Edition. Contact us for more info on the Open vStorage Enterprise Edition.

Fargo GA

After 3 Release Candidates and extensive testing, the Open vStorage team is proud to announce the GA (General Availability) release of Fargo. This release is packed with new features. Allow us to give a small overview:

NC-ECC presets (global and local policies)

NC-ECC (Network Connected-Error Correction Code) is an algorithm to store Storage Container Objects (SCOs) safely across multiple data centers. It consists of a global preset, spanning data centers, and multiple local presets, each within a single data center. The NC-ECC algorithm is based on forward error correction codes and is further optimized for a multi-data-center approach. When there is a disk or node failure, additional chunks will be created using only data from within the same data center. This ensures the bandwidth between data centers isn’t stressed in case of a simple disk failure.
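To make the global/local split concrete, the sketch below shows how such presets could be modelled. The structure and parameter names are invented for this example and are not the actual ALBA preset syntax.

# Illustrative only: not the real ALBA preset format.
global_preset = {
    "name": "global-3dc",
    "data_centers": ["dc1", "dc2", "dc3"],
    "fragments_per_data_center": 8,   # fragments of a SCO are spread over three sites
}
local_preset = {
    "name": "local-dc1",
    "data_fragments": 6,
    "parity_fragments": 2,            # a disk failure is repaired with data from dc1 only
}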

Multi-level ALBA

The ALBA backend now supports different levels. An all-SSD ALBA backend can be used as a performance layer in front of the capacity tier. Data is removed from the cache layer using a random eviction or Least Recently Used (LRU) strategy.
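As a reminder of how LRU eviction behaves, here is a minimal sketch using a Python OrderedDict; the actual eviction logic of course lives inside the ALBA cache layer itself.

from collections import OrderedDict

class LRUCache:
    # Tiny LRU cache: the least recently used entry is evicted first.
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry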

Open vStorage Edge

The Open vStorage Edge is a lightweight block driver which can be installed on Linux hosts and connects with the Volume Driver over the network (TCP/IP). By splitting the Volume Driver and the Edge into separate components, compute and storage can scale independently.

Performance optimized Volume Driver

By limiting the size of a volume’s metadata, the metadata now fits completely in RAM. To keep the metadata at an absolute minimum, deduplication was removed. You can read more about why we removed deduplication here. Other optimizations are multiple proxies per Volume Driver (the default is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same data center instead of going over the network to another data center).

Multiple ASDs per device

For low latency devices adding multiple ASDs per device provides a higher bandwidth to the device.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge. With Fargo all config files are now stored in a distributed config management system on top of our distributed database, Arakoon. More info can be found here.

Ubuntu 16.04

Open vStorage is now supported on Ubuntu 16.04, the latest Long Term Support (LTS) version of Ubuntu.

Smaller features in Fargo:

  • Improved the speed of the non-cached API and GUI queries by a factor of 10 to 30.
  • Hardened the remove node procedure.
  • The GUI is adjusted to better highlight clusters which are spread across multiple sites.
  • The failure domain concept has been replaced by tag based domains. ASD nodes and storage routers can now be tagged with one or more tags. Tags can be used to identify a rack, site, power feed, etc.
  • 64TB volumes.
  • Browsable API with Swagger.
  • ‘asd-manager collect logs’, identical to ‘ovs collect logs’.
  • Support for the removal of the asd-manager packages.

Since this Fargo release introduces a completely new architecture (you can read more about it here) there is no upgrade possible between Eugene and Fargo. The full release notes can be found here.

The Edge, a lightweight block device

When I present the new Open vStorage architecture for Fargo, I almost always receive the following Edge question:

What is the Edge and why did you develop it?

What is the Edge about?

The Edge is a lightweight software component which can be installed on a Linux host. It exposes a block device API and connects to the Storage Router across the network (TCP/IP or RDMA). Basically, the application believes it talks to a local block device (the Edge) while the volume actually runs on another host (the Storage Router).

Why did we develop the Edge?

The reason why we have developed the Edge is quite simple: componentization. With Open vStorage we are mainly dealing with large, multi-petabyte deployments and having this Edge component gives additional benefits in large environments:

Scalability

In large environments you want to be able to scale the compute and storage parts independently. If you run Open vStorage hyper-converged, as advised with earlier versions, this isn’t possible: if you need more RAM or CPU to run VMs, you also have to invest in more SSDs. With the Edge you can scale compute and storage independently.

Guaranteed performance

With Eugene the Volume Driver, the high performance distributed block layer, was running on the compute host together with the VMs. This resulted in the VMs and the Volume Driver fighting for the same CPU and RAM resources, a typical issue with hyper-converged solutions. The Edge component avoids this problem: it runs on the compute hosts and requires only a small amount of resources, while the Volume Driver runs on dedicated nodes and hence provides a predictable and consistent amount of IOPS to the VMs.

Limit the Impact of Updates

Storage software updates are a (storage) administrator’s worst nightmare. In previous Open vStorage versions an update of the Volume Driver required all VMs on that node to be migrated or brought down. With the Edge the Volume Driver can be updated in the background, as each Edge/compute host has HA features and can automatically connect to another Volume Driver on request, without the need for a VM migration.

Fargo RC2

We released Fargo RC2. The biggest new items in this release:

  • Multiple performance improvements such as multiple proxies per Volume Driver (the default is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same data center instead of going over the network to another data center).
  • API to limit the amount of data that gets loaded into the memory of the Volume Driver host. Instead of loading all metadata of a vDisk into RAM, you can now specify the percentage of RAM it may use.
  • Counter which keeps track of the number of invalid checksums per ASD so we can flag bad ASDs faster.
  • Configuring the scrub proxy to be cache-on-write.
  • Implemented timeouts for the volume driver calls.

The team also solved 110 issues between RC1 and RC2. An overview of the complete content can be found here: Added Features | Added Improvements | Solved Bugs

Dedupe: The good the bad and the ugly

Over the years a lot has been written about deduplication (dedupe) and storage. There are people who are dedupe aficionados and there are dedupe haters. At Open vStorage we take a pragmatic approach: we use deduplication when it makes sense. When the team behind Open vStorage designed a backup storage solution 15 years ago, we developed the first CAS (Content Addressed Storage) based backup technology. Using this deduplication technology, customers required 10 times less storage for typical backup workloads. As said, we use deduplication when it makes sense, and that is why we have decided to disable the deduplication feature in our latest Fargo release.

What is deduplication?

Deduplication is a technique for eliminating duplicate copies of data. This is done by identifying and fingerprinting unique chunks of data. When a duplicate chunk is found, it is replaced by a reference or pointer to the first encountered copy of that chunk. As the pointer is typically much smaller than the actual chunk of data, the amount of storage space needed to store the complete data set is reduced.
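A minimal sketch of that idea, assuming fixed-size chunks and an MD5 fingerprint (chosen here only because it is 128 bits, matching the hash size mentioned further down; real implementations pick their own hash and chunk size):

import hashlib

CHUNK_SIZE = 4096   # illustrative chunk size

def dedupe(data, store, index):
    # 'store' maps fingerprint -> chunk data, 'index' is the ordered list of pointers.
    for pos in range(0, len(data), CHUNK_SIZE):
        chunk = data[pos:pos + CHUNK_SIZE]
        fingerprint = hashlib.md5(chunk).digest()   # 128-bit fingerprint
        if fingerprint not in store:
            store[fingerprint] = chunk              # first occurrence: keep the data
        index.append(fingerprint)                   # every chunk is referenced by its small fingerprint
    return store, index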

The Good, the Bad, the Ugly

The Good
Deduplication can be a real lifesaver when you need to store a lot of data on a small device. The deduplication ratio, the amount of storage reduction, can be quite substantial when there are many identical chunks of data (think of identical OS disks) and when the size of the chunks is orders of magnitude larger than the size of the pointer/fingerprint.

The Bad
Deduplication can be CPU intensive. It requires fingerprinting each chunk of data, and fingerprinting (calculating a hash) is an expensive CPU operation. This performance penalty introduces additional latency in the IO write path.

The Ugly
The bigger the chunks, the less likely they are to be duplicates, as even a single changed bit makes two chunks no longer identical. But the smaller the chunks, the smaller the ratio between the chunk size and the fingerprint. As a consequence, the memory footprint for storing the fingerprints can become very large when a lot of data needs to be stored and the chunk size is small. Especially in large scale environments this is an issue, as the hash table in which the fingerprints are stored can become too big to fit in memory.
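To put a number on that memory footprint, a back-of-the-envelope calculation with illustrative figures (these are not Open vStorage internals):

# Fingerprint table size for 1 PiB of data, 4 KiB chunks, 128-bit fingerprints.
data_size = 1 << 50                      # 1 PiB
chunk_size = 4 * 1024                    # 4 KiB
fingerprint_size = 16                    # 128 bits = 16 bytes

chunks = data_size // chunk_size         # ~275 billion chunks
table_size = chunks * fingerprint_size
print(table_size / (1 << 40), "TiB")     # -> 4.0 TiB of fingerprints, far beyond RAM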

Another issue is that the hash table might get corrupted, which basically means your whole storage system is corrupt: the data is still on disk, but you have lost the map of where every chunk is stored.

Block storage reality

It is obvious that deduplication only makes sense when the data to be stored contains many duplicate chunks. Today’s applications often already have deduplication built in at the application level, or generate blocks which can’t be deduped. Hence enabling deduplication introduces a performance penalty (additional IO latency, heavier CPU usage, …) without any significant space savings.

Deduplication also made sense when SSDs were small and expensive compared with traditional SATA drives. By using deduplication it was possible to store more data on the SSD while the penalty of the deduplication overhead was still small. With the latest generation of NVMe drives both arguments have disappeared. The capacity of NVMe drives is almost on par with SATA drives and the cost has decreased significantly. The latency of these devices is also extremely low, bringing it in range of the overhead introduced by deduplication. The penalty of deduplication is simply too big when using NVMe.

At Open vStorage we try to make the fastest possible distributed block storage solution. To keep the performance consistently fast it is essential that the metadata fits completely in RAM. Every time we need to go to an SSD for metadata, the performance drops significantly. With deduplication enabled, the metadata size per LBA entry was 8 bits for the SCO and offset plus 128 bits for the hash. Hence by eliminating deduplication we can store 16 times more metadata in RAM. Or, in our case, we can address a storage pool which is 16 times bigger with the same performance as with deduplication enabled.
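The arithmetic behind that factor, using the per-entry sizes quoted above:

# Metadata per LBA entry, in bits, using the figures quoted above.
reference_bits = 8        # SCO + offset reference, needed with or without dedupe
hash_bits = 128           # fingerprint, only needed for deduplication
print((reference_bits + hash_bits) / reference_bits)   # -> 17.0, roughly the factor 16 quoted above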

One final remark: Open vStorage still uses deduplication when a clone is made from a volume. The clone and its parent share the data up to the point at which the volume is cloned, and only the changes to the cloned volume are stored on the backend. This can easily and inexpensively be achieved with the 8-bit entries, as the clone and its parent share the same SCOs and offsets.

Performance Tuning

At Open vStorage we now have various large clusters which easily deliver multiple millions of IOPS. For some customers it is even a prestige project to produce the highest number of IOPS on their Open vStorage dashboard. Out of the box Open vStorage will already give you very decent performance, but there are a few nuts and bolts you can tweak to increase the performance of your environment. There is no golden rule to increase the performance, but below we share some tips and pointers:

vDisk Settings
The most obvious way to influence the IO performance of a vDisk is by selecting the appropriate settings in the vDisk detail page. The impact of the DTL setting was already covered in a previous blog post, so we will skip it here.
Deduplication also has an impact on the write IO performance of the vDisk. If you know the data isn’t suited for deduplication, don’t turn it on. As we have large read caches, we only enable the dedupe feature for OS disks.
Another setting we typically set at the vPool level is the SCO size. To increase the write performance you typically want to select a large Storage Container Object (SCO) size to minimize the overhead of creating and closing SCOs. Also, backends are typically very good at writing large chunks of sequential data, so a big SCO size makes sense. But as usual there is a trade-off. With traditional backends like Swift, Ceph or any other object store for that matter, you need to retrieve the whole SCO from the backend in case of a cache miss. A bigger SCO in that case means more read latency on a cache miss. This is one of the reasons why we designed our own backend, ALBA. With ALBA you can retrieve a part of a SCO from the backend: instead of getting a 64MiB SCO, we can get the exact 4k we need from it. ALBA is the only object storage that currently supports this functionality. In large clusters with ALBA as backend we typically set 64MiB as SCO size. In case you don’t use ALBA, use a lower SCO size.
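A quick illustration of why partial reads matter with a large SCO size, using the numbers from the example above:

# Read amplification on a 4 KiB cache miss with a 64 MiB SCO.
sco_size = 64 * 1024 * 1024     # full SCO fetch (Swift, Ceph, ...)
read_size = 4 * 1024            # what the application actually asked for
print(sco_size // read_size)    # -> 16384 times more data transferred than needed
# With ALBA's partial read, only the requested 4 KiB is fetched from the SCO.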

Optimize the Backends
One of the less obvious items which can make a huge difference in performance is the right preset. A preset consists of a set of policies, a compression method (optional) and whether encryption should be activated.
You might ask why tuning the backend would influence the performance on the front-end towards the VM. The performance of the backend will for example influence the read performance in case of a cache miss. Also on writes the backend might become the bottleneck for incoming data. All writes go into the write buffer, which is typically sized to contain a couple of SCOs. This is fine as long as your backend is fast enough: once a SCO is full, it is saved on the backend and removed from the write buffer, making room for newly written data. If the backend is too slow to keep up with what comes out of the write buffer, Open vStorage will start throttling the ingest of data on the frontend. So it is essential to have a look at your backend performance when it is the bottleneck for the write performance.
Since we typically set the SCO size to 64MiB and consider 4MiB a good fragment size, we change the policy to have 16 data fragments. The other parameters depend on the required reliability and the number of hosts used for storage.
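The fragment count follows directly from those two numbers; the parity count below is purely illustrative and depends on your reliability requirements:

# 64 MiB SCO with 4 MiB fragments -> 16 data fragments.
sco_size = 64 * 1024 * 1024
fragment_size = 4 * 1024 * 1024
data_fragments = sco_size // fragment_size   # -> 16
parity_fragments = 4                         # illustrative, tune for reliability and host count
print(data_fragments + parity_fragments)     # a SCO then occupies 20 fragments on the backend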
Compression is typically turned on, but it gives distorted results when running a typical random IO benchmark, as random data is hard to compress. Data which is hard to compress can even become bigger after compression and hence take more time to store. Basically, if you are running benchmarks with random IO it is best to turn compression off.

In case you need help in tweaking the performance of your environment, feel free to contact us.

QEMU, Shared Memory and Open vStorage

QEMU, Shared Memory and Open vStorage: it sounds like the beginning of a bad joke, but actually it is a very cool story. Open vStorage quietly released in its latest version a Shared Memory Client/Server integration with the Volume Driver (the component that offers the fast, distributed block layer). With this implementation the client (QEMU, Blktap, …) writes to a dedicated memory segment on the compute host which is shared with the Shared Memory Server in the Volume Driver. For the moment the Shared Memory client only understands block semantics, but in the future we will add file semantics so an NFS server can be integrated.
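The mechanism is easiest to picture with a tiny shared memory example. The sketch below uses Python's multiprocessing.shared_memory purely as an illustration; the actual client/server implementation in QEMU and the Volume Driver is written in C++.

from multiprocessing import shared_memory

# "Client" side: create a segment and place a block of data in it.
segment = shared_memory.SharedMemory(name="ovs_shm_demo", create=True, size=4096)
segment.buf[0:5] = b"hello"              # data lands directly in the shared segment

# "Server" side (normally another process): attach by name and read the data.
server_view = shared_memory.SharedMemory(name="ovs_shm_demo")
print(bytes(server_view.buf[0:5]))       # -> b'hello', without an extra user/kernel copy

# Cleanup.
server_view.close()
segment.close()
segment.unlink()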

The benefits of the Shared Memory approach are very tangible:

  • As everything is in user space, data copies from user to kernel space are eliminated, so the IO performance is about 30-40% higher.
  • CPU consumption is about half for the same IO performance.
  • It is an easy way to build additional interfaces (e.g. block devices, iSCSI, …) on top.

We haven’t integrated our modified QEMU build with Libvirt so at the moment some manual tweaking is still required if you want to give it a go:

Install the volumedriver-dev packages:

sudo apt-get install volumedriver-dev

By default the Shared Memory Server is disabled. To enable it, update the vPool JSON (/opt/OpenvStorage/config/storagedriver/storagedriver/vpool_name.json) and add the entry “fs_enable_shm_interface”: true under the filesystem section. After adding the entry, restart the Volume Driver for the vPool (restart ovs-volumedriver_vpool_name).
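A minimal sketch of that edit in Python, assuming the config has (or should get) a "filesystem" section as described above; verify the actual file layout before applying anything like this:

# Enable the Shared Memory Server in the vPool config; assumes the layout
# described above with a "filesystem" section. Check the file before editing.
import json

path = "/opt/OpenvStorage/config/storagedriver/storagedriver/vpool_name.json"
with open(path) as f:
    config = json.load(f)

config.setdefault("filesystem", {})["fs_enable_shm_interface"] = True

with open(path, "w") as f:
    json.dump(config, f, indent=4)

# Afterwards, restart the Volume Driver: restart ovs-volumedriver_vpool_name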
Next, build QEMU from the source. You can find the source here.

git clone https://github.com/openvstorage/qemu.git
cd qemu/
./configure
make
sudo make install

There are 2 ways to create a QEMU vDisk:
Use QEMU to create the disk:

qemu-img create openvstorage:volume 10G

Alternatively create the disk in FUSE and start a VM by using the Open vStorage block driver:

truncate -s 10G /mnt//volume
qemu -drive file=openvstorage:volume,if=virtio,cache=none,format=raw ...

Open vStorage 2.2 alpha 1

Today we released a first version of our upcoming 2.2 release. From now on we will do more frequent releases covering the latest changes, but keep in mind that these releases have not gone through our full QA cycle. These ‘alpha’ versions give you, the Open vStorage community, earlier access to new features and bugfixes, but on the other hand they are less stable than our ‘beta’ releases. Documentation for these releases will also not always be available. In case you need help, the Open vStorage Google Groups is there to help.

What is new in the 2.2 alpha 1:

  • Huge VMware performance improvement: we have reworked our NFS integration with VMware and made significant performance improvements (5-10x faster). Please note this is still experimental and e.g. cloning from template on VMware will not work in this version. But by all means, give it a go and let us know your experience!
  • The status of the physical devices (SSDs and SATA drives) of a Storage Router is now shown in the GUI on the Storage Router detail page. You can also see in detail which partitions are located on which device. In a later stage we plan to make the partitioning adjustable through the GUI.
  • We have improved the performance and reduced the CPU impact of the GUI.

Small feature improvements:

  • Added a check in the OVS setup which prevents rerunning the setup.
  • Cinder gets automatically configured if you configured OpenStack as Hypervisor Management Center.
  • ovs-snmp port is now configurable.
  • Option to add a password when a new user is created.
  • Rename of an OpenStack volume updates the vDisk name.
  • Added the possibility to install the Open vStorage Backend packages after configuring Open vStorage.
  • Improvements to the performance of the ASDs.
  • Option to remove an Open vStorage Backend.
  • Option to define the replication factor of an Open vStorage Backend.
  • Option to enable compression for a Storage Backend.
  • ASD nodes can now be collapsed in the Backend detail page.
  • Highlight the ASDs on which an action applies in the Backend details page.
  • Impact of removing an ASD is made clear.

Bugfixes:

  • Fixed the issue where ASDs are labeled as dead under high load.
  • Initializing a new disk (as replacement disk of a broken disk) fails.
  • Open vStorage port range 8870+ overlaps with c-api port 8876 causing n-api service on devstack to fail to restart with address already in use.
  • vDisk naming is now more consistent with the reality.
  • Fix for multiple vPools using the same read cache path.
  • Hardening vPool creation.
  • Bugfixes for various issues with the Open vStorage Backend.
  • Fix for dmesg output does not show up in syslog or kern.log
  • Failed to create an ASD if a filesystem exists on the disk.
  • Timestamps not being added in upstart logs.
  • Fix for sync disk with reality sometimes fails.
  • Incorrect permission on ovs user’s .ssh folder causes login using authorized_keys to fail.
  • Fix for issues with rabbitmqctl during install.
  • ovs collect logs doesn’t collect all logs through the GUI.
  • Metadataserver quickly fills up root partition.

How do you install this version:
When installing, add the alpha repo instead of the beta repo.

echo "deb http://apt-ovs.cloudfounders.com alpha/" > /etc/apt/sources.list.d/ovsaptrepo.list

For people using OpenStack:
Before creating a vPool, add the OpenStack controller node as Hypervisor Management Center (Admin > Hypervisor Management Center) and select all hosts on the second part of the screen. When you create a vPool, Cinder will now be automatically deployed and configured. The nova and libvirtd changes as listed in the documentation still need to be applied to the compute hosts though.

Open vStorage 1.5

During the summer the Open vStorage Team has worked very hard. With this new release we can proudly present:

  • Feature complete OpenStack Cinder Plugin: our Cinder Plugin has been improved and meets the minimum features required to be certified by the OpenStack community.
  • Flexible cache layout: in 1.5.0 you have the capability to easily configure multiple SSD devices for the different caching purposes. During the setup you can choose which SSDs to partition and later, when creating a vPool, you can select which caching device should be used for read, write and write cache protection. These can from now on be spread over different SSD devices or consolidated into the same SSD device, depending on the available hardware and needs.
  • User management: an admin can now create more users which have access to the GUI.
  • Framework performance: a lot of work has been put into improving the performance when a lot of vDisks and vMachines are created. Improvements of up to 50% have been reached in some cases.
  • Improved API security by means of implementing OAuth2 authentication. A rate-limit has also been imposed on API calls to prevent brute force attacks.

Fixed bugs and small items:

  • GUI now prevents creation of vPools with a capital letter.
  • Implemented recommendation for a security exploit on elasticsearch 1.1.1.
  • Fix for validation of vPools being stuck on validating.
  • Protection against reusing vPool names towards the same backend.
  • Fix for the Storage Router online/offline detection which failed when openstack was also installed.

Next, we also took the first step towards supporting operating systems other than Ubuntu (RedHat/CentOS). We have created rpm versions of our volumedriver and arakoon packages. These are tested on "Linux Centos7 3.10.0-123.el7.x86_64" and can be downloaded from our package server. This completes a first important step towards getting Open vStorage RedHat/CentOS compatible.

Webscale 2.0

As a Product Manager I’m very often on challenging calls with potential users of Open vStorage and one of the questions that comes back on almost every call is:

How scalable is Open vStorage?

It is a question that is easy to answer: extremely scalable. Open vStorage is built from the ground up to support environments with 100+ hosts. It is designed to be used in large data centers as the primary storage platform for all types of Virtual Machine workloads. I’m aware that the term scalable is a bit ambiguous and can have different meanings: did the enquirer mean storage capacity scalability or performance scalability? Well, Open vStorage scales both ways. For storage capacity, the scalability is mostly limited by the selected backend. For example, with Swift as the storage backend of a vPool, you can almost infinitely add disks or storage nodes to enlarge the storage pool. Swift is after all designed with massive scalability as its main development mantra and it has shown this quality in the production environments of Disney and Rackspace, amongst many others.
Performance scalability is also not a problem. Adding more hosts running the Open vStorage software linearly scales the performance. As each host has one or more SSDs or PCIe flash cards on board, every host added to the Open vStorage environment increases the amount of data that can be stored in the cache.

Does that mean Open vStorage is webscale?

No, unlike other hyperconverged storage solutions, we are not webscale. We are webscale 2.0. The reason we can call Open vStorage webscale 2.0 is that it decouples storage scalability from performance scalability. This allows for asymmetric architectures: it makes no sense to have to add more storage capacity just to improve the performance of your storage solution. Open vStorage is the only solution which allows you to independently scale performance and capacity at massive scale. Not only is Open vStorage tailored to the needs of large environments with petabytes of data and a battery of compute power, it can also address the needs of a typical enterprise. Whether that typical enterprise has lots of data with limited compute power or vice versa, Open vStorage is up for the job.