The Edge, a lightweight block device

When I present the new Open vStorage architecture for Fargo, I almost always receive the following question about the Edge:

What is the Edge and why did you develop it?

What is the Edge about?

The Edge is a lightweight software component which can be installed on a Linux host. It exposes a block device API and connects to the Storage Router across the network (TCP/IP or RDMA). Basically, the application believes it is talking to a local block device (the Edge) while the volume actually runs on another host (the Storage Router).
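For illustration, a client such as QEMU can address a volume exposed by the Edge through a network URI. The exact syntax below is an assumption (host, port and volume name are placeholders) and may differ per release:

# Hypothetical sketch: create a volume through the Edge and attach it to a VM over TCP.
# 10.0.0.1:26203 and "myvolume" are placeholders; the openvstorage+tcp URI form is assumed.
qemu-img create openvstorage+tcp:10.0.0.1:26203/myvolume 10G
qemu -drive file=openvstorage+tcp:10.0.0.1:26203/myvolume,if=virtio,cache=none,format=raw ...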

Why did we develop the Edge?

The reason why we developed the Edge is quite simple: componentization. With Open vStorage we are mainly dealing with large, multi-petabyte deployments, and having this Edge component brings additional benefits in large environments:

Scalability

In large environments you want to be able to scale the compute and storage part independently. If you run Open vStorage hyper-converged, as advised with earlier versions, this isn’t possible. As a consequence, if you needed more RAM or CPU to run VMs, you also had to invest in more SSDs. With the Edge you can scale compute and storage independently.

Guaranteed performance

With Eugene, the Volume Driver, the high-performance distributed block layer, ran on the compute host together with the VMs. This resulted in the VMs and the Volume Driver fighting for the same CPU and RAM resources, a typical issue with hyper-converged solutions. The Edge component avoids this problem: it runs on the compute hosts and requires only a small amount of resources, while the Volume Drivers run on dedicated nodes and hence provide a predictable and consistent amount of IOPS to the VMs.

Limit the Impact of Updates

Storage software updates are a (storage) administrator’s worst nightmare. In previous Open vStorage versions an update of the Volume Driver required all VMs on that node to be migrated or brought down. With the Edge the Volume Driver can be updated in the background, as each Edge/compute host has HA features and can automatically connect to another Volume Driver on request, without the need for a VM migration.

Fargo RC2

We released Fargo RC2. The biggest new items in this release:

  • Multiple performance improvements, such as multiple proxies per volume driver (the default is 2), bypassing the proxy and going straight from the volume driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same datacenter instead of going over the network to another datacenter).
  • API to limit the amount of metadata that gets loaded into the memory of the volume driver host. Instead of loading all metadata of a vDisk into RAM, you can now specify the percentage it may occupy in RAM.
  • Counter which keeps track of the number of invalid checksums per ASD so we can flag bad ASDs faster.
  • Configuring the scrub proxy to be cache on write.
  • Implemented timeouts for the volume driver calls.

The team also solved 110 issues between RC1 and RC2. An overview of the complete content can be found here: Added Features | Added Improvements | Solved Bugs

Dedupe: The Good, the Bad and the Ugly

Over the years a lot has been written about deduplication (dedupe) and storage. There are people who are dedupe aficionados and there are dedupe haters. At Open vStorage we take a pragmatic approach: we use deduplication when it makes sense. When the team behind Open vStorage designed a backup storage solution 15 years ago, we developed the first CAS (Content Addressed Storage) based backup technology. Using this deduplication technology, customers required 10 times less storage for typical backup processes. As said, we use deduplication when it makes sense and that is why we have decided to disable the deduplication feature in our latest Fargo release.

What is deduplication?

Deduplication is a technique for eliminating duplicate copies of data. This is done by identifying and fingerprinting unique chunks of data. When a duplicate chunk of data is found, it is replaced by a reference or pointer to the first encountered chunk of data. As the pointer is typically smaller than the actual chunk of data, the amount of storage space needed to store the complete set of data can hence be reduced.
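As a minimal, hypothetical illustration of the idea (file name and chunk size are arbitrary), you can fingerprint fixed-size chunks of a file from the shell and count how many fingerprints repeat; every repeated fingerprint marks a chunk that deduplication would store only once:

# Split a sample file into fixed-size 4 KiB chunks (names and sizes are just examples).
split -b 4096 data.img chunk_
# Fingerprint each chunk and count duplicates; any count > 1 is a dedupe candidate.
sha256sum chunk_* | awk '{print $1}' | sort | uniq -c | sort -rn | head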

The Good, the Bad, the Ugly

The Good
Deduplication can be a real lifesaver in case you need to store a lot of data on a small device. The deduplication ratio, the amount of storage reduction, can be quite substantial when there are many identical chunks of data (think the same OS) and when the size of the chunks is a couple of orders of magnitude larger than the size of the pointer/fingerprint.

The Bad
Deduplication can be CPU intensive. It requires fingerprinting each chunk of data, and fingerprinting (calculating a hash) is an expensive CPU operation. This performance penalty introduces additional latency in the IO write path.
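To get a feel for that cost (a rough, illustrative measurement; absolute numbers depend entirely on your CPU), you can time a hash over a modest amount of data:

# Generate 256 MiB of sample data and time the fingerprinting step.
dd if=/dev/urandom of=/tmp/sample.bin bs=1M count=256
time sha256sum /tmp/sample.bin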

The Ugly
The bigger the size of the chunk, the less likely chunks will be duplicates, as even the smallest change of a single bit means the chunks are no longer identical. But the smaller the chunks, the smaller the ratio between the chunk size and the fingerprint. As a consequence, the memory footprint for storing the fingerprints can be large when a lot of data needs to be stored and the chunk size is small. Especially in large-scale environments this is an issue, as the hash table in which the fingerprints are stored can become too big to fit in memory.
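A quick back-of-the-envelope calculation (with assumed, purely illustrative numbers) shows how fast this grows: 1 PiB of data split into 4 KiB chunks with 20-byte fingerprints already needs roughly 5 TiB just for the fingerprint table:

# 1 PiB of data in 4 KiB chunks, 20 bytes of fingerprint per chunk, expressed in TiB.
echo $(( 1024**5 / 4096 * 20 / 1024**4 )) TiB   # -> 5 TiB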

Another issue is that the hash table might get corrupted, which basically means your whole storage system is corrupt: the data is still on disk, but you have lost the map of where every chunk is stored.

Block storage reality

It is obvious that deduplication only makes sense in case the data to be stored contains many duplicate chunks. Today’s applications already have deduplication built-in at the application level or generate blocks which can’t be deduped. Hence enabling deduplication introduces a performance penalty (additional IO latency, heavier CPU usage, …) without any significant space savings.

Deduplication also made sense when SSDs were small and expensive compared with traditional SATA drives. By using deduplication it was possible to store more data on the SSD while the penalty of the deduplication overhead was still small. With the latest generation of NVMe drives both arguments have disappeared. The size of NVMe drives is almost on par with SATA drives and the cost has decreased significantly. The latency of these devices is also extremely low, bringing it within range of the overhead introduced by deduplication. The penalty of deduplication is just too big when using NVMe.

At Open vStorage we try to make the fastest possible distributed block storage solution. In order to keep the performance consistently fast, it is essential that the metadata fits completely in RAM. Every time we need to go to an SSD for metadata, the performance drops significantly. With deduplication enabled, the metadata size per LBA entry was 8 bits for the SCO and offset plus 128 bits for the hash. Hence by eliminating deduplication we can store 16 times more metadata in RAM. Or, in our case, we can address a storage pool which is 16 times bigger with the same performance as with deduplication enabled.

One final remark: Open vStorage still uses deduplication when a clone is made from a volume. The clone and its parent share the data up to the point at which the volume is cloned, and only the changes to the cloned volume are stored on the backend. This can easily and inexpensively be achieved with the 8-bit entries, as parent and clone share the same SCOs and offsets.

Performance Tuning

At Open vStorage we now have various large clusters which can easily deliver multiple millions of IOPS. For some customers it is even a matter of prestige to produce the highest number of IOPS on their Open vStorage dashboard. Out of the box Open vStorage will already give you very decent performance, but there are a few nuts and bolts you can tweak to increase the performance of your environment. There is no golden rule to increase the performance, but below we share some tips and pointers:

vDisk Settings
The most obvious way to influence the IO performance of a vDisk is by selecting the appropriate settings on the vDisk detail page. The impact of the DTL setting was already covered in a previous blog post so we will skip it here.
Deduplication also has an impact on the write IO performance of the vDisk. If you know the data isn’t suited for deduplication, don’t turn it on. As we have large read caches, we only enable the dedupe feature for OS disks.
Another setting we typically set at the vPool level is the SCO size. To increase the write performance you typically want to select a large Storage Container Object (SCO) size so as to minimize the overhead of creating and closing SCOs. Also, backends are typically very good at writing large chunks of sequential data, so a big SCO size makes sense. But as usual there is a trade-off. With traditional backends like Swift, Ceph or any other object store for that matter, you need to retrieve the whole SCO from the backend in case of a cache miss. A bigger SCO in that case means more read latency on a cache miss. This is one of the reasons why we designed our own backend, ALBA. With ALBA you can retrieve a part of a SCO from the backend: instead of getting a 64MiB SCO, we can get the exact 4k we need from it. ALBA is the only object storage that currently supports this functionality. In large clusters with ALBA as backend we typically set 64 MiB as SCO size. In case you don’t use ALBA, use a smaller SCO size.
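To put the partial-read advantage in perspective (simple arithmetic on the sizes mentioned above), a cache miss that has to fetch a full 64 MiB SCO moves 16384 times more data than fetching only the 4 KiB block that was actually requested:

# Data moved on a cache miss: a full 64 MiB SCO versus the requested 4 KiB block.
echo $(( 64 * 1024 * 1024 / 4096 ))   # -> 16384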

Optimize the Backends
One of the less obvious items which can make a huge difference in performance is the right preset. A preset consists of a set of policies, a compression method (optional) and whether encryption should be activated.
You might ask why tuning the backend would influence the performance on the front-end towards the VM. The performance of the backend will, for example, influence the read performance in case of a cache miss. On writes the backend might also become the bottleneck for incoming data. All writes go into the write buffer, which is typically sized to contain a couple of SCOs. This is fine as long as your backend is fast enough: once a SCO is full, it is ready to be saved on the backend and removed from the write buffer, making room for newly written data. If the backend is too slow to keep up with what comes out of the write buffer, Open vStorage will start throttling the ingest of data on the frontend. So it is essential to have a look at your backend performance in case it is the bottleneck for the write performance.
Since we typically set the SCO size to 64MiB and consider a fragment size of 4MiB a good size for fragments, we change the policy to have 16 data fragments. The other parameters depend on the desired reliability and the number of hosts used for storage.
Compression is typically turned on, but it gives distorted results when running your typical random IO benchmark, as random data is hard to compress. Data which is hard to compress will even grow in size and hence take more time to store. Basically, if you are running benchmarks with random IO it is best to turn compression off.
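A quick way to see why (purely illustrative commands; any file names will do): zero-filled data compresses to almost nothing, while the same amount of random data barely shrinks, and random data is exactly what a random IO benchmark writes:

# Compare compressibility of zeroed data versus random data of the same size.
dd if=/dev/zero of=/tmp/zeros.bin bs=1M count=64
dd if=/dev/urandom of=/tmp/random.bin bs=1M count=64
gzip -k /tmp/zeros.bin /tmp/random.bin
ls -lh /tmp/zeros.bin.gz /tmp/random.bin.gz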

In case you need help in tweaking the performance of your environment, feel free to contact us.

QEMU, Shared Memory and Open vStorage

QEMU, Shared Memory and Open vStorage: it sounds like the beginning of a bad joke but actually it is a very cool story. Open vStorage secretly released a Shared Memory Client/Server integration with the VolumeDriver (the component that offers the fast, distributed block layer) in its latest version. With this implementation the client (QEMU, Blktap, …) can write to a dedicated memory segment on the compute host which is shared with the Shared Memory Server in the Volume Driver. For the moment the Shared Memory client understands only block semantics, but in the future we will add file semantics so as to integrate an NFS server.

The benefits of the Shared Memory approach are very tangible:

  • As everything is in user-space, data copies from user to kernel space are eliminated so the IO performance is about 30-40% higher.
  • CPU consumption is about half for the same IO performance.
  • Easy way to build additional interfaces (e.g. block devices, iSCSI, …) on top.

We haven’t integrated our modified QEMU build with Libvirt so at the moment some manual tweaking is still required if you want to give it a go:

Download the volumedriver-dev packages

sudo apt-get install volumedriver-dev

By default the Shared Memory Server is disabled. To enable it, update the vPool json (/opt/OpenvStorage/config/storagedriver/storagedriver/vpool_name.json) and add an entry “fs_enable_shm_interface”: true under the filesystem section. After adding the entry, restart the Volume Driver for the vPool (restart ovs-volumedriver_vpool_name).
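Concretely, the relevant fragment of the vPool json ends up looking roughly like this (a sketch only; surrounding keys are omitted), after which the Volume Driver is restarted:

# In /opt/OpenvStorage/config/storagedriver/storagedriver/vpool_name.json,
# under the "filesystem" section (other keys omitted for brevity):
#   "filesystem": {
#       "fs_enable_shm_interface": true,
#       ...
#   }
restart ovs-volumedriver_vpool_name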
Next, build QEMU from the source. You can find the source here.

git clone https://github.com/openvstorage/qemu.git
cd qemu/
./configure
make
sudo make install

There are 2 ways to create a QEMU vDisk:
Use QEMU to create the disk:

qemu-img create openvstorage:volume 10G

Alternatively create the disk in FUSE and start a VM by using the Open vStorage block driver:

truncate -s 10G /mnt/<vpool_name>/volume
qemu -drive file=openvstorage:volume,if=virtio,cache=none,format=raw ...

Open vStorage 2.2 alpha 1

Today we released a first version of our upcoming 2.2 release. We will from now on do more frequent releases covering the latest changes, but keep in mind that these releases have not gone through our full QA cycle. Releasing these ‘alpha’ versions allows you, the Open vStorage community, to have earlier access to new features and bugfixes, but on the other hand these releases are less stable than our ‘beta’ releases. Documentation for these releases will also not always be available. In case you need help, the Open vStorage Google Group is there to help.

What is new in the 2.2 alpha 1:

  • Huge VMware performance improvement: we have reworked our NFS integration with VMware and have made significant performance improvements (5-10x faster). Please note this is still experimental and e.g. cloning from template on VMware does not work in this version. But by all means, give it a go and let us know your experience!
  • Status of the physical devices (SSDs and SATA drives) of a Storage Router are now shown in the GUI on the Storage Router Detail page. You can also see in detail which partitions are located on which device. In a later stage we plan to make the partitioning adjustable through the GUI.
  • We have improved the performance and reduced the CPU impact of the GUI.

Small feature improvements:

  • Added a check in the OVS setup which prevents rerunning the setup.
  • Cinder gets automatically configured if you configured OpenStack as Hypervisor Management Center.
  • ovs-snmp port is now configurable.
  • Option to add a password when a new user is created.
  • Rename of an OpenStack volume updates the vDisk name.
  • Added the possibility to install the Open vStorage Backend packages after configuring Open vStorage.
  • Improvements to the performance of the ASDs.
  • Option to remove an Open vStorage Backend.
  • Option to define the replication factor of an Open vStorage Backend.
  • Option to enable compression for a Storage Backend.
  • ASD nodes can now be collapsed in the Backend detail page.
  • Highlight the ASDs on which an action applies in the Backend details page.
  • Impact of removing an ASD is made clear.

Bugfixes:

  • Fixed the issue where ASDs are labeled as dead under high load.
  • Initializing a new disk (as replacement disk of a broken disk) fails.
  • Open vStorage port range 8870+ overlaps with c-api port 8876 causing n-api service on devstack to fail to restart with address already in use.
  • vDisk naming is now more consistent with reality.
  • Fix for multiple vPools using the same read cache path.
  • Hardening vPool creation.
  • Bugfixes for various issues with the Open vStorage Backend.
  • Fix for dmesg output not showing up in syslog or kern.log.
  • Failed to create an ASD if a filesystem exists on the disk.
  • Timestamps not being added in upstart logs.
  • Fix for ‘sync disk with reality’ sometimes failing.
  • Incorrect permission on ovs user’s .ssh folder causes login using authorized_keys to fail.
  • Fix for issues with rabbitmqctl during install.
  • ovs collect logs doesn’t collect all logs through the GUI.
  • Metadataserver quickly fills up root partition.

How do you install this version:
When installing, add the alpha repo instead of the beta repo.

echo "deb http://apt-ovs.cloudfounders.com alpha/" > /etc/apt/sources.list.d/ovsaptrepo.list

For people using OpenStack:
Before creating a vPool, add the OpenStack controller node as Hypervisor Management Center (Admin > Hypervisor Management Center) and select all hosts on the second part of the screen. When you create a vPool, Cinder will now be automatically deployed and configured. The nova and libvirtd changes as listed in the documentation still need to be applied to the compute hosts though.

Open vStorage 1.5

During the summer the Open vStorage Team has worked very hard. With this new release we can proudly present:

  • Feature complete OpenStack Cinder Plugin: our Cinder Plugin has been improved and meets the minimum feature set required to be certified by the OpenStack community.
  • Flexible cache layout: in 1.5.0 you have the capability to easily configure multiple SSD devices for the different caching purposes. During the setup you can choose which SSDs to partition, and later, when creating a vPool, you can select which caching device should be used for the read cache, the write cache and write cache protection. This means these can from now on be spread over different SSD devices or consolidated onto the same one, depending on the available hardware and needs.
  • User management: an admin can now create more users which have access to the GUI.
  • Framework performance: a lot of work has been put into improving the performance when a lot of vDisks and vMachines are created. Improvements of up to 50% have been reached in some cases.
  • Improved API security by means of implementing OAuth2 authentication. A rate-limit has also been imposed on API calls to prevent brute force attacks.

Fixed bugs and small items:

  • GUI now prevents creation of vPools with a capital letter.
  • Implemented recommendation for a security exploit on elasticsearch 1.1.1.
  • Fix for validation of vPools being stuck on validating.
  • Protection against reusing vPool names towards the same backend.
  • Fix for the Storage Router online/offline detection which failed when OpenStack was also installed.

Next, we also took the first step towards supporting operating systems other than Ubuntu (RedHat/CentOS). We have created rpm versions of our volumedriver and arakoon packages. These are tested on “Linux Centos7 3.10.0-123.el7.x86_64” and can be downloaded from our packages server. This completes a first important step towards getting Open vStorage RedHat/CentOS compatible.

Webscale 2.0

As a Product Manager I’m very often on challenging calls with potential users of Open vStorage and one of the questions that comes back on almost every call is:

How scalable is Open vStorage?

It is a question that is easy to answer: extremely scalable. Open vStorage is built from the ground up to support environments with 100+ hosts. It is designed to be used in large datacenters as the primary storage platform for all types of Virtual Machine workloads. I’m aware that the term scalable is a bit biased and can have different meanings. Did the enquirer mean storage capacity scalability or performance scalability? Well, Open vStorage scales both ways. For storage capacity, the scalability is mostly limited by the selected backend. For example, with Swift as the storage backend of a vPool, you can almost infinitely add disks or storage nodes to enlarge the storage pool. Swift is after all designed with massive scalability as its main development mantra, and it has shown this quality in the production environments of Disney and Rackspace, amongst many others.
Performance scalability is also not a problem. Adding more hosts running the Open vStorage software will linearly scale the performance. As each host has one or more SSDs or PCIe flash cards on board, the addition of every host to the Open vStorage environment increases the amount of data that can be stored in the cache.

Does that mean Open vStorage is webscale?

No, unlike other hyperconverged storage solutions, we are not webscale. We are webscale 2.0. The reason why we can call Open vStorage webscale 2.0 is that it decouples storage scalability from performance scalability. This allows for asymmetric architectures. It makes no sense to have to add more storage capacity in order to improve the performance of your storage solution. Open vStorage is the only solution which allows you to independently scale performance and capacity at a massive scale. Not only is Open vStorage tailored to the needs of large environments with petabytes of data and a battery of compute power, but it can also address the needs of a typical enterprise. Whether that typical enterprise has lots of data with limited compute power or vice versa, Open vStorage is up for the job.