Open vStorage opens up its API kimono

With the Fargo release Open vStorage opens up its API kimono. In earlier versions of Open vStorage the API was well hidden in the documentation section. As a result many of our integration partners had questions on how to use the API, what exactly was possible with it or, for example, which parameters were required to take a snapshot. It was clear to everyone that we had to give the API more of the spotlight.

Why an API?

An API is especially important because it dictates how developers at these integration partners can create new apps, websites and services on top of the Open vStorage storage solution. One hosting provider, for example, has built an OpenStack-like GUI for its KVM + Open vStorage cluster. They create vDisks on Open vStorage directly from their GUI, take snapshots and even scrub the vDisks on demand. They are consuming every aspect of our API. During this integration it became clear that keeping our API documentation up to date was a challenge. The idea grew to make the API self-describing and browsable.

Open API

APIs come in many forms but some standards are crystallizing. Open vStorage follows the OpenAPI Specification (OAI). This specification is supported by some of the big names in the IT industry such as Google, Microsoft, IBM and PayPal. It also means some great open-source tools can be leveraged, such as NSwag and Swagger UI. NSwag is a Swagger API toolchain for .NET, Web API and TypeScript (jQuery, AngularJS, Angular 2, Aurelia, KnockoutJS, and more). Swagger UI is a tool that dynamically generates beautiful documentation and a sandbox to experiment with the API, straight from the browser.

Browsable API

To explore the Open vStorage API, download Swagger UI, unzip the archive and serve the dist folder either from your file system or from a web server.

Next, enter https://[ip of the GUI]/api/swagger.json in the textbox and press enter.
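
If you prefer the command line, you can also fetch the same API description straight from a node. A minimal sketch, assuming the GUI is reachable on 10.100.1.1 (replace with your own IP) and uses a self-signed certificate, hence the -k flag; python is only used for pretty-printing:

curl -k https://10.100.1.1/api/swagger.json
# pretty-print the JSON in case python is available on the machine
curl -k -s https://10.100.1.1/api/swagger.json | python -m json.tool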

You can now browse through the API. As an example, you can verify which parameters are required to move a vDisk between Storage Routers.

One small but important remark: Swagger UI doesn’t support OAuth2 yet. This means you can browse the API but you can’t execute API requests, as these need to be authenticated.
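
Authenticated calls can still be made from the command line with a regular OAuth2 flow. The sketch below is only an illustration: the token endpoint, the grant type, the client credentials and the /api/vdisks/ URL are assumptions you should verify against the browsable API and your own setup:

# request an access token (endpoint, grant type and credentials are assumptions)
TOKEN=$(curl -k -s -X POST https://10.100.1.1/api/oauth2/token/ \
  -u "my_client_id:my_client_secret" -d "grant_type=client_credentials" \
  | python -c 'import json,sys; print(json.load(sys.stdin)["access_token"])')
# use the token to call an authenticated endpoint, e.g. listing the vDisks
curl -k -s -H "Authorization: Bearer $TOKEN" https://10.100.1.1/api/vdisks/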

Open vStorage Releases

Since Open vStorage is running in production at customers, we need to plan our releases carefully, as a small glitch might cause a disaster. For storage software there is a golden rule:

If it ain’t broke, don’t fix it!

With the release of Fargo RC1 we are entering a new cycle of intermediate releases and bugfixes. Once Fargo is GA we will push out a new update at regular intervals. Before installing an update, customers like to know what exactly is fixed in that update. That is why for each release, even an intermediate release, the release notes are documented. Let’s take the Fargo Release Candidate 1 as an example. This release consists of the following packages:

The content of each package, e.g. the webapps package, can be found in the appropriate repository (or you can click the link in the release notes). The release notes of a package contain a summary of all issues fixed in that exact package. In case you want to be kept up to date on new releases, add the release page as an RSS feed (https://github.com/openvstorage/home/releases.atom) to your favourite RSS feed reader. If you prefer to be kept up to date by email, you can use Sibbell, Blogtrottr or a similar service.

Moving block storage between datacenters: the Demo

Probably the coolest feature of the new Fargo release is the GeoScale capability: spreading data across multiple datacenters. With this feature Open vStorage can offer distributed block storage across multiple locations. In the demo below, storage is spread across 3 datacenters in la douce France (Roubaix, Strasbourg and Gravelines). The demo also explains how the storage is spread across these datacenters and shows the live migration of a running VM and its storage between 2 datacenters. The whole migration process completes within a few seconds. The GeoScale functionality can be compared with solving a Sudoku puzzle. The data gets chopped up into chunks which are distributed across all the nodes and datacenters in the cluster. As long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. The demo even covers surviving the loss of a complete datacenter.

GeoScale FAQ

Can I survive a datacenter outage?

Yes, in a GeoScale cluster, the data is spread over multiple datacenters and is available from each location. If one of these datacenters goes offline, the GeoScale cluster stays up and running and continues to serve data. Virtual Machines running in the datacenter that went down can be migrated to one of the other datacenters in just seconds without having to copy all of the data.

Won’t storing data across multiple datacenters be too slow for my database, VMs, …?

No. Open vStorage aggregates all flash (SSD, NVMe, PCIe) within each datacenter to form a datacenter-wide cache. Open vStorage uses these local cache pools to speed up incoming reads and writes.

How far can the datacenters be apart?

Open vStorage supports metroscale clusters where the datacenters are only a couple of miles apart, such as in the greater New York region, but even clusters where the datacenters are a couple of thousand miles apart are supported.

Support for Ubuntu 16.04

Last Friday, November 4th, the Open vStorage team released the first RC of the new Fargo version. We are really excited about Fargo as a lot of new features are being added to it. To name some of them:

  • Support for Ubuntu 16.04.
  • HA for the Edge which allows automatic failover in case the host running the VolumeDriver goes down.
  • Support for Arakoon as distributed config management.
  • 64TB volumes.

Earlier versions of Open vStorage supported Ubuntu 14.04. With the release of Ubuntu 16.04, which is an Ubuntu LTS version and hence will have updates and support for the next 5 years, it was essential for us to also update the Open vStorage software to work on Ubuntu 16.04.

Get started with Ubuntu 16.04:

Installing Open vStorage on Ubuntu 16.04 is almost as easy as installing on 14.04. One change is that the software packages are now signed. Signing the packages allows you, the installer of the packages, to verify that no modifications occurred after the packages were signed. The steps to get the latest packages are as simple as:

  • Download and install Ubuntu 16.04 on the host.
  • Add the Open vStorage repo to the host:
    echo "deb http://apt.openvstorage.com unstable main" > /etc/apt/sources.list.d/ovsaptrepo.list
  • Add the key:
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4EFFB1E7
  • Make sure the Open vStorage packages have a higher preference so our packages are installed:
    cat << EOF > /etc/apt/preferences
    Package: *
    Pin: origin apt.openvstorage.com
    Pin-Priority: 1000
    EOF
  • Run apt-get update to get the latest packages (a quick way to verify the pinning is shown below).
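
To double-check that the pin is effective, you can inspect the package priorities. A small sanity check, assuming the main package is called openvstorage (verify the name against the repository):

# the line for apt.openvstorage.com should show a priority of 1000
apt-cache policy openvstorage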

To install the Open vStorage software you can follow the normal flow as described here.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge: new nodes are joining the cluster, old nodes need to be replaced, vPools are created and removed, … . In Eugene and earlier versions we relied on simple config files which were located on each node. It should not come as a surprise that in large clusters it proved to be a challenge to keep the config files in sync. Sometimes a cluster-wide config parameter was updated while one of the nodes was being rebooted. As a consequence the update didn’t make it to that node and after the reboot it kept running with an old config.
For Fargo we decided to tackle this problem. The answer: Distributed Config Management.

Distributed Config Management

All config files are now stored in a distributed config management system. When a component starts, it retrieves the latest configuration settings from the management system. Let’s have a look at how this works in practice. Say, for example, a node is down and we remove the vPool from that node. As the vPool was shrunk, the config for that VolumeDriver is removed from the config management system. When the node restarts it will try to get the latest configuration settings for the vPool from the config management system. As there is no config for the removed vPool, the VolumeDriver will no longer serve the vPool. In a first phase we have added support for Arakoon, our beloved and in-house developed distributed key/value store, as the distributed config management system. As an alternative to Arakoon, ETCD has been incorporated, but do know that in our own deployments we always use Arakoon (hint).

How to change a config parameter:

Changing parameters in the config management system is very easy through the Open vStorage CLI:

  • ovs config list some: List all keys with the given prefix.
  • ovs config edit some-key: Edit that key in your configured editor. If the key doesn’t exist, it will get created.
  • ovs config get some-key: Print the content of the given key.

The distributed config management also contains a key for all scheduled tasks and jobs. To update the default schedule, edit the key /ovs/framework/scheduling/celery and plan the tasks by adding a crontab-style schedule.
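
As a quick illustration of the CLI, the sketch below lists the scheduling keys and opens the celery schedule in an editor. The key prefix comes from the paragraph above; the JSON layout and the task name in the comment are illustrative assumptions, not the authoritative format:

# list all keys under the scheduling prefix
ovs config list /ovs/framework/scheduling
# open the schedule in your configured editor and add a crontab-style entry,
# for example something along the lines of:
#   {"ovs.generic.execute_scrub": {"minute": "0", "hour": "3"}}
ovs config edit /ovs/framework/scheduling/celery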

Dedupe: The good, the bad and the ugly

Over the years a lot has been written about deduplication (dedupe) and storage. There are people who are dedupe aficionados and there are dedupe haters. At Open vStorage we take a pragmatic approach: we use deduplication when it makes sense. When the team behind Open vStorage designed a backup storage solution 15 years ago, we developed the first CAS (Content Addressed Storage) based backup technology. Using this deduplication technology, customers required 10 times less storage for typical backup processes. As said, we use deduplication when it makes sense and that is why we have decided to disable the deduplication feature in our latest Fargo release.

What is deduplication:

Deduplication is a technique for eliminating duplicate copies of data. This is done by identifying and fingerprinting unique chunks of data. In case a duplicate chunk of data is found, it is replaced by a reference or pointer to the first encountered chunk of data. As the pointer is typically much smaller than the actual chunk of data, the amount of storage space needed to store the complete set of data can hence be reduced.
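
To make this a bit more tangible, the toy sketch below chops a file into fixed-size chunks, fingerprints each chunk and counts how many are unique; only the unique chunks would need to be stored, duplicates become pointers. The file name is hypothetical and a real dedupe engine is of course far more sophisticated:

# chop a (small) test file into 4 KiB chunks and fingerprint each chunk
split -a 6 -b 4096 -d some_disk_image.raw /tmp/chunk_
total=$(ls /tmp/chunk_* | wc -l)
unique=$(sha256sum /tmp/chunk_* | awk '{print $1}' | sort -u | wc -l)
# the higher total/unique, the more you gain by deduplicating
echo "total chunks: $total, unique chunks: $unique"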

The Good, the Bad, the Ugly

The Good
Deduplication can be a real lifesaver in case you need to store a lot of data on a small device. The deduplication ratio, the amount of storage reduction, can be quite substantial in case there are many identical chunks of data (think the same OS) and the size of the chunks is a couple of orders of magnitude larger than the size of the pointer/fingerprint.

The Bad
Deduplication can be CPU intensive. It requires fingerprinting each chunk of data, and fingerprinting (calculating a hash) is an expensive CPU operation. This performance penalty introduces additional latency in the IO write path.

The Ugly
The bigger the chunks, the less likely they are to be duplicates, as even a single changed bit means the chunks are no longer identical. But the smaller the chunks, the smaller the ratio between the chunk size and the fingerprint. As a consequence, the memory footprint for storing the fingerprints can be large when a lot of data needs to be stored and the chunk size is small. Especially in large scale environments this is an issue, as the hash table in which the fingerprints are stored can become too big to fit in memory.
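
To put a rough, purely illustrative number on this: assume 100 TB of data, 4 KB chunks and a 16 byte (128 bit) fingerprint per chunk. That is roughly 25 billion chunks, and 25 billion × 16 bytes is already around 400 GB of fingerprints, before counting the overhead of the hash table structure itself. A table of that size no longer fits in the RAM of a typical storage node.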

Another issue is that the hash table might get corrupted, which basically means your whole storage system is corrupt: the data is still on disk, but you have lost the map of where every chunk is stored.

Block storage reality

It is obvious that deduplication only makes sense in case the data to be stored contains many duplicate chunks. Today’s applications already have deduplication built-in at the application level or generate blocks which can’t be deduped. Hence enabling deduplication introduces a performance penalty (additional IO latency, heavier CPU usage, …) without any significant space savings.

Deduplication also made sense when SSDs were small and expensive compared with traditional SATA drives. By using deduplication it was possible to store more data on the SSD while the penalty of the deduplication overhead was still small. With the latest generation of NVMe drives both arguments have disappeared. The size of NVMe drives is almost on par with SATA drives and the cost has decreased significantly. The latency of these devices is also extremely low, bringing them within range of the overhead introduced by deduplication. The penalty of deduplication is simply too big when using NVMe.

At Open vStorage we try to make the fastest possible distributed block storage solution. In order to keep the performance consistently fast it is essential that the metadata fits completely in RAM. Every time we need to go to an SSD for metadata, the performance drops significantly. With deduplication enabled, the metadata per LBA entry consisted of 8 bit for the SCO and offset plus 128 bit for the hash. By eliminating deduplication we can hence store 16 times more metadata in RAM. Or, in our case, we can address a storage pool which is 16 times bigger with the same performance as with deduplication enabled.

One final remark: Open vStorage still uses deduplication when a clone is made from a volume. The clone and its parent share the data up to the point at which the volume was cloned, and only the changes to the cloned volume are stored on the backend. This can be achieved easily and inexpensively with the 8 bit SCO and offset entries, as clone and parent share the same SCOs and offsets.

A healthier cluster begins with OPS: the Open vStorage Health Check

With more and more large Open vStorage clusters being deployed, the Open vStorage Operations (OPS) team is tasked with monitoring more and more servers. In the rare case there is an issue with a cluster, the OPS team wants to get a quick idea of how serious the problem is. That is why the Open vStorage OPS team added another project to the GitHub repo: openvstorage-health-check.

The Open vStorage health check is a quick diagnostic tool to verify whether all components on an Open vStorage node are working fine. It will, for example, check whether all services and Arakoon databases are up and running, whether Memcache, RabbitMQ and Celery are behaving, and whether presets and backends are still operational.

Note that the health check is only a diagnostic tool. Hence it will not take any action to repair the cluster.

Get Started:

To install the Open vStorage health check on a node, execute:

apt-get install openvstorage-health-check

Next, run the health check by executing:

ovs healthcheck

As always, this is work in progress so feel free to file a bug or a feature request for missing functionality. Pull Requests are welcome and will be accepted after careful review by the Open vStorage OPS team.

An example output of the Open vStorage health check:

root@perf-roub-04:~# ovs healthcheck
[INFO] Starting Open vStorage Health Check!
[INFO] ====================================
[INFO] Fetching LOCAL information of node:
[SUCCESS] Cluster ID: 3vvwuO9dd1S2sNIi
[SUCCESS] Hostname: perf-roub-04
[SUCCESS] Storagerouter ID: 6Y6uerfmfZaoZOCu
[SUCCESS] Storagerouter TYPE: EXTRA
[SUCCESS] Environment RELEASE: Fargo
[SUCCESS] Environment BRANCH: Unstable
[INFO] Checking LOCAL OVS services:
[SUCCESS] Service 'ovs-albaproxy_geo-accel-alba' is running!
[SUCCESS] Service 'ovs-workers' is running!
[SUCCESS] Service 'ovs-watcher-framework' is running!
[SUCCESS] Service 'ovs-dtl_local-flash-roub' is running!
[SUCCESS] Service 'ovs-dtl_local-hdd-roub' is running!

[INFO] Checking ALBA proxy 'albaproxy_local-flash-roub':
[SUCCESS] Namespace successfully created or already existed on proxy 'albaproxy_local-flash-roub' with preset 'default'!
[SUCCESS] Creation of a object in namespace 'ovs-healthcheck-ns-default' on proxy 'albaproxy_local-flash-roub' with preset 'default' succeeded!
[SUCCESS] Namespace successfully created or already existed on proxy 'albaproxy_local-flash-roub' with preset 'high'!
[SUCCESS] Creation of a object in namespace 'ovs-healthcheck-ns-high' on proxy 'albaproxy_local-flash-roub' with preset 'high' succeeded!
[SUCCESS] Namespace successfully created or already existed on proxy 'albaproxy_local-flash-roub' with preset 'low'!
[SUCCESS] Creation of a object in namespace 'ovs-healthcheck-ns-low' on proxy 'albaproxy_local-flash-roub' with preset 'low' succeeded!
[INFO] Checking the ALBA ASDs …
[SKIPPED] Skipping ASD check because this is a EXTRA node …
[INFO] Recap of Health Check!
[INFO] ======================
[SUCCESS] SUCCESS=154 FAILED=0 SKIPPED=20 WARNING=0 EXCEPTION=0
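
Because the recap line summarizes the counters, it is easy to hook the health check into your own monitoring. A small sketch, assuming the output format shown above and that the command prints to stdout; adapt the log location and alerting to your environment:

# run the check, keep the full report and alert when anything failed
ovs healthcheck | tee /tmp/ovs-healthcheck.log | grep -q "FAILED=0" \
  || echo "Open vStorage health check reported failures on $(hostname)"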

Fargo: the updated Open vStorage Architecture

With the Fargo release of Open vStorage we are focusing even more on the Open vStorage sweet spot: multi-petabyte, multi-datacenter storage clusters which offer super-fast block storage.
In order to achieve this we had to significantly change the architecture for the Fargo release. Eugene, the version before Fargo, already had the Shared Memory Server (SHM) in its code base but it wasn’t activated by default. The Fargo release now primarily uses the SHM approach. To make even more use of it, we created the Open vStorage Edge. The Edge is a lightweight block storage driver which can be installed on Linux servers (hosts running the hypervisor or inside the VM) and talks across the network to the Shared Memory of a remote Volume Driver. Both TCP/IP and the low latency RDMA protocol can be used to connect the Edge with the Volume Driver. Northbound the Edge has an iSCSI, Blktap and QEMU interface. Additional interfaces such as iSER and FCoE are planned. Next to the new Edge interface, the slower Virtual Machine interface, which exposes a Virtual File System (NFS, FUSE), is still supported.

Architecture

The Volume Driver has also been optimized for performance. The locks in the write path have been revised in order to minimize their impact. More radical is the decision to remove the deduplication functionality from the Volume Driver in order to keep the size of the metadata of the volumes to a strict minimum. By removing the bytes reserved for the hash, we are capable of keeping all the metadata in RAM and pushing the performance beyond 1 million IOPS per host on decent hardware. For those who absolutely need deduplication there is still a version of the Volume Driver available which supports deduplication.

With the breakthrough of RDMA, the network bottleneck is removed and network latency is brought down to a couple of microseconds. Open vStorage makes use of the possibilities RDMA offers to implement a shared cache layer. To achieve this it is now possible to create an ALBA backend out of NVMe or SSD devices. This layer acts as a local cache layer, within a single datacenter, in front of a SATA ALBA backend, the capacity tier, which is spread across multiple datacenters.
This means all SSDs in a single datacenter form a shared cache for the data of that datacenter. This minimizes the impact of an SSD failure and removes the cold cache effect when moving a volume between hosts. In order to minimize the impact of a single disk failure we introduced the NC-ECC (Network and Clustered Error Correction Codes) algorithm. This algorithm can be compared with solving a Sudoku puzzle. Each SCO, a collection of consecutive writes, is chopped up into chunks. All these chunks are distributed across all the nodes and datacenters in the cluster. The total number of chunks can be configured, allowing you for example to recover from a multi-node failure or even a complete datacenter loss. A failure, whether it is a disk, a node or a datacenter, will cross out some numbers from the complete Sudoku puzzle, but as long as you have enough numbers left, you can still solve the puzzle. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. The NC-ECC algorithm is based on forward error correction codes and is further optimized for usage within a multi-datacenter approach. When there is a disk or node failure, additional chunks will be created using only data from within the same datacenter. This ensures the bandwidth between datacenters isn’t stressed in case of a simple disk failure.
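
A purely illustrative example of the mechanics (the actual chunk counts are configurable per backend and these numbers are assumptions): say every SCO is split into 6 data chunks and 3 parity chunks, spread evenly over 3 datacenters, so 3 chunks per datacenter. Any 6 of the 9 chunks are enough to reconstruct the SCO. Losing a whole datacenter removes 3 chunks, leaving exactly 6, so the data stays available. Losing a single disk typically removes just 1 chunk, and the repair can be done with chunks from within the same datacenter, as described above.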

By splitting up the Edge, the Volume Driver, the cache layer and the capacity tier, you have the ultimate flexibility to build the storage cluster that fits your needs. You can run everything on the same server, hyperconverged, or you can install each component on a dedicated server to maximize scalability and performance.

The first alpha version of Fargo is now available on the repo.

Domains and Recovery Domains

In the Fargo release we introduced a new concept: Domains. In this blog post you can find a description of what exactly Domains are and why and how you should configure them.

A Domain is a logical grouping of Storage Routers. You can compare a Domain to an availability zone in OpenStack or a region in AWS. A Domain typically groups Storage Routers which can fail for a common reason, e.g. because they are on the same power feed or within the same datacenter.

Open vStorage can survive a node failure without any data loss for the VMs on that node. Even data in the write buffer which isn’t on the backend yet is safeguarded on another node by the Distributed Transaction Log (DTL). The key element in having no data loss is that the node running the volume and the node running the DTL should not be down at the same time. To limit the risk of both being down at the same time, you should make sure the DTL is on a node which is not in the same rack or on the same power feed. Open vStorage can of course not detect which servers are in the same rack, so it is up to the user to define different Domains and assign Storage Routers to them.

As a first step, create the different Domains in the Administration section (Administration > Domains). You are free to select how you want to group the Storage Routers. A few possible examples are per rack, per power feed or even per datacenter. In the example below we have grouped the Storage Routers per datacenter.

Next, go to the detail page of each Storage Router and click the edit button.

Select the Domain where the actual volumes are hosted, and optionally select a Recovery Domain. In case the Recovery Domain is empty, the DTL will be located in the Domain of the Storage Router. In case a Recovery Domain is selected, it will host the DTL for the volumes being served by that Storage Router. Note that you can only assign a Domain as Recovery Domain if at least a single Storage Router is using it as Domain. To make sure that the latency of the DTL doesn’t become a bottleneck for the write IO, it is strongly advised to have a low latency network between the Storage Routers in the Domain and the Recovery Domain.

Another area where Domains play a role is the location of the MetaDataServer (MDS). The master and a slave MDS will always be located in the Domain of the Storage Router.
In case you configure a Recovery Domain, an MDS slave will also be located on one of the hosts of the Recovery Domain. This additional slave makes sure only a limited metadata rebuild is necessary to bring the volume live again.