The different ALBA Backends explained

With the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the GeoScale functionality. In a GeoScale cluster, the data is spread over multiple datacenters. If one of the datacenters goes offline, the GeoScale cluster stays up and running and continues to serve data.

The GeoScale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collection of physical disks, devices or even other backends. Next to grouping disks or backends, it also defines how data is stored on its constituents: parameters such as the erasure coding/replication factor, compression and encryption need to be defined. Ordinarily a GeoScale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, only had 1 type of backend, there are now 2 types: a local and a global backend.

  • A local backend allows you to group physical devices. This type is typically used to group disks within the same datacenter.
  • A global backend allows you to combine multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters.
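
To make the distinction concrete, below is a minimal sketch (hypothetical names and plain Python structures, not the actual ALBA data model) of how local backends group devices and a global backend groups local backends:

# Conceptual sketch only: hypothetical names, not the actual ALBA data model.

# A local backend groups physical devices within one datacenter and
# carries its own storage policy (erasure coding/replication, compression, ...).
local_hdd_dc1 = {
    "name": "hdd-dc1",
    "devices": ["dc1-node1-hdd0", "dc1-node2-hdd0", "dc1-node3-hdd0"],
    "policy": {"protection": "erasure coding", "compression": "on"},
}
local_hdd_dc2 = {
    "name": "hdd-dc2",
    "devices": ["dc2-node1-hdd0", "dc2-node2-hdd0", "dc2-node3-hdd0"],
    "policy": {"protection": "erasure coding", "compression": "on"},
}

# A global backend groups (local) backends instead of devices and
# typically spans multiple datacenters.
global_capacity = {
    "name": "capacity-geo",
    "backends": [local_hdd_dc1["name"], local_hdd_dc2["name"]],
    "policy": {"protection": "spread across datacenters", "compression": "on"},
}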

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation happens based upon the performance of the devices in the datacenter. An SSD backend will be created with devices which are fast and low latency, and an HDD backend will be created with slow(er) devices which are optimised for capacity. In some cases the SSD or HDD backend will be split into more backends if they contain many devices, for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend.
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. Of course this doesn’t make sense for an SSD backend, as the latency between the datacenters will be a performance limiting factor.
Another reason to create multiple backends is to offer each customer their own set of physical disks for security or compliance reasons. In that case a backend is created per customer.

vPool

A vPool is a configuration template for vDisks, the volumes being served by Open vStorage. This template contains a whole range of parameters such as the blocksize to be used, the SCO size on the backend, the default write buffer size, the preset to be used for data protection, the hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. The last two are particularly interesting as they express how different ALBA backends are tied together. When you create a vPool you select a backend to store the volume data. This can be a local backend, SSD for an all-flash experience, or a global backend in case you want to spread data over multiple datacenters. This backend is used by every Storage Router serving the vPool. If you use a global backend across multiple datacenters, you will want to use some sort of caching in the local datacenter where the volume is running, in order to keep the read latency as low as possible. To achieve this, assign a local SSD backend when extending a vPool to a certain Storage Router. All volumes being served by that Storage Router will, on a read, first check if the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use a different cache backend. This approach allows hot data to be kept in the local SSD cache while cold data is stored on the capacity backend which is distributed across datacenters. By using this approach Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.
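
As an illustration, here is a hedged sketch of the kind of parameters a vPool template bundles; the field names are hypothetical and do not reflect the actual Open vStorage configuration keys:

# Hypothetical field names, for illustration only.
vpool_template = {
    "name": "vpool-geo",
    "block_size": 4096,                 # blocksize used by the vDisks
    "sco_size": 64 * 1024 * 1024,       # SCO size on the backend
    "write_buffer": 512 * 1024 * 1024,  # default write buffer size
    "preset": "geo-preset",             # data protection preset
    "backend": "capacity-geo",          # backend storing the volume data
    # A different cache backend can be assigned per Storage Router, so each
    # datacenter caches reads on its own local SSD backend.
    "cache_backend_per_storage_router": {
        "storagerouter-dc1": "ssd-dc1",
        "storagerouter-dc2": "ssd-dc2",
    },
}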

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, a grouping of vDisks which share the same config, are the glue between these different backends.

Edge: HA, failure and the moving of volumes explained

Open vStorage is designed to be rock solid and survive failures. These failures can come in many forms and shapes: nodes might die, network connections might get interrupted, … Let’s give an overview of the different tactics Open vStorage uses when disaster strikes by going over some possible use cases where the new edge plays a role.

Use case 1: A hypervisor fails

In case the hypervisor fails, the hypervisor management (OpenStack, vCenter, …) will detect the failure and restart the VM on another hypervisor. Since the VM is started on another hypervisor, the VM will talk to the edge client on the new hypervisor. The edge client will connect to a volume driver in the vPool and enquire which volume driver owns the disks of the VM. The volume driver responds with the owner and the edge connects to the volume driver owning the volume. This all happens almost instantaneously and in the background, so the IO of the VM isn’t affected.

Use case 2: A Storage Router fails

In case a Storage Router and hence the volume driver on it die, the edge client automatically detects that the connection to the volume driver is lost. Luckily the edge keeps a list of volume drivers which also serve the vPool and it connects to one of the remaining volume drivers in the vPool. It is clear that the edge prefers to fail over to a volume driver which is close by, e.g. within the same datacenter. The new volume driver to which the edge connects detects that it isn’t the owner of the volume. As the old volume driver is no longer online, the new volume driver steals the ownership of the VM’s volume. Stealing is allowed in this case as the old volume driver is down. Once the new volume driver becomes the owner of the volume, the edge client can start serving IO. This whole process happens in the background and halts the IO of the VM for only a fraction of a second.
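
The fail-over behaviour described above can be sketched roughly as follows; this is conceptual Python with made-up names, not the actual edge client code:

# Conceptual sketch with made-up names, not the real edge client.

class VolumeDriver:
    """Toy stand-in for a volume driver serving a vPool."""
    def __init__(self, name, alive=True):
        self.name, self.alive, self.owned = name, alive, set()

    def owns(self, volume_id):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        return volume_id in self.owned

    def take_ownership(self, volume_id):
        # Steal the volume: allowed because the previous owner is unreachable.
        self.owned.add(volume_id)


class EdgeClient:
    def __init__(self, volume_id, volume_drivers):
        # The edge keeps a list of volume drivers that also serve the vPool,
        # preferably ordered so close-by (same datacenter) drivers come first.
        self.volume_id, self.volume_drivers = volume_id, volume_drivers

    def connect(self):
        for driver in self.volume_drivers:
            try:
                if not driver.owns(self.volume_id):
                    driver.take_ownership(self.volume_id)
                return driver                 # serve IO through this driver
            except ConnectionError:
                continue                      # driver down, try the next one
        raise RuntimeError("no volume driver reachable for this vPool")


drivers = [VolumeDriver("dc1-sr1", alive=False), VolumeDriver("dc1-sr2")]
drivers[0].owned.add("vm-disk-1")               # old owner, now down
edge = EdgeClient("vm-disk-1", drivers)
print("IO now served by", edge.connect().name)  # fails over to dc1-sr2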

Use case 3: Network issues

In some exceptional cases it isn’t the hypervisor or the Storage Router that fails but the network in between. This is an administrator’s worst nightmare as it might lead to split brain scenarios. Even in this case the edge is able to outlive the disaster. As the network connection between the edge and the volume driver is lost, the edge will assume the volume driver is dead. Hence, as in use case 2, the edge connects to another volume driver in the same vPool. The new volume driver first tries to contact the old volume driver.

Now there are 2 options:

  • The new volume driver can contact the old volume driver. After some IO is exchanged, the new volume driver asks the old volume driver to hand over the volume. This handover doesn’t impact the edge.
  • The new volume driver cannot contact the old volume driver. In that case the new volume driver steals the volume from the old one. It does this by updating the ownership of the volume in the distributed DB and by uploading a new key to the backend. As the ALBA backend uses a conditional write approach, writing IO to the disks of the backend only if the accompanying key is valid, it can ensure that only the new volume driver is allowed to write to the backend. If the old volume driver would still be online (split brain) and try to update the backend, the write would fail as it is using an outdated key.
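
The conditional-write idea can be illustrated with a simple check on an ownership key; this is a toy sketch with invented names, not the real ALBA protocol:

# Toy model of conditional writes guarded by an ownership key (invented names).

class Backend:
    def __init__(self):
        self.owner_key = None   # key of the volume driver allowed to write
        self.data = {}

    def update_owner_key(self, new_key):
        # Called when a volume driver steals a volume: from now on only
        # writes carrying new_key are accepted.
        self.owner_key = new_key

    def conditional_write(self, key, offset, payload):
        if key != self.owner_key:
            raise PermissionError("outdated ownership key, write rejected")
        self.data[offset] = payload


backend = Backend()
backend.update_owner_key("driver-A-key-1")
backend.conditional_write("driver-A-key-1", 0, b"hello")   # accepted

# A new volume driver steals the volume and uploads a new key.
backend.update_owner_key("driver-B-key-2")

# The old volume driver (split brain) still tries to write with its old key.
try:
    backend.conditional_write("driver-A-key-1", 4096, b"late write")
except PermissionError:
    print("old volume driver is fenced off, the write failed")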

Open vStorage opens up its API kimono

With the Fargo release Open vStorage opens up its API kimono. In earlier versions of Open vStorage the API was something that was well hidden in the documentation section. As a result many of our integration partners had questions on how to use the API, what exactly was possible with it, or for example what the required parameters were to take a snapshot. It was clear to everyone that we had to give the API some more spotlight.

Why an API?

An API is especially important because it dictates how the developers at these integration partners can create new apps, websites and services on top of the Open vStorage storage solution. A hosting provider has for example built an OpenStack-like GUI for its KVM + Open vStorage cluster. They create vDisks on Open vStorage directly from their GUI, take snapshots and even scrub the vDisks on demand. They are consuming every aspect of our API. During this integration it became clear that keeping our API documentation up to date was a challenge. The idea grew to make the API self-describing and browsable.

Open API

APIs come in many forms but some standards are crystallizing. Open vStorage follows the Open API specification (OAI). This specification is supported by some of the big names in the IT industry such as Google, Microsoft, IBM and PayPal. It also means some great open-source tools can be leveraged such as NSwag and Swagger UI. NSwag is a Swagger API toolchain for .NET, Web API and TypeScript (jQuery, AngularJS, Angular 2, Aurelia, KnockoutJS, and more). Swagger UI is a tool that dynamically generates beautiful documentation and a sandbox to play with straight from the browser.

Browsable API

To explore the Open vStorage API, download Swagger UI, unzip the archive and serve the dist folder from either your file system or a web server.

Next, enter https://[ip of the GUI]/api/swagger.json in the textbox and press enter.


You can now browse through the API. As an example you can verify which parameters are required to move a vDisk between Storage Routers.
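
If you prefer to inspect the specification programmatically rather than through Swagger UI, something along these lines works too (a sketch using the Python requests library; replace the IP with that of your GUI node, and the self-signed certificate is an assumption):

import requests

# Fetch the Open API (Swagger) specification from the Open vStorage GUI.
GUI_IP = "10.100.1.1"   # replace with the IP of the GUI
spec = requests.get("https://" + GUI_IP + "/api/swagger.json",
                    verify=False).json()   # assuming a self-signed certificate

# Print every documented endpoint with its HTTP methods.
for path, methods in sorted(spec.get("paths", {}).items()):
    print(path, ", ".join(method.upper() for method in methods))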


One small but important remark: currently Swagger UI doesn’t support OAuth2 yet. This means you can browse the API but you can’t execute API requests, as these need to be authenticated.

Moving block storage between datacenters: the Demo

Probably the coolest feature of the new Fargo release is the GeoScale capability: spreading data across multiple datacenters. With this feature Open vStorage can offer distributed block storage across multiple locations. In the demo below, storage is spread across 3 datacenters in la douce France (Roubaix, Strasbourg and Gravelines). The demo also explains how the storage is spread across these datacenters and shows the live migration of a running VM and its storage between 2 datacenters. The whole migration process completes within a few seconds. The GeoScale functionality can be compared with solving a Sudoku puzzle. The data gets chopped up in chunks which are distributed across all the nodes and datacenters in the cluster. As long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. In the demo even the loss of a complete datacenter is survived.

GeoScale FAQ

Can I survive a datacenter outage?

Yes, in a GeoScale cluster, the data is spread over multiple datacenters and is available from each location. If one of these datacenters goes offline, the GeoScale cluster stays up and running and continues to serve data. Virtual Machines running in the datacenter that went down can be migrated to one of the other datacenters in just seconds without having to copy all of the data.

Will storing data across multiple datacenters not be too slow for my database, VMs, … ?

No. Open vStorage aggregates all flash (SSD, NVMe, PCIe) within each datacenter into a shared cache pool and uses these local cache pools to speed up incoming reads and writes.

How far can the datacenters be apart?

Open vStorage supports metroscale clusters where the datacenters are only a couple of miles apart, such as the greater New York region, but even clusters where the datacenters are a couple of thousand miles apart are supported.

Support for Ubuntu 16.04

Last Friday, November 4th, the Open vStorage team released the first RC of the new Fargo version. We are really excited about Fargo as there are a lot of new features being added to it. To name a few of them:

  • Support for Ubuntu 16.04.
  • HA for the Edge which allows automatic failover in case the host running the VolumeDriver goes down.
  • Support for Arakoon as distributed config management.
  • 64TB volumes.

Earlier versions of Open vStorage supported Ubuntu 14.04. With the release of Ubuntu 16.04, which is an Ubuntu LTS version and hence will have updates and support for the next 5 years, it was essential for us to also update the Open vStorage software to work on Ubuntu 16.04.

Get started with Ubuntu 16.04:

Installing Open vStorage on Ubuntu 16.04 is almost as easy as installing on 14.04. One change is that the software packages are now signed. Signing the packages allows you, the installer of the packages, to verify that no modifications occurred after the packages were signed. The steps to get the latest packages are as simple as:

  • Download and install Ubuntu 16.04 on the host.
  • Add the Open vStorage repo to the host:
    echo "deb http://apt.openvstorage.com unstable main" > /etc/apt/sources.list.d/ovsaptrepo.list
  • Add the key:
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4EFFB1E7
  • Make sure the Open vStorage packages have a higher preference so our packages are installed:
    cat << EOF > /etc/apt/preferences
    Package: *
    Pin: origin apt.openvstorage.com
    Pin-Priority: 1000
    EOF
  • Run apt-get update to get the latest packages

To install the Open vStorage software you can follow the normal flow as described here.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge: new nodes are joining the cluster, old nodes need to be replaced, vPools are created and removed, … In Eugene and earlier versions we relied on simple config files which were located on each node. It should not come as a surprise that in large clusters it proved to be a challenge to keep the config files in sync. Sometimes a cluster-wide config parameter was updated while one of the nodes was being rebooted. As a consequence, the update didn’t make it to that node and after the reboot it kept running with an old config.
For Fargo we decided to tackle this problem. The answer: Distributed Config Management.

Distributed Config Management

All config files are now stored in a distributed config management system. When a component starts, it retrieves the latest configuration settings from the management system. Let’s have a look at how this works in practice. For example, a node is down and we remove the vPool from that node. As the vPool was shrunk, the config for that VolumeDriver is removed from the config management system. When the node restarts it will try to get the latest configuration settings for the vPool from the config management system. As there is no config for the removed vPool, the VolumeDriver will no longer serve the vPool. In a first phase we have added support for Arakoon, our beloved and in-house developed distributed key/value store, as the distributed config management system. As an alternative to Arakoon, ETCD has been incorporated, but do know that in our own deployments we always use Arakoon (hint).
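
Conceptually, a component’s startup then looks something like the sketch below; the client and key layout are hypothetical and only meant to illustrate the flow:

# Hypothetical config client and key layout, for illustration only.

class ConfigStore:
    """Stand-in for the distributed config management system (Arakoon/ETCD)."""
    def __init__(self, entries):
        self.entries = entries

    def get(self, key):
        return self.entries.get(key)    # None when the key was removed


def start_volumedriver(store, vpool_name):
    # On startup the component fetches its latest settings from the config store.
    config = store.get("/ovs/vpools/" + vpool_name + "/volumedriver")
    if config is None:
        # The vPool was removed from this node: nothing to serve.
        print("no config for vPool '" + vpool_name + "', not serving it")
        return None
    print("serving vPool '" + vpool_name + "' with config", config)
    return config


store = ConfigStore({"/ovs/vpools/vpool-geo/volumedriver": {"sco_size": "64MiB"}})
start_volumedriver(store, "vpool-geo")   # vPool still configured, gets served
start_volumedriver(store, "old-vpool")   # vPool was removed, nothing served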

How to change a config parameter:

Changing parameters in the config management system is very easy through the Open vStorage CLI:

  • ovs config list some: List all keys with the given prefix.
  • ovs config edit some-key: Edit that key in your configured editor. If the key doesn’t exist, it will get created.
  • ovs config get some-key: Print the content of the given key.

The distributed config management also contains a key for all scheduled tasks and jobs. To update the default schedule, edit the key /ovs/framework/scheduling/celery and plan the tasks by adding a crontab-style schedule.

Dedupe: The good the bad and the ugly

Over the years a lot has been written about deduplication (dedupe) and storage. There are dedupe aficionados and there are dedupe haters. At Open vStorage we take a pragmatic approach: we use deduplication when it makes sense. When the team behind Open vStorage designed a backup storage solution 15 years ago, we developed the first CAS (Content Addressed Storage) based backup technology. Using this deduplication technology, customers required 10 times less storage for typical backup processes. As said, we use deduplication when it makes sense, and that is why we have decided to disable the deduplication feature in our latest Fargo release.

What is deduplication:

Deduplication is a technique for eliminating duplicate copies of data. This is done by identifying and fingerprinting unique chunks of data. In case a duplicate chunk of data is found, it is replaced by a reference or pointer to the first encountered chunk of data. As the pointer is typically smaller than the actual chunk of data, the amount of storage space needed to store the complete set of data can be reduced.
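
In its simplest form the mechanism looks like the sketch below, using fixed-size chunks and SHA-1 fingerprints (the chunk size and hash choice are just assumptions for the example):

import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks, 4 KiB in this example

def dedupe(data):
    """Store unique chunks once; duplicates become pointers to the first copy."""
    store = {}      # fingerprint -> chunk data
    pointers = []   # the original stream as a list of fingerprints
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        if fingerprint not in store:         # first time we see this chunk
            store[fingerprint] = chunk
        pointers.append(fingerprint)         # duplicate -> just a pointer
    return store, pointers

store, pointers = dedupe(b"A" * 16384 + b"B" * 4096)
print(len(pointers), "chunks written,", len(store), "chunks actually stored")
# 5 chunks written, 2 chunks actually stored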

The Good, the Bad, the Ugly

The Good
Deduplication can be a real lifesaver in case you need to store a lot of data on a small device. The deduplication ratio, the amount of storage reduction, can be quite substantial in case there are many identical chunks of data (think the same OS) and if the size of the chunks is a couple of orders of magnitude larger than the size of the pointer/fingerprint.

The Bad
Deduplication can be CPU intensive. It requires fingerprinting each chunk of data, and fingerprinting (calculating a hash) is an expensive CPU operation. This performance penalty introduces additional latency in the IO write path.

The Ugly
The bigger the size of the chunk, the less likely chunks will be duplicates, as even the smallest change of a bit makes sure the chunks are no longer identical. But the smaller the chunks, the smaller the ratio between the chunk size and the fingerprint. As a consequence, the memory footprint for storing the fingerprints can be large in case a lot of data needs to be stored and the chunk size is small. Especially in large scale environments this is an issue, as the hash table in which the fingerprints are stored can be too big to fit in memory.
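
A quick back-of-the-envelope calculation shows the problem; the numbers (4 KiB chunks, a 20-byte fingerprint plus an 8-byte location per entry, 100 TiB of data) are assumptions for the example, not Open vStorage figures:

# Back-of-the-envelope RAM footprint of the fingerprint (hash) table.
stored_data = 100 * 2**40   # 100 TiB of unique data
chunk_size = 4 * 2**10      # 4 KiB chunks
entry_size = 20 + 8         # 20-byte SHA-1 fingerprint + 8-byte location

entries = stored_data // chunk_size
table_size_gib = entries * entry_size / 2**30
print(round(table_size_gib), "GiB of RAM just for the hash table")
# 700 GiB: far too much to keep in memory on a single node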

Another issue is the fact that the hash table might get corrupt, which basically means your whole storage system is corrupt: the data is still on disk but you have lost the map of where every chunk is stored.

Block storage reality

It is obvious that deduplication only makes sense in case the data to be stored contains many duplicate chunks. Today’s applications already have deduplication built-in at the application level or generate blocks which can’t be deduped. Hence enabling deduplication introduces a performance penalty (additional IO latency, heavier CPU usage, …) without any significant space savings.

Deduplication also made sense when SSDs were small in size and expensive compared with traditional SATA drives. By using deduplication it was possible to store more data on the SSD while the penalty of the deduplication overhead was still small. With the latest generation of NVMe drives both arguments have disappeared. The size of NVMe drives is almost on par with SATA drives and the cost has decreased significantly. The latency of these devices is also extremely low, bringing them within range of the overhead introduced by deduplication. The penalty of deduplication is just too big when using NVMe.

At Open vStorage we try to make the fastest possible distributed block storage solution. In order to keep the performance consistently fast it is essential that the metadata fits completely in RAM. Every time we need to go to an SSD for metadata, the performance drops significantly. With deduplication enabled, the metadata size per LBA entry was 8 bits for the SCO and offset plus 128 bits for the hash. Hence, by eliminating deduplication we can store 16 times more metadata in RAM. Or in our case, we can address a storage pool which is 16 times bigger with the same performance as with deduplication enabled.

One final remark: Open vStorage still uses deduplication when a clone is made from a volume. The clone and its parent share the data up to the point at which the volume was cloned, and only the changes to the cloned volume are stored on the backend. This can easily and inexpensively be achieved with the 8 bits of SCO and offset metadata, as the clone and its parent share the same SCOs and offsets.

A healthier cluster begins with OPS: the Open vStorage Health Check

With more and more large Open vStorage clusters being deployed, the Open vStorage Operations (OPS) team is tasked with monitoring more servers. In the rare case there is an issue with a cluster, the OPS team wants to get a quick idea of how serious the problem is. That is why the Open vStorage OPS team added another project to the GitHub repo: openvstorage-health-check.

The Open vStorage health check is a quick diagnostic tool to verify whether all components on an Open vStorage node are working fine. It will for example check if all services and Arakoon databases are up and running, whether Memcache, RabbitMQ and Celery are behaving, and if presets and backends are still operational.

Note that the health check is only a diagnostic tool. Hence it will not take any action to repair the cluster.

Get Started:

To install the Open vStorage health check on a node, execute:

apt-get install openvstorage-health-check

Next, run the health check by executing

ovs healthcheck

As always, this is work in progress so feel free to file a bug or a feature request for missing functionality. Pull Requests are welcome and will be accepted after careful review by the Open vStorage OPS team.

An example output of the Open vStorage health check:

root@perf-roub-04:~# ovs healthcheck
[INFO] Starting Open vStorage Health Check!
[INFO] ====================================
[INFO] Fetching LOCAL information of node:
[SUCCESS] Cluster ID: 3vvwuO9dd1S2sNIi
[SUCCESS] Hostname: perf-roub-04
[SUCCESS] Storagerouter ID: 6Y6uerfmfZaoZOCu
[SUCCESS] Storagerouter TYPE: EXTRA
[SUCCESS] Environment RELEASE: Fargo
[SUCCESS] Environment BRANCH: Unstable
[INFO] Checking LOCAL OVS services:
[SUCCESS] Service ‘ovs-albaproxy_geo-accel-alba’ is running!
[SUCCESS] Service ‘ovs-workers’ is running!
[SUCCESS] Service ‘ovs-watcher-framework’ is running!
[SUCCESS] Service ‘ovs-dtl_local-flash-roub’ is running!
[SUCCESS] Service ‘ovs-dtl_local-hdd-roub’ is running!

[INFO] Checking ALBA proxy ‘albaproxy_local-flash-roub’:
[SUCCESS] Namespace successfully created or already existed on proxy ‘albaproxy_local-flash-roub’ with preset ‘default’!
[SUCCESS] Creation of a object in namespace ‘ovs-healthcheck-ns-default’ on proxy ‘albaproxy_local-flash-roub’ with preset ‘default’ succeeded!
[SUCCESS] Namespace successfully created or already existed on proxy ‘albaproxy_local-flash-roub’ with preset ‘high’!
[SUCCESS] Creation of a object in namespace ‘ovs-healthcheck-ns-high’ on proxy ‘albaproxy_local-flash-roub’ with preset ‘high’ succeeded!
[SUCCESS] Namespace successfully created or already existed on proxy ‘albaproxy_local-flash-roub’ with preset ‘low’!
[SUCCESS] Creation of a object in namespace ‘ovs-healthcheck-ns-low’ on proxy ‘albaproxy_local-flash-roub’ with preset ‘low’ succeeded!
[INFO] Checking the ALBA ASDs …
[SKIPPED] Skipping ASD check because this is a EXTRA node …
[INFO] Recap of Health Check!
[INFO] ======================
[SUCCESS] SUCCESS=154 FAILED=0 SKIPPED=20 WARNING=0 EXCEPTION=0

Fargo: the updated Open vStorage Architecture

With the Fargo release of Open vStorage we are focussing even more on the Open vStorage sweet spot: multi-petabyte, multi-datacenter storage clusters which offer super-fast block storage.
In order to achieve this we had to significantly change the architecture for the Fargo release. Eugene, the version before Fargo, already had the Shared Memory Server (SHM) in its code base but it wasn’t activated by default. The Fargo release now primarily uses the SHM approach. To make even more use of it, we created the Open vStorage Edge. The Edge is a lightweight block storage driver which can be installed on Linux servers (hosts running the hypervisor or inside the VM) and talks across the network to the Shared Memory of a remote Volume Driver. Both TCP/IP and the low latency RDMA protocol can be used to connect the Edge with the Volume Driver. Northbound the Edge has iSCSI, Blktap and QEMU interfaces. Additional interfaces such as iSER and FCoE are planned. Next to the new Edge interface, the slower Virtual Machine interface, which exposes a Virtual File System (NFS, FUSE), is still supported.

Architecture

The Volume Driver has also been optimized for performance. The locks in the write path have been revised in order to minimize their impact. More radical is the decision to remove the deduplication functionality from the Volume Driver in order to keep the size of the metadata of the volumes to a strict minimum. By removing the bytes reserved for the hash, we are capable of keeping all the metadata in RAM and pushing the performance beyond 1 million IOPS per host on decent hardware. For those who absolutely need deduplication there is still a version of the Volume Driver available which has support for deduplication.

With the breakthrough of RDMA, the network bottleneck is removed and network latency is brought down to a couple of microseconds. Open vStorage makes use of the possibilities RDMA offers to implement a shared cache layer. To achieve this it is now possible to create an ALBA backend out of NVMe or SSD devices. This layer acts as a local, within a single datacenter, cache layer in front of a SATA ALBA backend, the capacity tier, which is spread across multiple datacenters.
This means all SSDs in a single datacenter form a shared cache for the data of that datacenter. This minimizes the impact of an SSD failure and removes the cold cache effect when moving a volume between hosts. In order to minimize the impact of a single disk failure we introduced the NC-ECC (Network and Clustered Error Correction Codes) algorithm. This algorithm can be compared with solving a Sudoku puzzle. Each SCO, a collection of consecutive writes, is chopped up in chunks. All these chunks are distributed across all the nodes and datacenters in the cluster. The total amount of chunks can be configured and allows, for example, recovering from a multi-node failure or a complete datacenter loss. A failure, whether it is a disk, node or datacenter, will cross out some numbers from the complete Sudoku puzzle, but as long as you have enough numbers left, you can still solve the puzzle. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. The NC-ECC algorithm is based on forward error correction codes and is further optimized for usage within a multi-datacenter approach. When there is a disk or node failure, additional chunks will be created using only data from within the same datacenter. This ensures the bandwidth between datacenters isn’t stressed in case of a simple disk failure.
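
To make the Sudoku analogy tangible, here is a toy illustration with a single XOR parity chunk; the real NC-ECC algorithm is based on forward error correction codes and can survive far more than one lost chunk, so treat this only as a sketch of the idea:

from functools import reduce

def make_chunks(sco, k):
    # Chop a SCO into k equally sized data chunks plus one XOR parity chunk.
    size = len(sco) // k
    chunks = [sco[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))
    return chunks + [parity]

def recover(chunks, lost):
    # Rebuild one lost chunk by XOR-ing all the remaining chunks together.
    remaining = [c for i, c in enumerate(chunks) if i != lost]
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*remaining))

sco = b"0123456789abcdef"                      # a SCO: a collection of consecutive writes
chunks = make_chunks(sco, k=4)                 # 4 data chunks + 1 parity chunk
assert recover(chunks, lost=2) == chunks[2]    # losing any single chunk is fine
print("lost chunk recovered from the remaining ones")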

By splitting up the Edge, the Volume Driver, the cache layer and the capacity tier, you have the ultimate flexibility to build the storage cluster of your needs. You can run everything on the same server, hyperconverged, or you can install each component on a dedicated server to maximize scalability and performance.

The first alpha version of Fargo is now available on the repo.