The different ALBA Backends explained

With the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the geoscale functionality. In a geoscale cluster, the data is spread over multiple datacenters. If one of the datacenters goes offline, the geoscale cluster stays up and running and continues to serve data.

The geoscale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collection of physical disks, devices or even backends. Next to grouping disks or backends, it also defines how data is stored on its constituents: parameters such as the erasure coding/replication factor, compression and encryption need to be defined. Ordinarily a geoscale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, only had 1 type of backend, there are now 2 types: a local and a global backend.

  • A local backend groups physical devices. This type is typically used to group disks within the same datacenter.
  • A global backend combines multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters.
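
The relation between the two backend types can be pictured as a simple composition. The sketch below is only an illustration: the class names, fields and device names are hypothetical and are not part of the Open vStorage or ALBA API, and it assumes the typical case of one datacenter per local backend.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LocalBackend:
    """Hypothetical model of a local backend: a named group of devices."""
    name: str
    datacenter: str  # assumption: the typical case of one datacenter per local backend
    devices: List[str] = field(default_factory=list)


@dataclass
class GlobalBackend:
    """Hypothetical model of a global backend: a group of local backends."""
    name: str
    members: List[LocalBackend] = field(default_factory=list)

    def datacenters(self) -> List[str]:
        # A global backend spans every datacenter of its members.
        return sorted({member.datacenter for member in self.members})

    def device_count(self) -> int:
        return sum(len(member.devices) for member in self.members)


hdd_dc1 = LocalBackend("hdd-dc1", "dc1", ["sda", "sdb"])
hdd_dc2 = LocalBackend("hdd-dc2", "dc2", ["sda", "sdb", "sdc"])
geo = GlobalBackend("hdd-global", [hdd_dc1, hdd_dc2])
print(geo.datacenters(), geo.device_count())
```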

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation happens based upon the performance of the devices in the datacenter. An SSD backend will be created with devices which are fast and low latency, and an HDD backend will be created with slow(er) devices which are optimised for capacity. In some cases the SSD or HDD backend will be split into more backends if it contains many devices, for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend.
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. This doesn’t make sense of course for an SSD backend as the latency between the datacenters will be a performance limiting factor.
Another reason to create multiple backends is if you want to offer each customer their own set of physical disks for security or compliance reasons. In that case a backend is created per customer.
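
To see why selecting every x-th disk of a node limits the impact of a node failure, here is a minimal sketch. The function names, node names and disk names are hypothetical, and the round-robin assignment is one possible reading of "every x-th disk", not necessarily how Open vStorage assigns disks.

```python
from collections import defaultdict


def split_into_backends(disks_per_node, backend_count):
    """Assign the i-th disk of every node to backend number i % backend_count."""
    backends = defaultdict(list)
    for node, disks in sorted(disks_per_node.items()):
        for index, disk in enumerate(disks):
            backends[index % backend_count].append((node, disk))
    return dict(backends)


def disks_lost_per_backend(backends, failed_node):
    """Count how many disks each backend loses when one node fails."""
    return {
        backend: sum(1 for node, _ in members if node == failed_node)
        for backend, members in backends.items()
    }


nodes = {
    "node1": ["sda", "sdb", "sdc", "sdd"],
    "node2": ["sda", "sdb", "sdc", "sdd"],
    "node3": ["sda", "sdb", "sdc", "sdd"],
}
backends = split_into_backends(nodes, backend_count=2)
lost = disks_lost_per_backend(backends, "node1")
print(lost)
```

With a single backend, the failure of node1 would take 4 disks out of that backend at once. With the split, each backend loses only 2 of its 6 disks.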

vPool

A vPool is a configuration template for vDisks, the volumes being served by Open vStorage. This template contains a whole range of parameters such as the blocksize to be used, the SCO size on the backend, the default write buffer size, the preset to be used for data protection, the hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. These last 2 are particularly interesting as they express how different ALBA backends are tied together.
When you create a vPool you select a backend to store the volume data. This can be a local backend, for example SSD for an all-flash experience, or a global backend in case you want to spread data over multiple datacenters. This backend is used by every Storage Router serving the vPool.
If you use a global backend across multiple datacenters, you will want to use some sort of caching in the local datacenter where the volume is running in order to keep the read latency as low as possible. To achieve this, assign a local SSD backend when extending a vPool to a certain Storage Router. On a read, all volumes being served by that Storage Router will first check whether the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use a different cache backend. This approach keeps hot data in the local SSD cache and stores cold data on the capacity backend which is distributed across datacenters. By using this approach Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.
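
The read path described above can be sketched as a cache lookup in front of the capacity backend. This is a simplified illustration only: the classes and method names are hypothetical, the backends are plain in-memory dictionaries, and filling the cache on a read miss is an assumption made for the example.

```python
class Backend:
    """Hypothetical in-memory stand-in for an ALBA backend."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.reads = 0  # number of lookups served by this backend

    def get(self, key):
        self.reads += 1
        return self.objects.get(key)

    def put(self, key, value):
        self.objects[key] = value


class StorageRouter:
    """Hypothetical Storage Router with an optional local cache backend."""

    def __init__(self, name, capacity_backend, cache_backend=None):
        self.name = name
        self.capacity = capacity_backend
        self.cache = cache_backend

    def read(self, key):
        # First check the local (SSD) cache backend, if one is assigned.
        if self.cache is not None:
            data = self.cache.get(key)
            if data is not None:
                return data
        # Fall back to the capacity backend shared by all Storage Routers.
        data = self.capacity.get(key)
        # Assumption for this sketch: a miss populates the local cache.
        if data is not None and self.cache is not None:
            self.cache.put(key, data)
        return data


capacity = Backend("global-hdd")
capacity.put("sco-1", b"volume data")

router_a = StorageRouter("router-dc1", capacity, Backend("ssd-dc1"))
router_b = StorageRouter("router-dc2", capacity, Backend("ssd-dc2"))

first = router_a.read("sco-1")   # miss in ssd-dc1, served by the capacity backend
second = router_a.read("sco-1")  # hit in ssd-dc1, no trip to the capacity backend
print(first == second, capacity.reads)
```

Each Storage Router keeps its own cache backend, so the second read on router_a does not touch the capacity backend, while the cache of router_b stays empty until it serves its own reads.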

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, a grouping of vDisks which share the same config, are the glue between these different backends.

Fargo RC2

We released Fargo RC2. The biggest new items in this release:

  • Multiple performance improvements such as multiple proxies per volume driver (the default amount is 2), bypassing the proxy and going straight from the volume driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same datacenter instead of going over the network to another datacenter).
  • API to limit the amount of data that gets loaded into the memory of the volume driver host. Instead of loading all metadata of a vDisk into RAM, you can now specify the percentage it can take in RAM.
  • Counter which keeps track of the number of invalid checksums per ASD so we can flag bad ASDs faster.
  • Configuring the scrub proxy to be cache on write.
  • Implemented timeouts for the volume driver calls.
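
The local read preference mentioned in the list can be illustrated with a small ordering function. This is a sketch under assumptions: the dictionary layout and the function name are hypothetical and do not reflect the actual volume driver or ALBA code.

```python
def order_by_locality(asds, local_datacenter):
    """Return the ASDs with those in the local datacenter first.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(asds, key=lambda asd: asd["datacenter"] != local_datacenter)


asds = [
    {"id": "asd-1", "datacenter": "dc2"},
    {"id": "asd-2", "datacenter": "dc1"},
    {"id": "asd-3", "datacenter": "dc3"},
    {"id": "asd-4", "datacenter": "dc1"},
]
read_order = [asd["id"] for asd in order_by_locality(asds, "dc1")]
print(read_order)
```

A reader in dc1 would try asd-2 and asd-4 first and only go over the network to another datacenter when the local ASDs cannot serve the read.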

The team also solved 110 issues between RC1 and RC2. An overview of the complete content can be found here: Added Features | Added Improvements | Solved Bugs

Open vStorage Releases

Since Open vStorage is running in production at customers, we need to carefully plan our releases as a small glitch might cause a disaster. For storage software there is a golden rule:

If it ain’t broken, don’t fix it!

With the release of Fargo RC1 we are entering a new cycle of intermediate releases and bugfixes. Once Fargo is GA we will push out a new update at regular intervals. Before installing an update, customers like to know exactly what is fixed in that update. That is why the release notes are documented for each release, even an intermediate one. Let’s take as an example the Fargo Release Candidate 1. This release consists of the following packages:

The content of each package, e.g. the webapps package, can be found in the appropriate repository (or you can click the link in the release notes). The release notes of the package contain a summary of all fixed issues in that exact package. In case you want to be kept up to date on new releases, add the release page as an RSS feed (https://github.com/openvstorage/home/releases.atom) to your favourite RSS feed reader. If you prefer to be kept up to date by email, you can use Sibbell, Blogtrottr or a similar service.

Moving block storage between datacenters: the Demo

Probably the coolest feature of the new Fargo release is the GeoScale capability, spreading data across multiple datacenters. With this feature Open vStorage can offer distributed block storage across multiple locations. In the demo below, storage is spread across 3 datacenters in la douce France (Roubaix, Strasbourg and Gravelines). The demo also explains how the storage is spread across these datacenters and shows the live migration of a running VM and its storage between 2 datacenters. The whole migration process completes within a few seconds. The GeoScale functionality can be compared with solving a Sudoku puzzle. The data gets chopped up into chunks which are distributed across all the nodes and datacenters in the cluster. As long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. In the demo even a datacenter loss is supported.

GeoScale FAQ

Can I survive a datacenter outage?

Yes, in a GeoScale cluster, the data is spread over multiple datacenters and is available from each location. If one of these datacenters goes offline, the GeoScale cluster stays up and running and continues to serve data. Virtual Machines running in the datacenter that went down can be migrated to one of the other datacenters in just seconds without having to copy all of the data.

Will storing data across multiple datacenters not be too slow for my database, VMs, … ?

No, Open vStorage aggregates all flash (SSDs, NVMe, PCIe) within each datacenter to form a global cache. Open vStorage uses these local cache pools to speed up incoming reads and writes.

How far can the datacenters be apart?

Open vStorage supports metroscale clusters where the datacenters are only a couple of miles apart, such as in the greater New York region, but also clusters where the datacenters are a couple of thousand miles apart.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge: new nodes are joining the cluster, old nodes need to be replaced, vPools are created and removed, … . In Eugene and earlier versions we relied on simple config files which were located on each node. It should not come as a surprise that in large clusters it proved to be a challenge to keep the config files in sync. Sometimes a clusterwide config parameter was updated while one of the nodes was being rebooted. As a consequence the update didn’t make it to that node, and after the reboot it kept running with an old config.
For Fargo we decided to tackle this problem. The answer: Distributed Config Management.

Distributed Config Management

All config files are now stored in a distributed config management system. When a component starts, it retrieves the latest configuration settings from the management system. Let’s have a look at how this works in practice. For example, a node is down and we remove the vPool from that node. As the vPool was shrunk, the config for that VolumeDriver is removed from the config management system. When the node restarts it will try to get the latest configuration settings for the vPool from the config management system. As there is no config for the removed vPool, the VolumeDriver will no longer serve the vPool. In a first phase we have added support for Arakoon, our beloved and in-house developed distributed key/value store, as the distributed config management system. As an alternative to Arakoon, ETCD has been incorporated, but do know that in our own deployments we always use Arakoon (hint).
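
The restart scenario above can be sketched with an in-memory stand-in for the config management system. Everything in this example is hypothetical: the ConfigStore class, the key layout and the vPool names are made up for illustration and do not reflect the real Arakoon or ETCD clients or the actual Open vStorage key structure.

```python
class ConfigStore:
    """Hypothetical in-memory stand-in for a distributed key/value config store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def list(self, prefix):
        return sorted(key for key in self._data if key.startswith(prefix))


def vpools_to_serve(store, node):
    """On startup a node asks the store which vPools it should serve."""
    prefix = "/example/vpools/{0}/".format(node)  # made-up key layout
    return [key[len(prefix):] for key in store.list(prefix)]


store = ConfigStore()
store.set("/example/vpools/node1/vpool-a", {"cache": True})
store.set("/example/vpools/node1/vpool-b", {"cache": False})

# node1 is down; meanwhile vpool-b is removed from that node.
store.delete("/example/vpools/node1/vpool-b")

# node1 restarts and reads the latest configuration from the store.
after_restart = vpools_to_serve(store, "node1")
print(after_restart)
```

Because the node reads its configuration from the shared store at startup instead of from a local file, it cannot come back with a stale view of which vPools it serves.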

How to change a config parameter:

Changing parameters in the config management system is very easy through the Open vStorage CLI:

  • ovs config list some: List all keys with the given prefix.
  • ovs config edit some-key: Edit that key in your configured editor. If the key doesn’t exist, it will get created.
  • ovs config get some-key: Print the content of the given key.

The distributed config management also contains a key for all scheduled tasks and jobs. To update the default schedule, edit the key /ovs/framework/scheduling/celery and plan the tasks by adding a crontab style schedule.

Fargo: the updated Open vStorage Architecture

With the Fargo release of Open vStorage we are focussing even more on the Open vStorage sweet spot: multi-petabyte, multi-datacenter storage clusters which offer super-fast block storage.
In order to achieve this we had to significantly change the architecture for the Fargo release. Eugene, the version before Fargo, already had the Shared Memory Server (SHM) in its code base but it wasn’t activated by default. The Fargo release now primarily uses the SHM approach. To make even more use of it, we created the Open vStorage Edge. The Edge is a lightweight block storage driver which can be installed on Linux servers (hosts running the hypervisor or inside the VM) and talks across the network to the Shared Memory of a remote Volume Driver. Both TCP/IP and the low latency RDMA protocol can be used to connect the Edge with the Volume Driver. Northbound the Edge has an iSCSI, Blktap and QEMU interface. Additional interfaces such as iSER and FCoE are planned. Next to the new Edge interface, the slower Virtual Machine interface, which exposes a Virtual File System (NFS, FUSE), is still supported.

Architecture

The Volume Driver has also been optimized for performance. The locks in the write path have been revised in order to minimize their impact. More radical is the decision to remove the deduplication functionality from the Volume Driver in order to keep the size of the metadata of the volumes to a strict minimum. By removing the bytes reserved for the hash, we are capable of keeping all the metadata in RAM and pushing the performance beyond 1 million IOPS per host on decent hardware. For those who absolutely need deduplication, a version of the Volume Driver with deduplication support is still available.

With the breakthrough of RDMA, the network bottleneck is removed and network latency is brought down to a couple of microseconds. Open vStorage makes use of the possibilities RDMA offers to implement a shared cache layer. To achieve this it is now possible to create an ALBA backend out of NVMe or SSD devices. This layer acts as a local cache layer, within a single datacenter, in front of a SATA ALBA backend, the capacity tier, which is spread across multiple datacenters.
This means all SSDs in a single datacenter form a shared cache for the data of that datacenter. This minimizes the impact of an SSD failure and removes the cold cache effect when moving a volume between hosts.
In order to minimize the impact of a single disk failure we introduced the NC-ECC (Network and Clustered Error Correction Codes) algorithm. This algorithm can be compared with solving a Sudoku puzzle. Each SCO, a collection of consecutive writes, is chopped up into chunks. All these chunks are distributed across all the nodes and datacenters in the cluster. The total number of chunks can be configured and can be chosen, for example, so that the cluster recovers from a multi-node failure or a complete datacenter loss. A failure, whether it is a disk, node or datacenter, will cross out some numbers from the complete Sudoku puzzle, but as long as you have enough numbers left, you can still solve the puzzle. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. The NC-ECC algorithm is based on forward error correction codes and is further optimized for usage within a multi-datacenter approach. When there is a disk or node failure, additional chunks will be created using only data from within the same datacenter. This ensures the bandwidth between datacenters isn’t stressed in case of a simple disk failure.
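
The "enough chunks left" idea can be demonstrated with the simplest possible forward error correction code: k data chunks plus a single XOR parity chunk. Note that this is not the NC-ECC algorithm, whose details are not given here; it is a deliberately simplified stand-in that tolerates the loss of exactly one chunk, and the function names are hypothetical.

```python
def xor_bytes(left, right):
    """XOR two equally sized byte strings."""
    return bytes(a ^ b for a, b in zip(left, right))


def encode(data, k):
    """Split data into k equally sized chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)                # ceiling division
    padded = data.ljust(size * k, b"\0")     # pad so all chunks have equal size
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return chunks + [parity]


def recover(chunks, length):
    """Rebuild the original data when at most one chunk is missing (None)."""
    missing = [i for i, chunk in enumerate(chunks) if chunk is None]
    if len(missing) > 1:
        raise ValueError("a single parity chunk can only repair one lost chunk")
    chunks = list(chunks)
    if missing:
        size = len(next(chunk for chunk in chunks if chunk is not None))
        rebuilt = bytes(size)
        for chunk in chunks:
            if chunk is not None:
                rebuilt = xor_bytes(rebuilt, chunk)
        chunks[missing[0]] = rebuilt
    # Drop the parity chunk and the padding.
    return b"".join(chunks[:-1])[:length]


sco = b"consecutive writes form one SCO"
pieces = encode(sco, k=3)   # 3 data chunks + 1 parity chunk, e.g. one per node
pieces[1] = None            # simulate the loss of one disk, node or datacenter
restored = recover(pieces, len(sco))
print(restored == sco)
```

Real forward error correction codes generalize this by adding more redundant chunks, so that several chunks (and thus several disks, nodes or a whole datacenter) can be lost at once.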

By splitting up the Edge, the Volume Driver, the cache layer and the capacity tier, you have the ultimate flexibility to build the storage cluster you need. You can run everything on the same server, hyperconverged, or you can install each component on a dedicated server to maximize scalability and performance.

The first alpha version of Fargo is now available on the repo.