Open vStorage High Availability (HA)

Last week I received an interesting question from a customer:

What about High-Availability (HA)? How does Open vStorage protect against failures?

This customer was right to ask that question. In case you run a large scale, multi-petabyte storage cluster, HA should be one of your key concerns. Downtime in such a cluster doesn’t only lead to production loss but might be a real PR disaster or even lead to foreclosure. When end-customers start leaving your service, it can become a slippery slope and before you are aware there is no customer left on your cluster. Hence, asking the HA question beforehand is a best practice for every storage engineer challenged with doing a due diligence of a new storage technology. Over the past few years we already devoted a lot of words to Open vStorage HA so I thought it was time for a summary.

In this blog post I will discuss the different HA scenarios starting from top (the edge) to bottom (the ASD).

The Edge

To start an Edge block device, you need to pass the IP and port of a Storage Router with the vPool of the vDisk. On initial connection the Storage Router will return to the Edge a list of fail-over Storage Routers. The Edge caches this information and switches automatically to another Storage Router in case it can’t communicate with the Storage Router for 15 seconds.
Periodically the Edge also asks the Storage Router to which Storage Router it should connect. This way the Storage Router can instruct the Edge to connect to another Storage Router, for example because the original Storage Router will be shut down.
For more details, check the following blog post about Edge HA.

The Storage Router

The Storage Router also has multiple HA features for the data path. As a vDisk can only be active and owned by a single Volume Driver, the block to object conversion process of the Storage Router, a mechanism is in place to make sure the ownership of the vDisks can be handed over (happy path) or stolen (unhappy path) by another Storage Router. Once the ownership is transferred the volume is started on the new Storage Router and IO requests can be processed. In case the old Storage Router would still try to write to the backend, fencing will kick in which prevents data to be stored on the backend.
The ALBA proxy is responsible for encrypting, compressing and erasure code the Storage Container Objects (SCOs) coming from the Volume Driver and sending the fragments to the ASD processes on the SSD/SATA disks. Each Storage Router also has multiple proxies and can switch between these proxies in cases of issues and timeouts.

The ALBA Backend

An ALBA backend typically consist out of a multiple physical disks across multiple servers. The proxies generate redundant parity fragments via erasure coding which are stored across all devices of the backend. As a result, a device or even a complete server failure doesn’t lead to data loss. On top, backends can be recursively composed. Let’s take as example the case where you have 3 data centers. One could create a (local) backend containing the disks of each data center and create a (global) backend on top of these these (local) backends. Data could for example be replicated 3 times, one copy in each data center, and erasure coded within the data center for storage efficiency. Using this approach a data center outage wouldn’t cause any data loss.

The management path HA

The previous sections of this blog post discussed the HA features of the data path. The management path is also high available. The GUI and API can be reached from all master nodes in the cluster. The metadata is also stored redundantly and is spread across multiple nodes or even data centers. Open vStorage has 2 types of metadata: the volume metadata and the backend metadata. The volume metadata is stored in a networked RocksDB using a master-slave concept. More information about that can be found here and in a video here.
The backend metadata is stored in our own, in-house developed, always consistent key-value store named Arakoon. More info on Arakoon can be found here.

That’s in a nutshell how Open vStorage makes sure a disk, server or data center disaster doesn’t lead to storage downtime.

NSM and ABM, Arakoon teamwork

In an earlier post we shed some light on Arakoon, our own always consistent distributed key-valuedatabase. Arakoon is used in many parts of the Open vStorage platform. One of the use cases is to store the metadata of the native ALBA object store. Do note that ALBA is NOT a general purpose object store but specifically crafted and optimized for Open vStorage. ALBA uses a collection of Arakoon databases to store where and how objects are stored on the disks in the backend. Typically the SCOs and TLogs of each vDisk end up in a separate bucket, a namespace, on the backend. For each object in the namespace there is a manifest that describes where and how the object is stored on the backend. To glue the namespaces, the manifests and the disks in the backend together, ALBA uses 2 types of Arakoon databases: the ALBA Backend Manager (ABM) and one or more NameSpace Managers (NSM).

ALBA Manager

The ALBA Manager (ABM) is the entry point for all ALBA clients which want to store or retrieve something from the backend. The ALBA Manager DB knows which physical disks belong to the backend, which namespaces exist and on which NSM hosts they can be found.
To optimize the Arakoon DB it is loaded with the albamgr plugin, a collection of specific ABM user functions. Typically there is only a single ABM manager in a cluster.

NSM

A NameSpace Manager (NSM) is an Arakoon cluster which holds the manifests for the namespaces assigned to the NSM. Which NSM is managing which namespaces is registered with the ALBA Manager. The NSM is also the remote API offered by the NSM host to manipulate most of the object metadata during normal operation. Its coordinates can be retrieved from the ALBA Manager by (proxy) clients and maintenance agents.

To optimize the Arakoon DB it is loaded with the nsm_host plugin, a collection of specific NSM host user functions. Typically there are multiple NSM clusters for a single ALBA backend. This allows to scale the backend both capacity and performance wise.

IO requests

Let’s have a look at the IO path. Whenever the Volume Driver needs to store an object on the backend, a SCO or a TLog, it hands the object to one of the ALBA proxies on the same host. The ALBA proxy contains an ALBA client which communicates with the ABM to know on which NSM and disks it can store the object. Once the object is stored on the disks, the manifest with the metadata is registered in the NSM. For performance reasons the different fragment of the object and the manifest can be cached by the ALBA proxy.

In case the Volume Driver needs data from the backend, because it is no longer in the write buffer, it request the proxy to fetch the exact data by asked for a SCO location and offset. In case the right fragment are in the fragment cache, the proxy returns the data immediately to the Volume Driver. Otherwise it can use the manifest from the cache or the manifest isn’t in the cache, the proxy contacts the ABM to get the right NSM and from that the manifest. Based upon the manifest the ALBA client fetches the data it needs from the physical disks and provides it to the Volume Driver.

Arakoon, a battle hardened key-value DB

Arakoon LogoAt Open vStorage we just love Arakoon, our in-house developed key-value DB. It is always consistent and hence prefers to die instead of giving you wrong data. Thrust us, this is a good property if you are building a storage platform. It is also pretty fast, especially in a multi-datacenter topology. And above all, it is battle hardened over 7 years and it is now ROCK. SOLID.

We use Arakoon almost everywhere in Open vStorage. We use it to store the framework model, volume ownership and to keep track of the ALBA backend metadata. So it is time we tell you a bit more about that Arakoon beast. Arakoon is developed by the Open vStorage team and the core has been made available as open-source project on GitHub. It is already battle proven in several of the Open vStorage solutions and projects by technology leaders such as Western Digital and iQIYI, a subsidiary of Baidu.

Arakoon aims to be easy to understand and use, whilst at the same time taking the following features into consideration:

  • Consistency: The system as a whole needs to provide a consistent view on the distributed state. This stems from the experience that eventual consistency is too heavy a burden for a user application to manage. A simple example is the retrieval of the value for a key where you might receive none, one or multiple values depending on the weather conditions. The next question is always: Why don’t I a get a result? Is it because there is no value, or merely because I currently cannot retrieve it?
  • Conditional and Atomic Updates: We don’t need full blown transactions (would be nice to have though), but we do need updates that abort if the state is not what we expect it to be. So at least an atomic conditional update and an atomic multi-update are needed.
  • Robustness: The system must be able to cope with the failure of individual components, without concessions to consistency. However, whenever consistency can no longer be guaranteed, updates must simply fail.
  • Locality Control: When we deploy a system over 2 datacenters, we want guarantees that the entire state is indeed present in both datacenters. This is something we could not get from distributed hash tables using consistent hashing.
  • Healing & Recovery: Whenever a component dies and is subsequently revived or replaced, the system must be able to guide that component towards a situation where that node again fully participates. If this cannot be done fully automatically, then human intervention should be trivial.
  • Explicit Failure: Whenever there is something wrong, failure should propagate quite quickly.

Sounds interesting, right? Let’s share some more details on the Arakoon internals. It is a distributed key/value database. Since it is strongly consistent, it prefers to stop instead of providing out-dated, faulty values even in case multiple components fail. To achieve this Arakoon uses a variation of the Paxos algorithm. An Arakoon cluster consists of a small set of nodes that all contain the full range of key-value pairs in an internal DB. Next to this DB each node contains the transaction log and a transaction queue. While each node of the cluster carries all the data, yet only one node is assigned to be the master node. The master node manages all the updates for all the clients. The nodes in the cluster vote to select the master node. As long as there is a majority to select a master, the Arakoon DB will remain accessible. To make sure the Arakoon DB can survive a datacenter failure the nodes of the cluster are spread across multiple datacenters.

The steps to store a key in Arakoon

Whenever a key is to be stored in the database, following flow is executed:

  1. Updates to the Arakoon database are consistent. An Arakoon client always looks up the master of a cluster and then sends a request to the master. The master node of a cluster has a queue of all client requests. The moment that a request is queued, the master node sends the request to all his slaves and writes the request in the Transaction Log (TLog). When the slaves receive a request, they store this also in their proper TLog and send an acknowledgement to the master.
  2. A master awaits for the acknowledgements of the slaves. When he receives an acknowledgement of half the nodes plus one, the master pushes the key/value pair in its database. In a five node setup (one master and four slaves), the master must receive an acknowledgement of two slaves before he writes his data to the database, since he is also taken into account as node.
  3. After having written his data in his database, the master starts the following request in his queue. When a slave receives this new request, the slaves first write the previous request in their proper database before handling the new request. This way a slave is always certain that the master has successfully written the data in his proper database.

The benefits of using Arakoon

Scalability

Since the metadata of the ALBA backends gets sharded across different Arakoon clusters, scaling the metadata, capacity or performance wise, is as simple as adding more Arakoon nodes. The whole platform has been designed to store gigabytes of metadata without the metadata being a performance bottleneck.

High Availability

It is quite clear that keeping the metadata safe is essential for any storage solution. Arakoon is designed to be used in high available clusters. By default it stores 3 replicas of the metadata but for extra resilience 5-way replication or more can also be configured. These replica’s can even be stored across locations, allowing for a multi-site block storage cluster which can survive a datacenter loss.

Performance

Arakoon was designed with performance in mind. OCaml was selected as programming language for its reliability and performance. OCaml provides powerful and succinct concurrency (cooperative multitasking), a must in distributed environments. To further boost performance a forced master capability is available which makes sure metadata reads are being served by local Arakoon nodes in case of a multi-site block storage cluster. With Arakoon the master node is local so it has a sub-millisecond latency. As an example, Cassandra, another distributed DB which is used in many projects, requires read consistency by reading the data from multiple datacenters. This leads to a latency that is typically higher than 10 milliseconds.

Distributed Config Management

Distributed Config ManagementWhen you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge: new nodes are joining the cluster, old nodes need to be replaced, vPools are created and removed, … . In Eugene and earlier versions we relied on simple config files which were located on each node. It should not come as a surprise that in large clusters it proved to be a challenge to keep the config files in sync. Sometime a clusterwide config parameter was updated while one of the nodes was being rebooted. This had as consequence that the update didn’t make it to the node and after the reboot it kept running with an old config.
For Fargo we decided to tackle this problem. The answer: Distributed Config Management.

Distributed Config Management

All config files are now stored in a distributed config management system. When a component starts, it now retrieves the latest configuration settings from the management system. Let’s have a look at how this works in practice. For example a node is down and we remove the vPool from that node. As the vPool was shrunk, the config for that VolumeDriver is removed from the config management system. When the node restarts it will try to get the latest configuration settings for the vPool from the config management system. As there is no config for the removed vPool, the VolumeDriver will no longer serve the vPool. In a first phase we have added support for Arakoon, our beloved and in-house developed distributed key/value store, as distributed config management system. As an alternative to Arakoon, ETCD has been incorporated but do know that in our own deployments we always use Arakoon (hint).

How to change a config parameter:

Changing parameters in the config management system is very easy through the Open vStorage CLI:

  • ovs config list some: List all keys with the given prefix.
  • ovs config edit some-key: Edit that key in your configured editor. If the key doesn’t exist, it will get created.
  • ovs config get some-key: Print the content of the given key.

The distributed config management also contains a key for all scheduled tasks and jobs. To update the default schedule, edit the key /ovs/framework/scheduling/celery and plan the tasks by adding a crontab style schedule.

I like to move it, move it

The vibe at the Open vStorage office is these days best explained by a song of the early nineties:

I like to move it, move it ~ Reel 2 Reel

While the summer time is in most companies a more quiet time, the Open vStorage office is buzzing like a beehive. Allow me to give you a short overview of what is happening:

  • We are moving into our new, larger and stylish offices. The address remains the same but we are moving into a completely remodeled floor of the Idola business center.
  • Next to physically moving desks at the Open vStorage HQ, we are also moving our code from BitBucket to GitHub. We have centralized all our code under https://github.com/openvstorage. To list a few of the projects: Arakoon (our consistent distributed key-value store), ALBA (the Open vStorage default ALternate BAckend) and of course Open vStorage itself. Go check it out!
  • Finishing up our Open vStorage 2.2 GA release.
  • Adding support for RedHat and Cent OS by merging in the Cent-OS branch. There is still some work to do around packaging, testing and upgrades so feel free to give a hand. As this was really a community effort, we owe everyone a big thank you.
  • Working on some very cool features (RDMA anyone?) but let’s keep those for a separate post.
  • Preparation for VMworld (San Francisco) and the OpenStack summit in Tokyo.

As you can see, many things going on at once so prepare for a hot Open vStorage fall!

Open vStorage 1.0.3 Beta

We couldn’t wait until the end of April before pushing out our next release. That is why the Open vStorage Team decided to release today (April 17 2014) a small version with following set of features:

  • Remove a vPool through the GUI: in version 1.0.2 it was already possible to create a vPool and extend the vPool across Hosts through the GUI. To make the whole vPool management process complete and more user friendly, we have added the option to remove a vPool from a Host. This is of course only possible in case there are no vDisks being served from the vPool to a vMachine on that Host.
  • New version of Arakoon: Arakoon 1.7, the distributed key-value store that guarantees consistency above anything else, integrated in Open vStorage had some major rework done. This release also fixes bugs we encountered.
  • Some small improvements include adding an NTP server by default to the Host and displaying the different used ports on the VSA detail page.

The version also includes bug fixes:

  • Issue with connecting to SAIO via S3.
  • Footer displaying incorrect values.
  • Update sshd configuration to avoid client to send its own specific LOCALE settings.
  • Shutting down first node in 4 node setup caused ovs_consumer_volumerouter to halt on node 2.
  • KVM clones are now no longer automatically started.
  • Open vStorage GUI sometimes didn’t list a second VM.
  • Clone from vTemplate on KVM does not work in the absence of “default” network on Archipel.
  • Restrict vTemplate deletion from the Open vStorage GUI when a child vMachine is present.
  • Add vpool without copying ceph.* files now gives error.

To give this version a try, download the software and install with Quality Level Test.