Arakoon, a battle-hardened key-value DB

At Open vStorage we just love Arakoon, our in-house developed key-value DB. It is always consistent and hence prefers to die instead of giving you wrong data. Trust us, this is a good property if you are building a storage platform. It is also pretty fast, especially in a multi-datacenter topology. And above all, it has been battle-hardened over 7 years and it is now ROCK. SOLID.

We use Arakoon almost everywhere in Open vStorage. We use it to store the framework model and volume ownership, and to keep track of the ALBA backend metadata. So it is time we tell you a bit more about that Arakoon beast. Arakoon is developed by the Open vStorage team and the core has been made available as an open-source project on GitHub. It is already battle-proven in several Open vStorage solutions and in projects by technology leaders such as Western Digital and iQIYI, a subsidiary of Baidu.

Arakoon aims to be easy to understand and use, whilst at the same time taking the following features into consideration:

  • Consistency: The system as a whole needs to provide a consistent view on the distributed state. This stems from the experience that eventual consistency is too heavy a burden for a user application to manage. A simple example is the retrieval of the value for a key where you might receive none, one or multiple values depending on the weather conditions. The next question is always: Why don’t I get a result? Is it because there is no value, or merely because I currently cannot retrieve it?
  • Conditional and Atomic Updates: We don’t need full-blown transactions (those would be nice to have though), but we do need updates that abort if the state is not what we expect it to be. So at least an atomic conditional update and an atomic multi-update are needed (a sketch of these update types follows this list).
  • Robustness: The system must be able to cope with the failure of individual components, without concessions to consistency. However, whenever consistency can no longer be guaranteed, updates must simply fail.
  • Locality Control: When we deploy a system over 2 datacenters, we want guarantees that the entire state is indeed present in both datacenters. This is something we could not get from distributed hash tables using consistent hashing.
  • Healing & Recovery: Whenever a component dies and is subsequently revived or replaced, the system must be able to guide that component towards a situation where that node again fully participates. If this cannot be done fully automatically, then human intervention should be trivial.
  • Explicit Failure: Whenever there is something wrong, failure should propagate quite quickly.
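
To make the conditional and atomic multi-update semantics concrete, here is a minimal, single-process sketch in Python. The class and method names are invented for illustration and are not the Arakoon client API; it only shows an update that aborts when the state isn’t what we expect, and an all-or-nothing sequence of updates.

```python
import threading


class ConsistentStore:
    """Toy key-value store illustrating conditional and atomic multi-updates."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def test_and_set(self, key, expected, new_value):
        """Set key to new_value only if its current value equals expected."""
        with self._lock:
            current = self._data.get(key)
            if current != expected:
                return False, current           # abort: state is not what we expected
            self._data[key] = new_value
            return True, new_value

    def sequence(self, updates):
        """Apply a list of (key, value) pairs atomically: all or nothing."""
        with self._lock:
            backup = dict(self._data)
            try:
                for key, value in updates:
                    self._data[key] = value
            except Exception:
                self._data = backup             # roll back on any failure
                raise


store = ConsistentStore()
store.sequence([("volume:1/owner", "node-a"), ("volume:1/size", "1TiB")])
ok, _ = store.test_and_set("volume:1/owner", "node-a", "node-b")   # succeeds
ok, _ = store.test_and_set("volume:1/owner", "node-a", "node-c")   # aborts: owner is now node-b
```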

Sounds interesting, right? Let’s share some more details on the Arakoon internals. It is a distributed key/value database. Since it is strongly consistent, it prefers to stop instead of providing outdated or faulty values, even when multiple components fail. To achieve this, Arakoon uses a variation of the Paxos algorithm. An Arakoon cluster consists of a small set of nodes that all contain the full range of key-value pairs in an internal DB. Next to this DB, each node keeps a transaction log and a transaction queue. While each node of the cluster carries all the data, only one node is assigned to be the master node. The master node manages all the updates for all the clients. The nodes in the cluster vote to select the master node. As long as a majority of nodes can agree on a master, the Arakoon DB remains accessible. To make sure the Arakoon DB can survive a datacenter failure, the nodes of the cluster are spread across multiple datacenters.
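
To make the majority rule concrete, here is a small illustrative calculation (not Arakoon code): with n nodes, the cluster can elect a master and stay available only as long as a strict majority of nodes can reach each other.

```python
def has_quorum(cluster_size: int, reachable_nodes: int) -> bool:
    """A master can only be elected if a strict majority of the nodes is reachable."""
    return reachable_nodes > cluster_size // 2


# A 3-node cluster survives the loss of 1 node, a 5-node cluster the loss of 2.
assert has_quorum(3, 2) and not has_quorum(3, 1)
assert has_quorum(5, 3) and not has_quorum(5, 2)
```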

The steps to store a key in Arakoon

Whenever a key is to be stored in the database, the following flow is executed (a simplified sketch follows the list):

  1. Updates to the Arakoon database are consistent. An Arakoon client always looks up the master of a cluster and then sends its request to that master. The master node of a cluster keeps a queue of all client requests. The moment a request is queued, the master node sends the request to all its slaves and writes the request in the Transaction Log (TLog). When the slaves receive a request, they also store it in their own TLog and send an acknowledgement to the master.
  2. The master waits for the acknowledgements of the slaves. When it receives an acknowledgement from half the nodes plus one, the master pushes the key/value pair into its database. In a five-node setup (one master and four slaves), the master must receive an acknowledgement from two slaves before it writes the data to its database, since the master itself is also counted as a node.
  3. After writing the data to its database, the master starts on the next request in its queue. When a slave receives this new request, it first writes the previous request to its own database before handling the new one. This way a slave is always certain that the master has successfully written the data to its own database.
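
Below is a highly simplified sketch of that flow. All names are invented for illustration; the real implementation is a Paxos variant with many more failure cases, but the ordering is the same: append to the TLog, replicate, wait for a majority of acknowledgements, then apply.

```python
class Slave:
    def __init__(self):
        self.tlog = []       # transaction log
        self.db = {}         # internal key-value DB
        self._applied = -1   # index of the last TLog entry written to the DB

    def replicate(self, entry):
        # Receiving entry i proves the master committed entry i - 1, so first
        # write every earlier TLog entry to the local DB, then store the new one.
        for idx, key, value in self.tlog[self._applied + 1:]:
            self.db[key] = value
            self._applied = idx
        self.tlog.append(entry)
        return True          # acknowledgement back to the master


class Master:
    def __init__(self, slaves):
        self.slaves = slaves
        self.tlog = []
        self.db = {}

    def handle_update(self, key, value):
        entry = (len(self.tlog), key, value)
        self.tlog.append(entry)                         # step 1: own TLog, then fan out
        acks = sum(1 for s in self.slaves if s.replicate(entry))
        majority = (len(self.slaves) + 1) // 2 + 1      # half the nodes plus one
        if acks + 1 < majority:                         # the master counts as a node too
            raise RuntimeError("no majority: the update fails explicitly")
        self.db[key] = value                            # step 2: apply locally


cluster = Master([Slave() for _ in range(4)])           # five-node setup
cluster.handle_update("some/key", "some value")
```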

The benefits of using Arakoon

Scalability

Since the metadata of the ALBA backends gets sharded across different Arakoon clusters, scaling the metadata, capacity- or performance-wise, is as simple as adding more Arakoon nodes. The whole platform has been designed to store gigabytes of metadata without the metadata becoming a performance bottleneck.
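
As a simplified illustration of the sharding idea (not the actual ALBA code), metadata keys can be routed to one of several Arakoon clusters, so adding a cluster adds both capacity and throughput:

```python
import hashlib

# Hypothetical cluster names; the real ALBA sharding scheme may differ.
clusters = ["arakoon-abm-0", "arakoon-abm-1", "arakoon-abm-2"]


def cluster_for(key: str) -> str:
    """Deterministically route a metadata key to one of the Arakoon clusters."""
    digest = hashlib.sha1(key.encode()).digest()
    return clusters[int.from_bytes(digest[:4], "big") % len(clusters)]


print(cluster_for("namespace/volume-42/object-7"))
```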

High Availability

It is quite clear that keeping the metadata safe is essential for any storage solution. Arakoon is designed to be used in highly available clusters. By default it stores 3 replicas of the metadata, but for extra resilience 5-way replication or more can also be configured. These replicas can even be stored across locations, allowing for a multi-site block storage cluster which can survive a datacenter loss.
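
As a hypothetical example of such a multi-site layout, a 5-way replicated cluster spread over three datacenters keeps its majority even when a whole site disappears (all names below are invented for illustration):

```python
# Hypothetical 5-node Arakoon cluster spread over three sites.
cluster_layout = {
    "arakoon-1": "datacenter-A",
    "arakoon-2": "datacenter-A",
    "arakoon-3": "datacenter-B",
    "arakoon-4": "datacenter-B",
    "arakoon-5": "datacenter-C",
}

# Losing any single datacenter still leaves at least 3 of the 5 nodes, i.e. a majority.
for lost_site in ("datacenter-A", "datacenter-B", "datacenter-C"):
    survivors = sum(1 for site in cluster_layout.values() if site != lost_site)
    assert survivors >= 3
```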

Performance

Arakoon was designed with performance in mind. OCaml was selected as the programming language for its reliability and performance. OCaml provides powerful and succinct concurrency (cooperative multitasking), a must in distributed environments. To further boost performance, a forced master capability is available which makes sure metadata reads are served by local Arakoon nodes in the case of a multi-site block storage cluster. With a forced master the master node is local, so reads have sub-millisecond latency. As an example, Cassandra, another distributed DB which is used in many projects, requires read consistency by reading the data from multiple datacenters. This leads to a latency that is typically higher than 10 milliseconds.

Accelerated ALBA as read cache

With the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post, but I would also like to shed some light on the various reasons why we decided to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache – on every write – old data in the cache had to be overwritten. In order to evict data from the cache, a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while. The lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also caused delays for read requests, as the lock prevented data from being safely read. Basically, in order to increase the performance we had to move towards a lockless read cache where data isn’t updated in place.
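
The sketch below (hypothetical code, not the actual Volume Driver) shows why this design serializes everything: the hash table and the LRU list have to be updated together, so a single lock guards both, and even reads have to take it.

```python
from collections import OrderedDict
from threading import Lock


class LockedReadCache:
    """Toy model of the old lock-based read cache: one lock guards hash table + LRU list."""

    def __init__(self, capacity: int):
        self._entries = OrderedDict()   # (volume_id, lba) -> location on the SSD
        self._capacity = capacity
        self._lock = Lock()

    def put(self, volume_id: str, lba: int, location: int) -> None:
        with self._lock:                # every write blocks all readers for a while
            key = (volume_id, lba)
            self._entries[key] = location
            self._entries.move_to_end(key)
            if len(self._entries) > self._capacity:
                self._entries.popitem(last=False)   # evict the least recently used entry

    def get(self, volume_id: str, lba: int):
        with self._lock:                # even reads must take the lock
            key = (volume_id, lba)
            location = self._entries.get(key)
            if location is not None:
                self._entries.move_to_end(key)      # reading also reorders the LRU list
            return location
```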
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size. This greatly improves the write bandwidth to the SSDs. ALBA also allows us to align cores with the ASD processes and the underlying SSDs. By making the whole all-flash ALBA backend core-aligned, the overhead of process switching can be minimised. Basically all operations on flash are now asynchronous, core-aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.
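
A minimal sketch of the append-only idea (again illustrative, not ALBA code): because fragments are only ever appended and the index points at immutable data, there is no in-place update to protect with a global lock.

```python
class AppendOnlyFragmentLog:
    """Toy log-structured store: data is always appended, never updated in place."""

    def __init__(self):
        self._log = bytearray()   # stands in for an SSD partition managed by an ASD
        self._index = {}          # fragment id -> (offset, length)

    def append(self, fragment_id: str, data: bytes) -> None:
        offset = len(self._log)
        self._log += data         # large, consecutive writes at the tail of the log
        self._index[fragment_id] = (offset, len(data))

    def read(self, fragment_id: str) -> bytes:
        offset, length = self._index[fragment_id]
        return bytes(self._log[offset:offset + length])
```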

Lower impact of an SSD failure

By moving the read cache to the ALBA backend, the impact of an SSD failure is much lower. ALBA performs erasure coding across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed and the impact of an SSD failure is limited, as only a fraction of the cache is lost. So in case a single SSD fails, there is no reason to go to the HDD-based capacity backend, as the reads can still be fulfilled based upon other fragments of the data which are still cached.
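
A rough illustration with hypothetical numbers: with a k + m erasure-coding policy, any k of the k + m fragments are enough to serve a read, so one failed SSD costs only a small slice of the cache.

```python
# Hypothetical policy: 8 data fragments + 3 parity fragments spread over 11 SSDs.
k, m = 8, 3
fragments_total = k + m
fragments_lost = 1        # a single SSD fails

# The read can still be served from the cache as long as k fragments survive.
can_serve_from_cache = (fragments_total - fragments_lost) >= k
cache_lost_fraction = fragments_lost / fragments_total
print(can_serve_from_cache, f"{cache_lost_fraction:.0%} of the cached fragments lost")
```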

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migration wasn’t always successful due to the cold cache on the new host. By using the new distributed cache approach, we now have an always-hot cache, even in the case of (live) migrations.

We hope the above reasons prove that we made the right decision by moving the read cache to the ALBA backend. Want to see how to configure the ALBA read cache? Check out this GeoScale demo.

The Edge, a lightweight block device

When I present the new Open vStorage architecture for Fargo, I almost always receive the following question about the Edge:

What is the Edge and why did you develop it?

What is the Edge about?

The Edge is a lightweight software component which can be installed on a Linux host. It exposes a block device API and connects to the Storage Router across the network (TCP/IP or RDMA). Basically the application believes it is talking to a local block device (the Edge), while the volume actually runs on another host (the Storage Router).
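
From the application’s point of view it is just a block device. As a minimal sketch, assuming the Edge-backed volume shows up as a local device node (the path below is purely hypothetical), reading and writing it looks like any other block device:

```python
import os

# Hypothetical device path; the volume itself lives on a remote Storage Router.
DEVICE = "/dev/example-edge-volume"
BLOCK_SIZE = 4096

fd = os.open(DEVICE, os.O_RDWR)
try:
    os.pwrite(fd, b"\x00" * BLOCK_SIZE, 0)       # write the first 4 KiB block
    first_block = os.pread(fd, BLOCK_SIZE, 0)    # read it back
finally:
    os.close(fd)
```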

Why did we develop the Edge?

The reason why we have developed the Edge is quite simple: componentization. With Open vStorage we are mainly dealing with large, multi-petabyte deployments and having this Edge component gives additional benefits in large environments:

Scalability

In large environments you want to be able to scale the compute and storage part independently. In case you run Open vStorage hyper-converged, as advised with earlier versions, this isn’t possible. As a consequence, if you needed more RAM or CPU to run VMs, you also had to invest in more SSDs. With the Edge you can scale compute and storage independently.

Guaranteed performance

With Eugene, the Volume Driver, the high-performance distributed block layer, ran on the compute host together with the VMs. This resulted in the VMs and the Volume Driver fighting for the same CPU and RAM resources. This is a typical issue with hyper-converged solutions. The Edge component avoids this problem: it runs on the compute hosts (and requires only a small amount of resources) while the Volume Driver runs on dedicated nodes, and hence provides a predictable and consistent amount of IOPS to the VMs.

Limit the Impact of Updates

Storage software updates are a (storage) administrator’s worst nightmare. In previous Open vStorage versions an update of the Volume Driver required all VMs on that node to be migrated or brought down. With the Edge, the Volume Driver can be updated in the background as each Edge/compute host has HA features and can automatically connect to another Volume Driver on request, without the need for a VM migration.