At Open vStorage we just love Arakoon, our in-house developed key-value DB. It is always consistent and hence prefers to die instead of giving you wrong data. Thrust us, this is a good property if you are building a storage platform. It is also pretty fast, especially in a multi-datacenter topology. And above all, it is battle hardened over 7 years and it is now ROCK. SOLID.
We use Arakoon almost everywhere in Open vStorage. We use it to store the framework model, volume ownership and to keep track of the ALBA backend metadata. So it is time we tell you a bit more about that Arakoon beast. Arakoon is developed by the Open vStorage team and the core has been made available as open-source project on GitHub. It is already battle proven in several of the Open vStorage solutions and projects by technology leaders such as Western Digital and iQIYI, a subsidiary of Baidu.
Arakoon aims to be easy to understand and use, whilst at the same time taking the following features into consideration:
- Consistency: The system as a whole needs to provide a consistent view on the distributed state. This stems from the experience that eventual consistency is too heavy a burden for a user application to manage. A simple example is the retrieval of the value for a key where you might receive none, one or multiple values depending on the weather conditions. The next question is always: Why don’t I a get a result? Is it because there is no value, or merely because I currently cannot retrieve it?
- Conditional and Atomic Updates: We don’t need full blown transactions (would be nice to have though), but we do need updates that abort if the state is not what we expect it to be. So at least an atomic conditional update and an atomic multi-update are needed.
- Robustness: The system must be able to cope with the failure of individual components, without concessions to consistency. However, whenever consistency can no longer be guaranteed, updates must simply fail.
- Locality Control: When we deploy a system over 2 datacenters, we want guarantees that the entire state is indeed present in both datacenters. This is something we could not get from distributed hash tables using consistent hashing.
- Healing & Recovery: Whenever a component dies and is subsequently revived or replaced, the system must be able to guide that component towards a situation where that node again fully participates. If this cannot be done fully automatically, then human intervention should be trivial.
- Explicit Failure: Whenever there is something wrong, failure should propagate quite quickly.
Sounds interesting, right? Let’s share some more details on the Arakoon internals. It is a distributed key/value database. Since it is strongly consistent, it prefers to stop instead of providing out-dated, faulty values even in case multiple components fail. To achieve this Arakoon uses a variation of the Paxos algorithm. An Arakoon cluster consists of a small set of nodes that all contain the full range of key-value pairs in an internal DB. Next to this DB each node contains the transaction log and a transaction queue. While each node of the cluster carries all the data, yet only one node is assigned to be the master node. The master node manages all the updates for all the clients. The nodes in the cluster vote to select the master node. As long as there is a majority to select a master, the Arakoon DB will remain accessible. To make sure the Arakoon DB can survive a datacenter failure the nodes of the cluster are spread across multiple datacenters.
The steps to store a key in Arakoon
Whenever a key is to be stored in the database, following flow is executed:
- Updates to the Arakoon database are consistent. An Arakoon client always looks up the master of a cluster and then sends a request to the master. The master node of a cluster has a queue of all client requests. The moment that a request is queued, the master node sends the request to all his slaves and writes the request in the Transaction Log (TLog). When the slaves receive a request, they store this also in their proper TLog and send an acknowledgement to the master.
- A master awaits for the acknowledgements of the slaves. When he receives an acknowledgement of half the nodes plus one, the master pushes the key/value pair in its database. In a five node setup (one master and four slaves), the master must receive an acknowledgement of two slaves before he writes his data to the database, since he is also taken into account as node.
- After having written his data in his database, the master starts the following request in his queue. When a slave receives this new request, the slaves first write the previous request in their proper database before handling the new request. This way a slave is always certain that the master has successfully written the data in his proper database.
The benefits of using Arakoon
Since the metadata of the ALBA backends gets sharded across different Arakoon clusters, scaling the metadata, capacity or performance wise, is as simple as adding more Arakoon nodes. The whole platform has been designed to store gigabytes of metadata without the metadata being a performance bottleneck.
It is quite clear that keeping the metadata safe is essential for any storage solution. Arakoon is designed to be used in high available clusters. By default it stores 3 replicas of the metadata but for extra resilience 5-way replication or more can also be configured. These replica’s can even be stored across locations, allowing for a multi-site block storage cluster which can survive a datacenter loss.
Arakoon was designed with performance in mind. OCaml was selected as programming language for its reliability and performance. OCaml provides powerful and succinct concurrency (cooperative multitasking), a must in distributed environments. To further boost performance a forced master capability is available which makes sure metadata reads are being served by local Arakoon nodes in case of a multi-site block storage cluster. With Arakoon the master node is local so it has a sub-millisecond latency. As an example, Cassandra, another distributed DB which is used in many projects, requires read consistency by reading the data from multiple datacenters. This leads to a latency that is typically higher than 10 milliseconds.