The Open Core Model

It was Paul Dix, Founder and CTO of InfluxDB, who rocked the boat with his opening keynote at the last PerconaLive conference. His talk, titled “The Open Source Business Model is Under Siege”, discussed the existential struggle that open source software companies are facing. The talk was based on his experience building a viable business around open source over the last three and a half years with InfluxDB. You can see the full video here.

Infrastructure Software, a tough market …

Paul is right: building a viable open source company around infrastructure software is hard. Building a company around infrastructure tout court is hard these days. Need some examples? HPE bought storage unicorn SimpliVity well below its recent valuation, Tintri did an IPO as a last resort, Nutanix keeps piling up losses quarter after quarter, and RethinkDB and Basho shut down. There are many more examples.

Open Core Model

I can offer only one piece of advice for the above companies: it’s never too late to do the next right thing. For Open vStorage, that next right thing was moving away from a pure-play open source business model. Open vStorage currently follows the open core model. This means that we have a core distributed block storage project which is open source and free to use, but on the other hand we also have a closed source, commercial Enterprise Edition which adds more functionality to the core.
Maybe the term open core sounds a bit too pejorative. What we release as core is a fully functional distributed block storage platform. Deciding which feature ends up in the core and which in the Enterprise Edition is a difficult assessment. As a rule of thumb, the core version should allow small clusters to be set up and operated without data loss and with decent performance. Even block storage clusters which span multiple data centers can be set up with the core version. Enterprises which are looking to build their company (or part of it) on a service which couldn’t be built without the Open vStorage technology are gently steered towards the Enterprise Edition. These are typically well established, large enterprises which are looking to offer a new or better service to their customers. They also understand that one size doesn’t fit all and they want to be able to fiddle with all the bells and whistles of Open vStorage. They want, for example, full control over which vDisk is using which part of the distributed cache. Or they want best-in-class performance and to achieve this they need features like the High Performance Read Mesh. Over time the list of ‘Enterprise Edition only’ features will grow. On the other hand, nothing prevents us from moving features from the Enterprise Edition to the open source version down the line.

A final note

The open core model might offend some people. Yet, we aren’t the only ones operating under an open core model. The open core business model is, for example, also used by Docker, MySQL, InfluxDB, MongoDB, Puppet, Midokura and many, many other software companies. It isn’t an easy business model, as there is always discussion on which features to release as part of the open source project and which as part of the Enterprise Edition. But we are confident that the open core model is the path forward, not only for us but also for the whole software infrastructure market.

PS: Keep following our blog as over the next few weeks we will demonstrate the success of our open core business model with some extensive, multi petabyte, multi data center implementations.

Cache Policy Management: A Closer Look

Don’t you hate a noisy neighbour? Someone who blasts his preferred music just loud enough so you can hear it when trying to get some sleep or having a relaxing commute. Well, the same goes for noisy neighbours in storage. It is not their deafening music that is annoying but the fact that other volumes can’t reach their desired performance because one volume gobbles up all the IOPS.

Setting cache quota

This situation typically occurs when a single volume takes up the whole cache. In order to allow every vDisk to get a fair share of the cache, the Open vStorage Enterprise Edition allows you to put a quota on the cache usage. When creating a vPool you can set a default quota per vDisk, allowing each vDisk to get a fair share of the cache. Do note that the quota system is flexible. It is, for example, possible to set a larger value than the default for a specific vDisk in case it would benefit from more caching. It is even possible to oversubscribe the cache. This way the cache space can be used optimally.
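To make the quota mechanism concrete, here is a minimal Python sketch of how a per-vDisk cache quota with a default value, overrides and oversubscription could be tracked. All names (CacheQuotaManager, set_quota, the vDisk identifiers) are illustrative assumptions, not the actual Open vStorage API.

    # Illustrative only: not the real Open vStorage interface.
    class CacheQuotaManager:
        def __init__(self, default_quota_gib):
            self.default_quota_gib = default_quota_gib   # default cache share per vDisk
            self.overrides = {}                          # vDisk name -> quota in GiB

        def set_quota(self, vdisk, quota_gib):
            # Give a specific vDisk a larger (or smaller) share of the cache.
            self.overrides[vdisk] = quota_gib

        def quota_for(self, vdisk):
            return self.overrides.get(vdisk, self.default_quota_gib)

    # Example: 10 GiB per vDisk by default, 40 GiB for a cache-hungry database vDisk.
    # Oversubscription simply means the sum of all quotas may exceed the physical cache size.
    quotas = CacheQuotaManager(default_quota_gib=10)
    quotas.set_quota("db-01", 40)
    print(quotas.quota_for("db-01"), quotas.quota_for("web-01"))   # 40 10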

Block and Fragment cache

One more point about cache management in Open vStorage. There are actually 2 types of cache which can be configured in Open vStorage. The first one caches complete fragments, the result of erasure coding a Storage Container Object (SCO). Hence it is called the fragment cache and it is typically used for newly written data. The stored fragments are typically large in size so as to limit the amount of metadata, and consequently they aren’t ideal for (read) caching. The cache hit ratio is under normal circumstances inversely proportional to the size of the fragments. For that reason another cache, specifically tuned for read caching, was added. This block cache gets filled on reads and limits the size of the blocks in the cache to small blocks (e.g. 32-256 KB). This means a more granular approach can be taken during cache eviction, eventually leading to a higher cache hit ratio.
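As an illustration of why small cache units help, the sketch below shows a block-granular LRU cache: only cold blocks get evicted, whereas with fragment-sized entries hot and cold data would be evicted together. The class and its sizing are assumptions for illustration, not the actual block cache implementation.

    from collections import OrderedDict

    class BlockCache:
        """Illustrative LRU read cache keyed by (volume, block index)."""
        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.blocks = OrderedDict()       # (volume_id, block_index) -> data

        def get(self, volume_id, block_index):
            key = (volume_id, block_index)
            if key not in self.blocks:
                return None                   # cache miss: caller fetches from the backend
            self.blocks.move_to_end(key)      # mark as most recently used
            return self.blocks[key]

        def put(self, volume_id, block_index, data):
            key = (volume_id, block_index)
            self.blocks[key] = data
            self.blocks.move_to_end(key)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)   # evict only the least recently used block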

The Open vStorage High Performance Read Mesh (HPRM)

When you are developing a storage solution your biggest worry is data loss. As an Open vStorage platform can lose a server or even a complete data center without actual data loss, we are pretty sure we have that base covered. The next challenge is to make sure that safely stored data can be quickly accessed when needed. In this blog section we already discussed a lot of the performance improvements we made over the past releases. We introduced the Edge component for guaranteed performance, the accelerated ALBA as read cache, multiple proxies per volume driver and various performance tuning options.

Today it is time to introduce the latest performance improvement: High Performance Read Mesh (HPRM). This HPRM is an optimization of the read path and allows the compute host to directly fetch the data from the drives where the data is located. Earlier the read path always had to go through the Volume Driver before the data was fetched from the ASD. This newly introduced short read path can only be taken in case the Edge has the necessary metadata of where (SCO, fragment, disk) each LBA’s data is stored. In case the Edge doesn’t have the needed metadata, for example because the cached metadata is outdated, the slow path is taken through the Volume Driver. For the write path nothing is changed as all writes go through the Volume Driver.

The short read path which bypasses the Volume Driver has 2 direct advantages: lower latency on reads and less network traffic as data only goes once over the network. Next, the introduction of the HPRM also allows for a cost reduction on the hardware front. Since the hosts running the Volume Driver are no longer in the read path in many cases, they are freed up and can focus on processing incoming writes. This means the ratio between compute hosts running the Edge and the Volume Driver can be increased. Since the Volume Driver hosts are typically beefy servers with expensive NVMe devices for the write buffer and the distributed databases, a significant change in the Compute/Volume Driver ratio means a significant reduction of the hardware cost.

HPRM, the technical details

Let’s have a look under the hood at how the HPRM works. First we will have a look at the write path. The application, e.g. the hypervisor, writes to the block device exposed by the Edge client. The Edge client connects to its server part which, in turn, writes the data to the write buffer of the Volume Driver. Once enough writes are accumulated in the buffer, a SCO (Storage Container Object) is created and dispatched to the ALBA backend through the proxy. The proxy makes sure the data is spread across different ASDs according to the specified ALBA preset. Which ASDs contain the fragments of the SCO is stored in a manifest.
Once a read comes in for an LBA, the Edge client will check its local metadata cache for the SCO info and manifest of the SCO. If the info is available, the Edge will get the LBA data through the PRACC (Partial Read ACCelerator) client which can directly fetch the data from the ASDs. If the info isn’t available in the cache, or if it is outdated, the manifest and SCO info are retrieved by the Edge client from the Volume Driver and stored in the Edge metadata cache.
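A minimal sketch of this fast path / slow path decision is shown below. The function and method names (read, fetch, the metadata cache lookups) are assumptions for illustration and do not reflect the actual Edge implementation.

    def read_lba(edge_cache, lba, volume_driver, pracc_client):
        """Illustrative HPRM-style read: try the short path, fall back to the Volume Driver."""
        manifest = edge_cache.get(lba)                # SCO info + manifest cached on the Edge?
        if manifest is not None and not manifest.outdated:
            # Fast path: fetch the data directly from the ASDs that hold the fragments.
            return pracc_client.fetch(manifest, lba)
        # Slow path: read through the Volume Driver and refresh the Edge metadata cache
        # so the next read of this LBA can take the fast path.
        data, manifest = volume_driver.read(lba)
        edge_cache.put(lba, manifest)
        return data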
The Edge also pushes the IO statistics to the Volume Driver so these can be queried by the Framework or the monitoring components. Gathering IO statistics is done by the Edge as it is the only component that has a view on both the fast path, through the PRACC, and the slow path through the Volume Driver.


Note that the High Performance Read Mesh is part of the Open vStorage Enterprise Edition. Contact us for more info on the Open vStorage Enterprise Edition.

Connecting Open vStorage with Amazon

In an earlier blog post we already discussed that Open vStorage is the storage solution to implement a hybrid cloud. In this blog post we will explain the technical details on how Open vStorage can be used in a hybrid cloud context.

The components

For frequent readers of this blog the different Open vStorage components should not hold any secrets anymore. For newcomers we will give a short overview of the different components:

  • The Edge: a lightweight software component which exposes a block device API and connects across the network to the Volume Driver.
  • The Volume Driver: a log structured volume manager which converts blocks into objects.
  • The ALBA Backend: an object store optimized as backend for the Volume Driver.

Let’s see how these components fit together in a hybrid cloud context.

The architecture

The 2 main components of any hybrid cloud are an on-site, private part and a public part. Key in a hybrid cloud is that data and compute can move between the private and the public part as needed. As part of this thought exercise we take the example where we want to store data on premises in our private cloud and burst with compute into the public cloud when needed. To achieve this we need to install the components as follows:

The Private Cloud part
In the private cloud we install the ALBA backend components to create one or more storage pools. All SATA disks are gathered in a capacity backend while the SSD devices are gathered in a performance backend which accelerates the capacity backend. On top of these storage pools we will deploy one or more vPools. To achieve this we run a couple of Volume Driver instances inside our private cloud. On-site compute nodes with the Edge component installed can use these Volume Drivers to store data on the capacity backend.

The Public Cloud part
For the Public Cloud part, let’s assume we use Amazon AWS, there are multiple options depending on the desired performance. In case we don’t require a lot of performance we can use an Amazon EC2 instance with KVM and the Edge installed. To bring a vDisk live in Amazon, a connection is made across the internet with the Volume Driver in the private cloud. Alternatively an AWS Direct Connect link can be used for a lower latency connection. Writes to a vDisk which is exposed in Amazon will be sent by the Edge to the write buffer of the Volume Driver. This means that writes will only be acknowledged to the application using the vDisk once the write buffer located on premises has received the data. Since the Edge and the Volume Driver connect over a rather high latency link, the write performance isn’t optimal in this case.
In case more performance is required we need an additional Storage Optimized EC2 instance with one or more NVMe SSDs. In this second EC2 instance a Volume Driver instance is installed and the vPool is extended from the on-site, private cloud into Amazon. The NVMe devices of the EC2 instance are used to store the write buffer and the metadata DBs. It is of course possible to add some more EBS Provisioned IOPS SSDs to the EC2 instance as read cache. For an even better performance, use dedicated Open vStorage powered cache nodes in Amazon. Since the write buffer is located in Amazon the latency will be substantially lower than in the first setup.

Use cases

As the last part of this blog post we want to discuss some use cases which can be deployed on top of this hybrid cloud.

Analytics
Note that based upon the above architecture, a vDisk in the private cloud can be cloned into Amazon. The cloned vDisk can be used for business analytics inside Amazon without impacting the live workloads. When the analytics query is finished, the clone can be removed. The other way around is of course also possible. In that case the application data is stored in Amazon while the business analytics run on on-site compute hardware.

Disaster Recovery
Another use case is disaster recovery. As disaster recovery requires data to be both on premises and in the cloud, additional instances with a large amount of HDDs need to be added. Replication or erasure coding can be used to spread the data across the private and public cloud. In case of a disaster where the private cloud is destroyed, one can just add more compute instances running the Edge to bring the workloads live in the public cloud.

Data Safety
A last use case we want to highlight is for users that want to use public clouds but don’t trust these public cloud providers with all of their data. In that case you need to get some instances in each public cloud which are optimized for storing data. Erasure coding is used to chop the data into encrypted fragments. These fragments are spread across the public clouds in such a way that none of the public clouds stores the complete data set while the Edges and the Volume Drivers can still see the whole data set.

NSM and ABM, Arakoon teamwork

In an earlier post we shed some light on Arakoon, our own always consistent distributed key-value database. Arakoon is used in many parts of the Open vStorage platform. One of the use cases is to store the metadata of the native ALBA object store. Do note that ALBA is NOT a general purpose object store but specifically crafted and optimized for Open vStorage. ALBA uses a collection of Arakoon databases to store where and how objects are stored on the disks in the backend. Typically the SCOs and TLogs of each vDisk end up in a separate bucket, a namespace, on the backend. For each object in the namespace there is a manifest that describes where and how the object is stored on the backend. To glue the namespaces, the manifests and the disks in the backend together, ALBA uses 2 types of Arakoon databases: the ALBA Backend Manager (ABM) and one or more NameSpace Managers (NSM).

ALBA Manager

The ALBA Manager (ABM) is the entry point for all ALBA clients which want to store or retrieve something from the backend. The ALBA Manager DB knows which physical disks belong to the backend, which namespaces exist and on which NSM hosts they can be found.
To optimize the Arakoon DB it is loaded with the albamgr plugin, a collection of specific ABM user functions. Typically there is only a single ABM in a cluster.

NSM

A NameSpace Manager (NSM) is an Arakoon cluster which holds the manifests for the namespaces assigned to the NSM. Which NSM is managing which namespaces is registered with the ALBA Manager. The NSM is also the remote API offered by the NSM host to manipulate most of the object metadata during normal operation. Its coordinates can be retrieved from the ALBA Manager by (proxy) clients and maintenance agents.

To optimize the Arakoon DB it is loaded with the nsm_host plugin, a collection of specific NSM host user functions. Typically there are multiple NSM clusters for a single ALBA backend. This allows the backend to scale both capacity- and performance-wise.

IO requests

Let’s have a look at the IO path. Whenever the Volume Driver needs to store an object on the backend, a SCO or a TLog, it hands the object to one of the ALBA proxies on the same host. The ALBA proxy contains an ALBA client which communicates with the ABM to know on which NSM and disks it can store the object. Once the object is stored on the disks, the manifest with the metadata is registered in the NSM. For performance reasons the different fragments of the object and the manifest can be cached by the ALBA proxy.

In case the Volume Driver needs data from the backend, because it is no longer in the write buffer, it requests the proxy to fetch the exact data by asking for a SCO location and offset. In case the right fragments are in the fragment cache, the proxy returns the data immediately to the Volume Driver. Otherwise it uses the manifest from its cache or, if the manifest isn’t in the cache, the proxy contacts the ABM to get the right NSM and from that the manifest. Based upon the manifest the ALBA client fetches the data it needs from the physical disks and provides it to the Volume Driver.
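The read flow above can be summarised in a short sketch. The method names (fragment_cache, manifest_cache, get_nsm_for, get_manifest, read_fragments) are illustrative assumptions, not the actual ALBA proxy or client API.

    def proxy_read(proxy, sco_name, offset, length, abm_client):
        """Illustrative ALBA proxy read: fragment cache, then manifest cache, then ABM/NSM."""
        data = proxy.fragment_cache.get(sco_name, offset, length)
        if data is not None:
            return data                                # served straight from the fragment cache

        manifest = proxy.manifest_cache.get(sco_name)
        if manifest is None:
            # Ask the ABM which NSM manages this namespace, then fetch the manifest there.
            nsm_client = abm_client.get_nsm_for(sco_name)
            manifest = nsm_client.get_manifest(sco_name)
            proxy.manifest_cache.put(sco_name, manifest)

        # The manifest tells us on which ASDs the fragments live; fetch and reassemble.
        return proxy.read_fragments(manifest, offset, length)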

Fargo GA

After 3 Release Candidates and extensive testing, the Open vStorage team is proud to announce the GA (General Availability) release of Fargo. This release is packed with new features. Allow us to give a small overview:

NC-ECC presets (global and local policies)

NC-ECC (Network Connected-Error Correction Code) is an algorithm to store Storage Container Objects (SCOs) safely in multiple data centers. It consists of a global preset (across data centers) and multiple local presets (within a single data center). The NC-ECC algorithm is based on forward error correction codes and is further optimized for usage with a multi data center approach. When there is a disk or node failure, additional chunks will be created using only data from within the same data center. This ensures the bandwidth between data centers isn’t stressed in case of a simple disk failure.
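The sketch below illustrates the idea with made-up numbers: each data center holds a locally repairable group, so a single failed fragment is rebuilt from fragments in the same data center. The 4+2 layout and the function are assumptions for illustration; real presets are configurable and not necessarily these values.

    # Hypothetical layout: 3 data centers, each holding a locally repairable 4+2 group.
    data_centers = {
        "dc1": {"data_fragments": 4, "parity_fragments": 2},
        "dc2": {"data_fragments": 4, "parity_fragments": 2},
        "dc3": {"data_fragments": 4, "parity_fragments": 2},
    }

    def repair_disk_failure(dc):
        # A single lost fragment is rebuilt from the surviving fragments in the *same*
        # data center, so no traffic crosses the inter-datacenter links.
        group = data_centers[dc]
        survivors = group["data_fragments"] + group["parity_fragments"] - 1
        assert survivors >= group["data_fragments"]    # enough local fragments to rebuild
        return "rebuilt 1 fragment in %s from %d local fragments" % (dc, survivors)

    print(repair_disk_failure("dc2"))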

Multi-level ALBA

The ALBA backend now supports different levels. An all SSD ALBA backend can be used as performance layer in front of the capacity tier. Data is removed from the cache layer using a random eviction or Least Recently Used (LRU) strategy.

Open vStorage Edge

The Open vStorage Edge is a lightweight block driver which can be installed on Linux hosts and connects with the Volume Driver over the network (TCP/IP). By separating the Volume Driver and the Edge into different components, compute and storage can scale independently.

Performance optimized Volume Driver

By limiting the size of a volume’s metadata, the metadata now fits completely in RAM. To keep the metadata at an absolute minimum, deduplication was removed. You can read more about why we removed deduplication here. Other optimizations are multiple proxies per Volume Driver (the default is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (trying to read from ASDs in the same data center instead of going over the network to another data center).

Multiple ASDs per device

For low latency devices, adding multiple ASDs per device provides higher bandwidth to the device.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge. With Fargo all config files are now stored in a distributed config management system on top of our distributed database, Arakoon. More info can be found here.

Ubuntu 16.04

Open vStorage is now supported on Ubuntu 16.04, the latest Long Term Support (LTS) version of Ubuntu.

Smaller features in Fargo:

  • Improved the speed of the non-cached API and GUI queries by a factor 10 to 30.
  • Hardened the remove node procedure.
  • The GUI is adjusted to better highlight clusters which are spread across multiple sites.
  • The failure domain concept has been replaced by tag based domains. ASD nodes and storage routers can now be tagged with one or more tags. Tags can be used to identify a rack, site, power feed, etc.
  • 64TB volumes.
  • Browsable API with Swagger.
  • ‘asd-manager collect logs’, identical to ‘ovs collect logs’.
  • Support for the removal of the asd-manager packages.

Since this Fargo release introduces a completely new architecture (you can read more about it here) there is no upgrade possible between Eugene and Fargo. The full release notes can be found here.

Hybrid cloud, the phoenix of cloud computing

Introduction


Hybrid cloud, an integration between on-site private clouds and public clouds, has been declared dead many times over the past few years, but like a phoenix it keeps on resurrecting in the yearly IT technology and industry forecasts.

Limitations, hurdles and issues

Let’s first have a look at the numerous reasons why the hybrid cloud computing trend hasn’t taken off (yet):

  • Network limitations: connecting to a public cloud was often cumbersome as it required all traffic to go over slow, high latency public internet links.
  • Storage hurdles: implementing a hybrid cloud approach means storing data multiple times and keeping these multiple copies in sync.
  • Integration complexity: each cloud, whether private or public, has its own interface and standards which make integration unnecessarily difficult and complex.
  • Legacy IT: existing on-premise infrastructure is a reality and holds back a move to the public cloud. Next to the infrastructure component, applications were not built or designed in such a way that you can scale them up and down. Nor are they designed to store their data in an object store.

Taking the above into account, it shouldn’t come as a surprise that many enterprises saw public cloud computing as a check-in at Hotel California. The technical difficulties and the cost and risk of moving back and forth between clouds were just too big. But times are changing. According to McKinsey & Company, a leading management consulting firm, over the next 3 years enterprises are planning to transition IT workloads to a hybrid cloud infrastructure at a significant rate and pace.

Hybrid cloud (finally) taking off

I see a couple of reasons why the hybrid cloud approach is finally taking off:

Edge computing use case
Smart ‘devices’ such as self-driving cars are producing such large amounts of data that they can’t rely on public clouds to process it all. The data sometimes even drives real-time decisions where latency might be the difference between life and death. Inevitably, this requires that computing power shifts to the edges of the network. This Edge or Fog Computing concept is a textbook example of a hybrid cloud where on-site, or should we call it on-board, computing and centralized computing are grouped together into a single solution.

The network limitations are removed
The network limitations have been removed by services like AWS Direct Connect. With these you have a dedicated network connection from your premises to the Amazon cloud. All big cloud providers now offer the option of a dedicated network into their cloud. Pricing for dedicated 10GbE links in metropolitan regions like New York has also dropped significantly. For under $1,000 a month you can now get a sub-millisecond fibre connection from most buildings in New York to one of the many data centers in New York.

Recovery realisation
More and more enterprises with a private cloud realise the need for a disaster recovery plan.
In the past this meant getting a second private cloud. This approach multiplies the TCO by at least a factor of 2, as twice the amount of hardware needs to be purchased, and keeping both private clouds in sync makes disaster recovery plans only more complex. Enterprises are now turning disaster recovery into an asset instead of a cost. They use cheap, public cloud storage to store their off-site backups and copies. By adding compute capacity in peak periods or when disaster strikes they can bring these off-site copies online when needed. On top of that, additional business analytics can also use these off-site copies without impacting the production workloads.

Standardization
Over the past years, standards in cloud computing have crystallized. In the public cloud, Amazon has set the standard for storing unstructured data. On the private infrastructure side, the OpenStack ecosystem has made significant progress in streamlining and standardizing how complete clouds are deployed. Companies such as Cisco are now focusing on new services to manage and orchestrate clouds in order to smooth out the last bumps in the migration between different clouds.

Storage & legacy hardware: the problem children

Based upon the previous paragraphs one might conclude that all obstacles to move to the hybrid model have been cleared. This isn’t the case, as 2 issues still remain:

The legacy hardware problem
All current public cloud computing solutions ignore the reality that enterprises have a hardware legacy. While starting from scratch is the easiest solution, it is definitely not the cheapest. In order for the hybrid cloud to be successful, existing hardware must in some form or shape be able to be integrated into the hybrid cloud.

Storage roadblocks remain
In case you want to make use of multiple cloud solutions, the only option you have is to store a copy of each bit of data in each cloud. This x-way replication scheme solves the issue of data being available in all cloud locations, but it solves it at a high cost. Next to the replication cost, replication also adds significant latency as writes can only be acknowledged once all locations are up to date. This means that, when replication is used, hybrid clouds which span the east and west coast of the US are not workable.

Open vStorage removes those last obstacles

Open vStorage, a software based storage solution, allows multi-datacenter block storage in a much more nimble and cost-effective way than any traditional solution. This way it removes the last roadblocks towards hybrid cloud adoption.
Solving the storage puzzle
Instead of X-way replication, Open vStorage uses a different approach which can be compared to solving a Sudoku puzzle. All data is chopped up into chunks and additionally some parity chunks are adjoined. All these chunks, the data and parity chunks, are distributed across all the nodes, datacenters and clouds in the cluster. The number of parity chunks can be configured and allows, for example, recovery from a multi-node failure or a complete data center loss. A failure, whether it is a disk, node or data center, will cross out some numbers from the complete Sudoku puzzle, but as long as you have enough numbers left, you can still solve the puzzle. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes, data centers or clouds) left, you can always recover the data.
Unlike X-way replication, where data is only acknowledged once all copies are stored safely, Open vStorage allows data to be stored sub-optimally. The big advantage is that writes can be acknowledged even when not all data chunks have been written to disk yet. This makes sure that a single slow disk, datacenter or cloud doesn’t hold back applications and incoming writes. This approach lowers the write latency while keeping data safety at a high level.
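A small worked example of the chunk arithmetic, using made-up numbers (8 data chunks plus 4 parity chunks spread over 3 data centers); the real policies are configurable:

    # Hypothetical policy: 8 data chunks + 4 parity chunks, spread over 3 data centers.
    data_chunks, parity_chunks, data_centers = 8, 4, 3
    total_chunks = data_chunks + parity_chunks          # 12 chunks in total
    chunks_per_dc = total_chunks // data_centers        # 4 chunks per data center

    # Any 'data_chunks' chunks are enough to solve the puzzle and rebuild the data.
    max_lost_chunks = total_chunks - data_chunks        # up to 4 chunks may be lost
    survives_dc_loss = (total_chunks - chunks_per_dc) >= data_chunks
    print(max_lost_chunks, survives_dc_loss)            # 4 True: a full data center can be lost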

Legacy hardware
Open vStorage also allows legacy storage hardware to be included. As Open vStorage is a software based storage solution, it can turn any x86 hardware into a piece of the hybrid storage cloud.
Open vStorage leverages the capabilities of new media technologies like SSDs and PCI-e flash but also those of older technologies like large capacity traditional SATA drives. For applications that need above par performance additional SSDs and PCI-e flash cards can be added.

Summary

Hybrid Cloud has long been a model that was chased by many enterprises without any luck. Issues such as network and storage limitations and integration complexity have been major roadblocks on the hybrid cloud path. Over the last few years a lot of these roadblocks have been removed but issues with storage and legacy hardware remained. Open vStorage overcomes these last obstacles and paves the path towards hybrid cloud adoption.

Keeping an eye on an Open vStorage cluster

Open vStorage offers, as part of the commercial package, 2 options to monitor an Open vStorage cluster: the OPS team acts as a second set of eyes, or the OPS team has the keys, is in the driving seat and has full control. In both cases these large scale (+5PB) Open vStorage clusters send their logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let’s have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.

Elasticsearch & Kibana

To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearch, Logstash, Kibana) stack to gather logging information and centralise all this information into a single viewing pane.

The ELK-stack consists of 3 open source components:

  • Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
  • Logstash: a log pipeline tool which accepts various inputs and targets. In our case, it reads logging from a Redis queue and stores the messages in Elasticsearch.
  • Kibana: a visualisation tool on top of Elasticsearch.

Next to the ELK stack, Journalbeat is used to fetch the logging from all nodes of the cluster and put it onto Redis. Logstash consumes the Redis queue and stores the log messages in Elasticsearch. By aggregating all logs from a cluster into a single, unified view, detecting anomalies or finding correlations between issues becomes easier.
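Conceptually, the Logstash stage does something like the sketch below: pop log messages from a Redis list and index them into Elasticsearch. This is an illustrative Python stand-in, not the actual pipeline configuration; the Redis list name and index name are assumptions.

    import json
    import redis      # assumption: Journalbeat ships log lines into a Redis list
    import requests   # index documents via the Elasticsearch REST API

    r = redis.Redis(host="localhost", port=6379)

    while True:
        # Block until a log message is available on the queue, as Logstash would.
        _, raw = r.blpop("journalbeat")
        doc = json.loads(raw)
        # Store the log message in Elasticsearch so Kibana can visualise and query it.
        requests.post("http://localhost:9200/ovs-logs/_doc", json=doc)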

InfluxDB & Grafana

The many statistics that are being tracked are stored in InfluxDB, an open source database specifically designed to handle time series data. On top of InfluxDB, Grafana is used to visualize these statistics. The dashboards give a detailed view on the performance metrics of the cluster as a whole, but also of the individual components. The statistics are provided in an aggregated view, but an OPS member can also drill down to the smallest detail, such as the individual vDisk level. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, and safety of the objects in the backend, to the number of maintenance tasks that are running across the cluster.

CheckMK

To detect and escalate issues the Open vStorage team uses CheckMK, an extension to the open source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large scale (storage) clusters. These monitoring rules include general checks such as the CPU and RAM of a host, the services, network performance and disk health, but of course specific checks for Open vStorage components such as the Volume Driver or Arakoon have also been added. The output of the healthcheck also gets parsed by the CheckMK engine. In case of issues a fine-tuned escalation process is put into motion in order to resolve these issues quickly.

Arakoon, a battle hardened key-value DB

At Open vStorage we just love Arakoon, our in-house developed key-value DB. It is always consistent and hence prefers to die instead of giving you wrong data. Trust us, this is a good property if you are building a storage platform. It is also pretty fast, especially in a multi-datacenter topology. And above all, it is battle hardened over 7 years and it is now ROCK. SOLID.

We use Arakoon almost everywhere in Open vStorage. We use it to store the framework model, volume ownership and to keep track of the ALBA backend metadata. So it is time we tell you a bit more about that Arakoon beast. Arakoon is developed by the Open vStorage team and the core has been made available as open-source project on GitHub. It is already battle proven in several of the Open vStorage solutions and projects by technology leaders such as Western Digital and iQIYI, a subsidiary of Baidu.

Arakoon aims to be easy to understand and use, whilst at the same time taking the following features into consideration:

  • Consistency: The system as a whole needs to provide a consistent view on the distributed state. This stems from the experience that eventual consistency is too heavy a burden for a user application to manage. A simple example is the retrieval of the value for a key where you might receive none, one or multiple values depending on the weather conditions. The next question is always: why don’t I get a result? Is it because there is no value, or merely because I currently cannot retrieve it?
  • Conditional and Atomic Updates: We don’t need full blown transactions (would be nice to have though), but we do need updates that abort if the state is not what we expect it to be. So at least an atomic conditional update and an atomic multi-update are needed.
  • Robustness: The system must be able to cope with the failure of individual components, without concessions to consistency. However, whenever consistency can no longer be guaranteed, updates must simply fail.
  • Locality Control: When we deploy a system over 2 datacenters, we want guarantees that the entire state is indeed present in both datacenters. This is something we could not get from distributed hash tables using consistent hashing.
  • Healing & Recovery: Whenever a component dies and is subsequently revived or replaced, the system must be able to guide that component towards a situation where that node again fully participates. If this cannot be done fully automatically, then human intervention should be trivial.
  • Explicit Failure: Whenever there is something wrong, failure should propagate quite quickly.

Sounds interesting, right? Let’s share some more details on the Arakoon internals. It is a distributed key/value database. Since it is strongly consistent, it prefers to stop instead of providing outdated, faulty values, even in case multiple components fail. To achieve this Arakoon uses a variation of the Paxos algorithm. An Arakoon cluster consists of a small set of nodes that all contain the full range of key-value pairs in an internal DB. Next to this DB, each node contains the transaction log and a transaction queue. While each node of the cluster carries all the data, only one node is assigned to be the master node. The master node manages all the updates for all the clients. The nodes in the cluster vote to select the master node. As long as there is a majority to select a master, the Arakoon DB will remain accessible. To make sure the Arakoon DB can survive a datacenter failure, the nodes of the cluster are spread across multiple datacenters.

The steps to store a key in Arakoon

Whenever a key is to be stored in the database, the following flow is executed (a minimal sketch follows the list):

  1. Updates to the Arakoon database are consistent. An Arakoon client always looks up the master of a cluster and then sends a request to the master. The master node of a cluster has a queue of all client requests. The moment that a request is queued, the master node sends the request to all its slaves and writes the request in the Transaction Log (TLog). When the slaves receive a request, they also store it in their own TLog and send an acknowledgement to the master.
  2. The master waits for the acknowledgements of the slaves. When it receives an acknowledgement from half the nodes plus one, the master pushes the key/value pair into its database. In a five node setup (one master and four slaves), the master must receive an acknowledgement from two slaves before it writes its data to the database, since it is also counted as a node.
  3. After having written its data to its database, the master starts the next request in its queue. When a slave receives this new request, it first writes the previous request to its own database before handling the new request. This way a slave is always certain that the master has successfully written the data to its own database.
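The sketch below condenses this flow into a few lines of Python. It is a simplification under stated assumptions (synchronous calls, no master election or catch-up), not the actual Arakoon/Paxos implementation; all names are illustrative.

    def handle_update(master, slaves, request):
        """Illustrative master flow: log, replicate, wait for a majority, then apply."""
        master.tlog.append(request)                  # master writes the request to its TLog
        acks = 1                                     # the master itself counts as one node
        for slave in slaves:
            if slave.store_in_tlog(request):         # slave persists it in its own TLog
                acks += 1                            # ...and acknowledges to the master
        majority = (len(slaves) + 1) // 2 + 1        # half the nodes plus one
        if acks < majority:
            raise RuntimeError("no majority: fail the update rather than risk inconsistency")
        master.db.apply(request)                     # now safe to push the key/value pair

    # In a five node cluster (1 master + 4 slaves) the majority is 3, so the master
    # can apply the update as soon as two slaves have acknowledged.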

The benefits of using Arakoon

Scalability

Since the metadata of the ALBA backends gets sharded across different Arakoon clusters, scaling the metadata, capacity- or performance-wise, is as simple as adding more Arakoon nodes. The whole platform has been designed to store gigabytes of metadata without the metadata becoming a performance bottleneck.

High Availability

It is quite clear that keeping the metadata safe is essential for any storage solution. Arakoon is designed to be used in highly available clusters. By default it stores 3 replicas of the metadata, but for extra resilience 5-way replication or more can also be configured. These replicas can even be stored across locations, allowing for a multi-site block storage cluster which can survive a datacenter loss.

Performance

Arakoon was designed with performance in mind. OCaml was selected as programming language for its reliability and performance. OCaml provides powerful and succinct concurrency (cooperative multitasking), a must in distributed environments. To further boost performance a forced master capability is available which makes sure metadata reads are being served by local Arakoon nodes in case of a multi-site block storage cluster. With Arakoon the master node is local so it has a sub-millisecond latency. As an example, Cassandra, another distributed DB which is used in many projects, requires read consistency by reading the data from multiple datacenters. This leads to a latency that is typically higher than 10 milliseconds.

Accelerated ALBA as read cache

With the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post, but I would also like to shed some light on the various reasons why we took the decision to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache, on every write, old data in the cache had to be overwritten. In order to evict data from the cache a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while. The lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also delays read requests, as the lock prevents data from being safely read. Basically, in order to increase the performance we had to move towards a lockless read cache where data isn’t updated in place.
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size. This greatly improves the write bandwidth to the SSDs. ALBA also allows cores to be aligned with the ASD processes and underlying SSDs. By making the whole all-flash ALBA backend core aligned, the overhead of process switching can be minimised. Basically, all operations on flash are now asynchronous, core aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.

Lower impact of an SSD failure

By moving the read cache to the ALBA backend, the impact of an SSD failure is much lower. ALBA allows erasure coding to be performed across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed and the impact of an SSD failure is limited, as only a fraction of the cache is lost. So in case a single SSD fails, there is no reason to go to the HDD based capacity backend, as the reads can still be fulfilled based upon other fragments of the data which are still cached.

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migration wasn’t always successful due to the cold cache on the new host. By using the new distributed cache approach, we now have an always hot cache, even in case of (live) migrations.

We hope the above reasons prove that we took the right decision by moving the read cache to the ALBA backend. Want to see how to configure the ALBA read cache? Check out this GeoScale demo.