Keeping an eye on an Open vStorage cluster

Open vStorage offers, as part of its commercial package, 2 options to monitor an Open vStorage cluster: the OPS team acts as a second set of eyes, or the OPS team holds the keys, sits in the driving seat and has full control. In both cases these large scale (+5PB) Open vStorage clusters send their logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let’s have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.

Elasticsearch & Kibana

To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearch, Logstash, Kibana) stack to gather logging information and centralise all this information into a single viewing pane.

The ELK-stack consists of 3 open source components:

  • Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
  • Logstash: a log pipeline tool which accepts various inputs and targets. In our case, it reads logs from a Redis queue and stores them in Elasticsearch.
  • Kibana: a visualisation tool on top of Elasticsearch.

Next to the ELK stack, Journalbeat is used to fetch the logs from all nodes of the cluster and put them onto Redis. Logstash consumes the Redis queue and stores the log messages in Elasticsearch. By aggregating all logs from a cluster into a single, unified view, detecting anomalies or finding correlations between issues becomes easier.
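The flow above can be sketched in a few lines of Python. This is a self-contained stand-in that uses an in-memory deque and list instead of the real Redis queue and Elasticsearch index, purely to illustrate how a log message travels through the pipeline; the field names are illustrative, not the actual document schema.

```python
import json
from collections import deque

# In-memory stand-ins for the Redis queue and the Elasticsearch index.
# In the real pipeline Journalbeat pushes JSON log lines onto Redis and
# Logstash pops them off and indexes the documents into Elasticsearch.
redis_queue = deque()
es_index = []

def ship_log(host, service, message):
    """Journalbeat role: push a structured log line onto the queue."""
    redis_queue.append(json.dumps(
        {"host": host, "service": service, "message": message}))

def consume_queue():
    """Logstash role: drain the queue and 'index' each document."""
    while redis_queue:
        doc = json.loads(redis_queue.popleft())
        es_index.append(doc)

ship_log("node01", "volumedriver", "volume vd-1 restarted")
consume_queue()
```

Once all nodes ship into the same index, a Kibana query over `host` and `service` gives the single unified view described above.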

InfluxDB & Grafana

The many statistics that are being tracked are stored in InfluxDB, an open source database specifically designed to handle time series data. On top of InfluxDB, Grafana is used to visualize these statistics. The dashboards give a detailed view on the performance metrics of the cluster as a whole but also of the individual components. The statistics are provided in an aggregated view, but an OPS member can also drill down to the smallest detail, such as the individual vDisk level. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, and safety of the objects in the backend, to the number of maintenance tasks running across the cluster.
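As a small illustration of how such a time series sample reaches InfluxDB, the sketch below renders one metric in InfluxDB’s line protocol (measurement, tag set, field set, timestamp). The measurement, tag and field names are invented for the example and are not the actual Open vStorage metric names.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one metric sample in InfluxDB line protocol:
    measurement,tag=value,... field=value,... timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

# One hypothetical per-vDisk sample, tagged so Grafana can drill down
# from the cluster view to an individual vDisk.
line = to_line_protocol(
    "vdisk_stats",
    {"vdisk": "vd-001", "node": "storagerouter01"},
    {"iops": 1200, "latency_us": 850},
    1483228800000000000)
```

Tags are indexed in InfluxDB, which is what makes the aggregated-view-to-single-vDisk drill-down cheap.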

CheckMK

To detect and escalate issues the Open vStorage team uses CheckMK, an extension to the open source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large scale (storage) clusters. These monitoring rules include general checks such as the CPU and RAM of a host, the services, network performance and disk health, but of course specific checks for Open vStorage components such as the Volume Driver or Arakoon have also been added. The output of the healthcheck also gets parsed by the CheckMK engine. In case of issues a fine-tuned escalation process is set in motion in order to resolve these issues quickly.

Arakoon, a battle hardened key-value DB

At Open vStorage we just love Arakoon, our in-house developed key-value DB. It is always consistent and hence prefers to die instead of giving you wrong data. Trust us, this is a good property if you are building a storage platform. It is also pretty fast, especially in a multi-datacenter topology. And above all, it has been battle hardened over 7 years and is now ROCK. SOLID.

We use Arakoon almost everywhere in Open vStorage. We use it to store the framework model, volume ownership and to keep track of the ALBA backend metadata. So it is time we tell you a bit more about that Arakoon beast. Arakoon is developed by the Open vStorage team and the core has been made available as an open-source project on GitHub. It is already battle proven in several of the Open vStorage solutions and projects by technology leaders such as Western Digital and iQIYI, a subsidiary of Baidu.

Arakoon aims to be easy to understand and use, whilst at the same time taking the following features into consideration:

  • Consistency: The system as a whole needs to provide a consistent view on the distributed state. This stems from the experience that eventual consistency is too heavy a burden for a user application to manage. A simple example is the retrieval of the value for a key where you might receive none, one or multiple values depending on the weather conditions. The next question is always: Why don’t I get a result? Is it because there is no value, or merely because I currently cannot retrieve it?
  • Conditional and Atomic Updates: We don’t need full blown transactions (would be nice to have though), but we do need updates that abort if the state is not what we expect it to be. So at least an atomic conditional update and an atomic multi-update are needed.
  • Robustness: The system must be able to cope with the failure of individual components, without concessions to consistency. However, whenever consistency can no longer be guaranteed, updates must simply fail.
  • Locality Control: When we deploy a system over 2 datacenters, we want guarantees that the entire state is indeed present in both datacenters. This is something we could not get from distributed hash tables using consistent hashing.
  • Healing & Recovery: Whenever a component dies and is subsequently revived or replaced, the system must be able to guide that component towards a situation where that node again fully participates. If this cannot be done fully automatically, then human intervention should be trivial.
  • Explicit Failure: Whenever there is something wrong, failure should propagate quite quickly.
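The “Conditional and Atomic Updates” requirement above can be illustrated with a toy test-and-set against an in-memory dict. Arakoon’s real client API is richer than this, and the method names here are only illustrative, but the contract is the same: the update is applied only if the current state matches the expectation.

```python
class KV:
    """Toy in-memory store illustrating Arakoon-style conditional updates."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def test_and_set(self, key, expected, new):
        """Atomically set key to new only if its current value equals
        expected (None meaning 'key absent'). Returns the value that was
        seen, so the caller can tell whether the update was applied."""
        current = self._data.get(key)
        if current == expected:
            if new is None:
                self._data.pop(key, None)
            else:
                self._data[key] = new
        return current

kv = KV()
kv.test_and_set("owner/vd-1", None, "node01")             # applied: key was absent
seen = kv.test_and_set("owner/vd-1", "node02", "node03")  # rejected: owner changed
```

This is exactly the primitive you need to claim volume ownership safely: two nodes racing for the same key cannot both win.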

Sounds interesting, right? Let’s share some more details on the Arakoon internals. It is a distributed key/value database. Since it is strongly consistent, it prefers to stop instead of providing outdated, faulty values, even when multiple components fail. To achieve this Arakoon uses a variation of the Paxos algorithm. An Arakoon cluster consists of a small set of nodes that all contain the full range of key-value pairs in an internal DB. Next to this DB, each node keeps a transaction log and a transaction queue. While each node of the cluster carries all the data, only one node is assigned to be the master node. The master node manages all the updates for all the clients. The nodes in the cluster vote to select the master node; as long as there is a majority to select a master, the Arakoon DB remains accessible. To make sure the Arakoon DB can survive a datacenter failure, the nodes of the cluster are spread across multiple datacenters.

The steps to store a key in Arakoon

Whenever a key is to be stored in the database, the following flow is executed:

  1. Updates to the Arakoon database are consistent. An Arakoon client always looks up the master of a cluster and then sends its request to the master. The master node of a cluster keeps a queue of all client requests. The moment a request is queued, the master sends the request to all its slaves and writes the request to its Transaction Log (TLog). When the slaves receive a request, they store it in their own TLog and send an acknowledgement to the master.
  2. The master waits for the acknowledgements of the slaves. Once it has received acknowledgements from half the nodes plus one, the master pushes the key/value pair into its database. In a five-node setup (one master and four slaves), the master must receive acknowledgements from two slaves before it writes the data to its database, since the master itself also counts towards the majority.
  3. After writing the data to its database, the master starts on the next request in its queue. When a slave receives this new request, it first writes the previous request to its own database before handling the new one. This way a slave is always certain that the master has successfully written the data to its own database.
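Roughly, the steps above boil down to the following sketch: every node appends the update to its TLog, acknowledgements are counted, and the update is applied to the master database only once a majority (master included) has acknowledged. This is a deliberate simplification, in reality the acknowledgements of course travel over the network and slaves apply lazily.

```python
def quorum(n_nodes):
    # Majority of the cluster, master included: half the nodes plus one.
    return n_nodes // 2 + 1

def master_update(nodes, key, value):
    """Sketch of the write flow: append to every TLog, count the acks,
    and apply to the master DB once a majority has acknowledged."""
    acks = 0
    for node in nodes:
        node["tlog"].append((key, value))  # slave stores it in its own TLog
        acks += 1                          # ...and acks over the network
    if acks >= quorum(len(nodes)):
        nodes[0]["db"][key] = value        # nodes[0] plays the master here
        return True
    return False

cluster = [{"tlog": [], "db": {}} for _ in range(5)]
ok = master_update(cluster, "k", "v")
```

With 5 nodes the quorum is 3, matching the text: the master plus acknowledgements from two slaves.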

The benefits of using Arakoon

Scalability

Since the metadata of the ALBA backends is sharded across different Arakoon clusters, scaling the metadata, capacity- or performance-wise, is as simple as adding more Arakoon nodes. The whole platform has been designed to store gigabytes of metadata without the metadata becoming a performance bottleneck.
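The sharding idea can be sketched as a stable hash mapping each object’s metadata to one of the Arakoon clusters. The actual ALBA sharding logic differs from this toy version; it only illustrates why adding clusters spreads the metadata load.

```python
import hashlib

def shard_for(object_name, n_clusters):
    """Map an object name to one of n Arakoon metadata clusters with a
    stable hash, so every client picks the same shard for the same object."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return int(digest, 16) % n_clusters
```

Because the mapping depends only on the name and the cluster count, no central lookup service is needed on the read path.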

High Availability

It is quite clear that keeping the metadata safe is essential for any storage solution. Arakoon is designed to be used in highly available clusters. By default it stores 3 replicas of the metadata, but for extra resilience 5-way replication or more can also be configured. These replicas can even be stored across locations, allowing for a multi-site block storage cluster which can survive the loss of a datacenter.
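The replica counts translate directly into failure tolerance. A quick sketch, under the assumption that a majority of the replicas must survive for the cluster to stay available (as described in the Arakoon section above):

```python
def tolerated_failures(replicas):
    """Number of node losses a majority-based cluster survives:
    a majority of `replicas` nodes must remain reachable."""
    return (replicas - 1) // 2
```

So the default 3-way replication survives 1 lost node, and 5-way replication survives 2, which is what makes a multi-site deployment able to lose an entire datacenter.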

Performance

Arakoon was designed with performance in mind. OCaml was selected as the programming language for its reliability and performance. OCaml provides powerful and succinct concurrency (cooperative multitasking), a must in distributed environments. To further boost performance, a forced master capability is available which makes sure metadata reads are served by local Arakoon nodes in case of a multi-site block storage cluster. With a forced master the master node is local, so reads have sub-millisecond latency. As an example, Cassandra, another distributed DB which is used in many projects, requires read consistency by reading the data from multiple datacenters. This leads to a latency that is typically higher than 10 milliseconds.

Accelerated ALBA as read cache

With the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post, but I would also like to shed some light on the various reasons why we took the decision to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache, on every write, old data in the cache had to be overwritten. In order to evict data from the cache, a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while. The lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also delays read requests, as the lock prevents data from being safely read. Basically, in order to increase the performance we had to move towards a lockless read cache where data isn’t updated in place.
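For illustration, the old locked design can be sketched as a hash table plus LRU order guarded by a single lock, so that every insert or eviction stalls concurrent reads. This is a simplified model of the behaviour described above, not the actual Volume Driver code.

```python
import threading
from collections import OrderedDict

class LockedReadCache:
    """Sketch of the old design: one hash (volume ID + LBA -> SSD location)
    plus LRU ordering, both guarded by a single lock. Every write-side
    insert/evict holds the lock, and reads must wait for it too."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._lock = threading.Lock()
        self._entries = OrderedDict()   # (volume_id, lba) -> ssd_location

    def put(self, volume_id, lba, location):
        with self._lock:                # whole cache locked during update
            key = (volume_id, lba)
            self._entries[key] = location
            self._entries.move_to_end(key)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)   # evict the LRU entry

    def get(self, volume_id, lba):
        with self._lock:                # reads stall behind writers
            entry = self._entries.get((volume_id, lba))
            if entry is not None:
                self._entries.move_to_end((volume_id, lba))
            return entry

cache = LockedReadCache(capacity=2)
cache.put("vd-1", 0, "ssd0/0")
cache.put("vd-1", 1, "ssd0/1")
cache.put("vd-1", 2, "ssd0/2")   # evicts ("vd-1", 0)
```

Because even `get` takes the lock (it mutates the LRU order), the lock sits on every I/O path, which is the bottleneck the ALBA approach removes.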
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size. This greatly improves the write bandwidth to the SSDs. ALBA also allows aligning cores with the ASD processes and underlying SSDs. By making the whole all-flash ALBA backend core-aligned, the overhead of process switching can be minimised. Basically all operations on flash are now asynchronous, core-aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.

Lower impact of an SSD failure

By moving the read cache to the ALBA backend, the impact of an SSD failure is much lower. ALBA allows erasure coding across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed and the impact of an SSD failure is limited, as only a fraction of the cache is lost. So in case a single SSD fails, there is no need to go to the HDD-based capacity backend, as reads can still be served from the other fragments of the data which are still cached.
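A quick sketch of why a single SSD failure no longer forces reads to the capacity backend: with k data and m parity fragments per cached object, spread over different SSDs, a read still succeeds as long as any k fragments survive. The k and m values below are example numbers, not Open vStorage defaults.

```python
def read_possible(k, m, lost_fragments):
    """With k data + m parity fragments on distinct SSDs, a cached read
    succeeds as long as at least k of the k+m fragments survive."""
    return (k + m - lost_fragments) >= k

# Example: 8+4 erasure coding tolerates up to 4 lost fragments per object,
# so one failed SSD (one fragment per object) never breaks a cached read.
```

In the old design a failed SSD meant the whole local cache for those volumes was gone; here it costs one fragment out of k+m.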

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migration wasn’t always successful due to the cold cache on the new host. By using the new distributed cache approach, we now have an always-hot cache, even in case of (live) migrations.

We hope the above reasons prove that we took the right decision by moving the read cache to the ALBA backend. Want to see how to configure the ALBA read cache? Check out this GeoScale demo.

The Edge, a lightweight block device

When I present the new Open vStorage architecture for Fargo, I almost always receive the following question about the Edge:

What is the Edge and why did you develop it?

What is the Edge about?

The Edge is a lightweight software component which can be installed on a Linux host. It exposes a block device API and connects to the Storage Router across the network (TCP/IP or RDMA). Basically the application believes it is talking to a local block device (the Edge) while the volume actually runs on another host (the Storage Router).
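Conceptually, the Edge can be modelled as a thin client exposing block-sized reads and writes and forwarding them to a remote store. In this hypothetical sketch a plain dict stands in for the Storage Router on the other end of the TCP/RDMA link; the class and its 4 KiB block size are illustrative, not the real Edge protocol.

```python
class EdgeClient:
    """Toy model of the Edge: a block device API on the compute host,
    with the actual volume living behind a 'network' link (here: a dict
    standing in for the remote Storage Router)."""

    BLOCK = 4096  # illustrative block size in bytes

    def __init__(self, remote_store):
        self.remote = remote_store      # stands in for the TCP/RDMA link

    def write(self, lba, data):
        if len(data) != self.BLOCK:
            raise ValueError("Edge I/O happens in whole blocks")
        self.remote[lba] = data         # Storage Router persists the block

    def read(self, lba):
        # Unwritten blocks read back as zeroes, like a fresh block device.
        return self.remote.get(lba, b"\x00" * self.BLOCK)

store = {}
edge = EdgeClient(store)
edge.write(7, b"a" * EdgeClient.BLOCK)
```

Because the client holds no volume state of its own, pointing it at a different Storage Router (the HA case described below) is cheap.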

Why did we develop the Edge?

The reason why we have developed the Edge is quite simple: componentization. With Open vStorage we are mainly dealing with large, multi-petabyte deployments and having this Edge component gives additional benefits in large environments:

Scalability

In large environments you want to be able to scale the compute and storage parts independently. If you run Open vStorage hyper-converged, as advised with earlier versions, this isn’t possible. As a consequence, if you need more RAM or CPU to run VMs, you also have to invest in more SSDs. With the Edge you can scale compute and storage independently.

Guaranteed performance

With Eugene, the Volume Driver, the high-performance distributed block layer, ran on the compute host together with the VMs. This resulted in the VMs and the Volume Driver fighting for the same CPU and RAM resources, a typical issue with hyper-converged solutions. The Edge component avoids this problem: it runs on the compute hosts (and requires only a small amount of resources) while the Volume Drivers run on dedicated nodes, and hence provides a predictable and consistent amount of IOPS to the VMs.

Limit the Impact of Updates

Storage software updates are a (storage) administrator’s worst nightmare. In previous Open vStorage versions an update of the Volume Driver required all VMs on that node to be migrated or brought down. With the Edge the Volume Driver can be updated in the background, as each Edge/compute host has HA features and can automatically connect to another Volume Driver on request without the need for a VM migration.

Fargo RC3

We released Fargo RC3. This release focuses on bugfixing (13 bugs fixed) and stability.

Some items were also added to improve the supportability of an Open vStorage cluster:

  • Improved the speed of the non-cached API and GUI queries by a factor of 10 to 30.
  • It is now possible to add more NSM clusters to store the data for a backend through an API instead of doing it manually.
  • Setting a clone as a template is now blocked.
  • Hardened the remove node procedure.
  • Removed ETCD support for the config management as it was no longer maintained.
  • Added an indicator in the GUI which displays when a domain is set as a recovery domain but not as a primary domain anywhere in the cluster.
  • Support for the removal of the ASD manager.
  • Added a call to list the manually started jobs (e.g. verify namespace) on ALBA.
  • Added a timestamp to list-asds so it can be tracked how long an ASD has been part of the backend.
  • Removed the Volume Driver test which created a new volume in the Health Check, as it generated too many false positives to be used reliably.

The different ALBA Backends explained

With the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the geoscale functionality. In a geoscale cluster, the data is spread over multiple datacenters. If one of the datacenters goes offline, the geoscale cluster stays up and running and continues to serve data.

The geoscale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collection of physical disks, devices or even other backends. Next to grouping disks or backends, it also defines how data is stored on its constituents: parameters such as the erasure coding/replication factor, compression and encryption need to be defined. Ordinarily a geoscale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, had only 1 type of backend, there are now 2 types: local and global backends.

  • A local backend groups physical devices. This type is typically used to group disks within the same datacenter.
  • A global backend combines multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters.
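The relation between the two types can be sketched as a simple composition: a global backend is a collection of local backends, each of which groups physical devices. The class and device names below are illustrative, not the actual Open vStorage model.

```python
class LocalBackend:
    """Groups physical devices, typically within one datacenter."""
    def __init__(self, name, devices):
        self.name, self.devices = name, devices

class GlobalBackend:
    """Combines (local) backends, typically one per datacenter."""
    def __init__(self, name, backends):
        self.name, self.backends = name, backends

    def all_devices(self):
        # The global backend's capacity is the union of its constituents.
        return [d for b in self.backends for d in b.devices]

dc1 = LocalBackend("dc1-hdd", ["hdd-1", "hdd-2"])
dc2 = LocalBackend("dc2-hdd", ["hdd-3", "hdd-4"])
geo = GlobalBackend("geo-capacity", [dc1, dc2])
```

Note the composition is one-way: a global backend contains local backends, while each local backend keeps its own erasure coding, compression and encryption settings.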

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation is based upon the performance of the devices in the datacenter: an SSD backend is created with fast, low-latency devices and an HDD backend with slow(er) devices optimised for capacity. In some cases the SSD or HDD backend is split into more backends if it contains many devices, for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend.
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. Of course this doesn’t make sense for an SSD backend, as the latency between the datacenters will be a performance-limiting factor.
Another reason to create multiple backends is if you want to offer each customer their own set of physical disks for security or compliance reasons. In that case a backend is created per customer.

vPool

A vPool is a configuration template for vDisks, the volumes being served by Open vStorage. This template contains a whole range of parameters, such as the blocksize to be used, the SCO size on the backend, the default write buffer size, the preset to be used for data protection, the hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. These last 2 are particularly interesting as they express how different ALBA backends are tied together. When you create a vPool you select a backend to store the volume data. This can be a local backend, SSD for an all-flash experience, or a global backend in case you want to spread data over multiple datacenters. This backend is used by every Storage Router serving the vPool. If you use a global backend across multiple datacenters, you will want to use some sort of caching in the local datacenter where the volume is running, in order to keep the read latency as low as possible. To achieve this, assign a local SSD backend when extending the vPool to a certain Storage Router. All volumes being served by that Storage Router will then, on a read, first check whether the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use a different cache backend. This approach keeps hot data in the local SSD cache and stores cold data on the capacity backend which is distributed across datacenters. By using this approach Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.
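To make the template idea concrete, here is a hypothetical sketch of a vPool as a dataclass. The field names and defaults are invented for the example and do not match the actual Open vStorage parameter names; the point is that the data backend is shared by every Storage Router serving the vPool, while the cache backend is chosen per site.

```python
from dataclasses import dataclass, field

@dataclass
class VPool:
    """Configuration template shared by all vDisks in the pool.
    Field names are illustrative, not real Open vStorage parameters."""
    name: str
    block_size: int = 4096            # blocksize used by the vDisks
    sco_size_mb: int = 64             # SCO size on the backend
    write_buffer_mb: int = 512        # default write buffer size
    preset: str = "default"           # data protection preset
    data_backend: str = "geo-capacity"  # one backend for the whole vPool
    # Per-Storage-Router cache backend: different sites cache locally.
    cache_backend_per_site: dict = field(default_factory=dict)

pool = VPool("vpool01")
pool.cache_backend_per_site["storagerouter-dc1"] = "dc1-ssd"
pool.cache_backend_per_site["storagerouter-dc2"] = "dc2-ssd"
```

Both Storage Routers store cold data on the same `geo-capacity` backend, but each reads hot data from its own local SSD backend, which is exactly the split described above.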

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, groupings of vDisks which share the same config, are the glue between these different backends.

Interview With Bob Griswold, Chairman, Open vStorage

The website Storage Newsletter did an interview with our own Bob Griswold, Chairman of Open vStorage. In this Q&A Bob answered various questions, such as what kind of storage product Open vStorage is, what our vision is, our open source involvement, our business model and many more interesting topics.

Read the complete interview here.

Fargo RC2

We released Fargo RC2. The biggest new items in this release:

  • Multiple performance improvements, such as multiple proxies per Volume Driver (the default is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same datacenter instead of going over the network to another datacenter).
  • An API to limit the amount of data that gets loaded into the memory of the Volume Driver host. Instead of loading all metadata of a vDisk into RAM, you can now specify the percentage of RAM it may take.
  • A counter which keeps track of the number of invalid checksums per ASD so we can flag bad ASDs faster.
  • The scrub proxy can now be configured to be cache-on-write.
  • Implemented timeouts for the Volume Driver calls.

The team also solved 110 issues between RC1 and RC2. An overview of the complete content can be found here: Added Features | Added Improvements | Solved Bugs

File Storage blogpost: impressive and probably one of the most comprehensive

File Storage, one of the leading blogs about storage, featured Open vStorage in one of its latest blog posts. You can read the full blog post here. We believe they are 100% correct in their conclusion:

The Open vStorage solution is really impressive and is probably one of the most comprehensive in its category.

Just a small note, we are not confidential, rather we are conservative and hence not well known yet. It takes years to build and stabilize a storage system of the scale we’ve built with Open vStorage!

2017, the Open vStorage predictions

2017 promises to be an interesting year for the storage industry. New technology is knocking at the door and current technology will not surrender without a fight. Not only will new technology influence the market, the storage market itself is morphing:

Further Storage consolidation

Let’s say that December 2015 was an appetizer, with NetApp buying SolidFire. But in 2016 the storage market went through the first wave of consolidation: Docker storage start-up ClusterHQ shut its doors, Violin Memory filed for Chapter 11, Nutanix bought PernixData, NexGen was acquired by Pivot3, Broadcom acquired Brocade and Samsung acquired Joyent. Lastly there was also the mega merger between storage mogul EMC and Dell. This consolidation trend will continue in 2017, as the environment for hyper-converged, flash and object storage startups is getting tougher because all the traditional vendors now offer their own flavor. As the hardware powering these solutions is commodity, the only differentiator is software.

Some interesting names to keep an eye on for M&A action or closure: Cloudian, Minio, Scality, Scale Computing, Stratoscale, Atlantis Computing, HyperGrid/Gridstore, Pure Storage, Tegile, Kaminario, Tintri, Nimble Storage, SimpliVity, Primary Data, … We are pretty sure some of these names will not make it past 2017.

Open vStorage has already a couple of large projects lined up. 2017 sure looks promising for us.

The Hybrid cloud

Back from the dead like a phoenix: I expect a new life for the hybrid cloud. Enterprises increasingly migrated to the public cloud in 2016 and this will only accelerate, both in speed and in numbers. There are now 5 big clouds: Amazon AWS, Microsoft Azure, IBM, Google and Oracle.
But connecting these public clouds with in-house datacenter assets will be key. The gap between public and private clouds has never been smaller. AWS and VMware, 2 front runners, are already offering products to migrate between both worlds. Network infrastructure (performance, latency) is now finally also capable of turning the hybrid cloud into reality. Numerous enterprises will realise that going to the public cloud isn’t the only option for future infrastructure. I believe the migration of storage and workloads will be one of the hottest features of Open vStorage in 2017. Hand in hand with the migration of workloads, we will see the birth of various new storage-as-a-service providers offering S3, secondary but also primary storage out of the public cloud.

On a side note, HPE (Helion), Cisco (Intercloud) and telecom giant Verizon closed their public clouds in 2016. It will be good to keep an eye on these players to see what they are up to in 2017.

The end of Hyper-Convergence hype

In the storage market prediction for 2015 I predicted the rise of hyper-convergence. Hyper-converged solutions have lived up to their expectations and have become a mature software solution. I believe 2017 will mark a turning point for the hyper-convergence hype. Let’s sum up some reasons for the end of the hype cycle:

  • The hyper-converged market is mature and the top use cases have been identified: SMB environments, VDI and Remote Office/Branch Office (ROBO).
  • Private and public clouds are becoming more and more centralised and large scale. More enterprises will come to understand that the one-size-fits-all, everything-in-a-single-box approach of hyper-converged systems doesn’t scale to the datacenter level. This is typically an area where hyper-converged solutions reach their limits.
  • The IT world works like a pendulum. Hyper-convergence brought flash as cache into the server because the latency to fetch data over the network was too high. With RDMA and round-trip times of 10 µs and below, the latency of the network is no longer the bottleneck. The pendulum is now changing direction, as the so-called web-scalers, the companies on which the hyper-convergence hype is centered, want to disaggregate storage by moving flash out of each individual server into more flexible, centralized repositories.
  • Flash, flash, flash: everything is becoming flash. As stated earlier, the local flash device was used to accelerate slow SATA drives. With all-flash versions, these hyper-converged solutions now go head to head with all-flash arrays.

One of the leaders of the hyper-converged pack has already started to move into the converged infrastructure direction by releasing a storage only appliance. It will be interesting to see who else follows.

With the new Fargo architecture, which is designed for large scale, multi-petabyte, multi-datacenter environments, we already capture the next trend: meshed, hyper-aggregated architectures. The Fargo release supports RDMA, allows building all-flash storage pools and incorporates a distributed cache across all flash in the datacenter. 100% future proof and ready to kickstart 2017.

PS: If you want to run Open vStorage hyper-converged, feel free to do so. We have componentized Open vStorage so you can optimize it for your use case: run everything in a single box or spread the components across different servers or even datacenters!

IoT storage lakes

More and more devices are connected to the internet. This Internet of Things (IoT) is poised to generate a tremendous amount of data. Not convinced? Intel research, for example, estimated that autonomous cars will produce 4 terabytes of data daily per car. These big data lakes need a new type of storage: storage which is ultra-scalable. Traditional storage is simply not suited to process this amount of data. On top of that, in 2017 we will see artificial intelligence increasingly being used to mine the data in these lakes. This means the performance of the storage needs to be able to serve real-time analytics. Since IoT devices can be located anywhere in the world, geo-redundancy and geo-distribution are also required. Basically, IoT use cases are a perfect match for the Open vStorage technology.

Some interesting fields and industries to follow are consumer goods (smart thermostats, IP cameras, toys, …), automotive and healthcare.