Hybrid cloud, the phoenix of cloud computing

Introduction


Hybrid cloud, the integration of on-site private clouds with public clouds, has been declared dead many times over the past few years, but like a phoenix it keeps resurrecting in the yearly IT technology and industry forecasts.

Limitations, hurdles and issues

Let’s first have a look at the numerous reasons why the hybrid cloud computing trend hasn’t taken off (yet):

  • Network limitations: connecting to a public cloud was often cumbersome as it required all traffic to travel over slow, high-latency public internet links.
  • Storage hurdles: implementing a hybrid cloud approach means storing data multiple times and keeping these multiple copies in sync.
  • Integration complexity: each cloud, whether private or public, has its own interfaces and standards, which makes integration unnecessarily difficult and complex.
  • Legacy IT: existing on-premises infrastructure is a reality and holds back a move to the public cloud. Next to the infrastructure component, existing applications were not built or designed to scale up and down, nor were they designed to store their data in an object store.

Taking the above into account, it shouldn’t come as a surprise that many enterprises saw public cloud computing as a check-in at Hotel California: the technical difficulties and the cost and risk of moving back and forth between clouds were simply too great. But times are changing. According to McKinsey & Company, a leading management consulting firm, enterprises are planning to transition IT workloads to hybrid cloud infrastructure at a significant rate and pace over the next 3 years.

Hybrid cloud (finally) taking off

I see a couple of reasons why the hybrid cloud approach is finally taking off:

Edge computing use case
Smart ‘devices’ such as self-driving cars produce such large amounts of data that they can’t rely on public clouds to process it all. The data sometimes even drives real-time decisions, where latency might be the difference between life and death. Inevitably, this requires computing power to shift to the edges of the network. This Edge or Fog Computing concept is a textbook example of a hybrid cloud, where on-site (or should we call it on-board?) computing and centralized computing are grouped together into a single solution.

The network limitations are removed
The network limitations have been removed by services like AWS Direct Connect, which give you a dedicated network connection from your premises to the Amazon cloud. All big cloud providers now offer the option of a dedicated network into their cloud. Pricing for dedicated 10GbE links in metropolitan regions like New York has also dropped significantly. For under $1,000 a month you can now get a sub-millisecond fibre connection from most buildings in New York to one of the many data centers in the city.

Recovery realisation
More and more enterprises with a private cloud realise the need for a disaster recovery plan.
In the past this meant getting a second private cloud. This approach multiplies the TCO by at least a factor of 2, as twice the amount of hardware needs to be purchased, and keeping both private clouds in sync only makes disaster recovery plans more complex. Enterprises are now turning disaster recovery from a cost into an asset. They use cheap public cloud storage to store their off-site backups and copies, and by adding compute capacity in peak periods or when disaster strikes, they can bring these off-site copies online when needed. On top of that, additional business analytics can use these off-site copies without impacting the production workloads.

Standardization
Over the past years, standards in cloud computing have crystallized. In the public cloud, Amazon has set the standard for storing unstructured data. On the private infrastructure side, the OpenStack ecosystem has made significant progress in streamlining and standardizing how complete clouds are deployed. Enterprises such as Cisco are now focusing on new services to manage and orchestrate clouds in order to smooth out the last bumps in the migration between different clouds.

Storage & legacy hardware: the problem children

Based upon the previous paragraphs one might conclude that all obstacles to a move to the hybrid model have been cleared. This isn’t the case, as 2 issues still remain:

The legacy hardware problem
All current public cloud computing solutions ignore the reality that enterprises have a hardware legacy. While starting from scratch is the easiest solution, it is definitely not the cheapest. For the hybrid cloud to be successful, existing hardware must in some shape or form be able to be integrated into the hybrid cloud.

Storage roadblocks remain
If you want to make use of multiple cloud solutions, the only option you have is to store a copy of each bit of data in each cloud. This x-way replication scheme solves the issue of data being available in all cloud locations, but it does so at a high cost. Next to the storage cost, replication also adds significant latency, as writes can only be acknowledged once all locations are up to date. This means that hybrid clouds which span the east and west coasts of the US are not workable when replication is used.

Open vStorage removes those last obstacles

Open vStorage, a software-based storage solution, allows multi-datacenter block storage in a much more nimble and cost-effective way than any traditional solution. This removes the last roadblocks on the path towards hybrid cloud adoption.
Solving the storage puzzle
Instead of x-way replication, Open vStorage uses a different approach, which can be compared to solving a Sudoku puzzle. All data is chopped up into chunks, to which some parity chunks are added. All these chunks, data and parity alike, are distributed across all the nodes, datacenters and clouds in the cluster. The number of parity chunks is configurable and allows, for example, recovery from a multi-node failure or even a complete datacenter loss. A failure, whether of a disk, a node or a datacenter, crosses out some numbers from the Sudoku puzzle, but as long as you have enough numbers left, you can still solve it. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes, datacenters or clouds) left, you can always recover the data.
Unlike x-way replication, where data is only acknowledged once all copies are stored safely, Open vStorage allows data to be stored sub-optimally: writes can be acknowledged before every chunk has hit a disk. This makes sure that a single slow disk, datacenter or cloud doesn’t hold back applications and incoming writes. The approach lowers the write latency while keeping data safety at a high level.
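
To make the Sudoku analogy concrete, here is a minimal Python sketch with a single XOR parity chunk. It only illustrates the principle: the actual Open vStorage backend uses a more general erasure code with a configurable number of parity chunks, not this toy scheme.

    def encode(data: bytes, k: int) -> list:
        """Split data into k equal chunks and append one XOR parity chunk."""
        chunk_size = -(-len(data) // k)                # ceiling division
        padded = data.ljust(k * chunk_size, b"\x00")   # pad so chunks are equal
        chunks = [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
        parity = bytearray(chunk_size)
        for chunk in chunks:
            for i, byte in enumerate(chunk):
                parity[i] ^= byte
        return chunks + [bytes(parity)]

    def recover(chunks: list, lost: int) -> bytes:
        """Rebuild the chunk at index `lost` by XOR-ing all surviving chunks."""
        size = len(next(c for c in chunks if c is not None))
        rebuilt = bytearray(size)
        for idx, chunk in enumerate(chunks):
            if idx == lost:
                continue
            for i, byte in enumerate(chunk):
                rebuilt[i] ^= byte
        return bytes(rebuilt)

    # Spread 4 data chunks + 1 parity chunk; lose one "datacenter" and recover.
    pieces = encode(b"hybrid cloud block storage!!", k=4)
    original = pieces[2]
    pieces[2] = None                                   # a datacenter goes down
    assert recover(pieces, lost=2) == original         # puzzle still solvable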

Legacy hardware
Open vStorage also makes it possible to include legacy storage hardware. As Open vStorage is a software-based storage solution, it can turn any x86 hardware into a piece of the hybrid storage cloud.
Open vStorage leverages the capabilities of new media technologies like SSDs and PCIe flash, but also those of older technologies like large-capacity traditional SATA drives. For applications that need above-par performance, additional SSDs and PCIe flash cards can be added.

Summary

Hybrid cloud has long been a model that many enterprises chased without luck. Issues such as network and storage limitations and integration complexity have been major roadblocks on the hybrid cloud path. Over the last few years a lot of these roadblocks have been removed, but issues with storage and legacy hardware remained. Open vStorage overcomes these last obstacles and paves the path towards hybrid cloud adoption.

Keeping an eye on an Open vStorage cluster

As part of the commercial package, Open vStorage offers 2 options to monitor an Open vStorage cluster: either the OPS team acts as a second set of eyes, or the OPS team has the keys, is in the driving seat and has full control. In both cases these large-scale (5+ PB) Open vStorage clusters send their logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let’s have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.

Elasticsearch & Kibana

To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearch, Logstash, Kibana) stack to gather logging information and centralise it all into a single viewing pane.

The ELK-stack consists of 3 open source components:

  • Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
  • Logstash: a log pipeline tool which accepts various inputs and targets. In our case, it reads logging from a Redis queue and stores it in Elasticsearch.
  • Kibana: a visualisation tool on top of Elasticsearch.

Next to the ELK stack, Journalbeat is used to fetch the logging from all nodes of the cluster and put it onto Redis; Logstash consumes the Redis queue and stores the log messages in Elasticsearch. By aggregating all logs from a cluster into a single, unified view, detecting anomalies or finding correlations between issues becomes much easier.
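
To give an idea of what the Logstash stage does, here is a rough Python equivalent that drains the Redis queue and indexes each event into Elasticsearch. The host names and the queue/index names are invented for the example, and recent redis/elasticsearch client libraries are assumed; the production pipeline is an actual Logstash configuration, not this script.

    import json

    import redis
    from elasticsearch import Elasticsearch

    r = redis.Redis(host="monitoring-redis", port=6379)
    es = Elasticsearch("http://monitoring-es:9200")

    while True:
        # Block until Journalbeat pushes the next log event onto the Redis list.
        _, raw = r.blpop("openvstorage-logs")
        event = json.loads(raw)
        # Index the event so Kibana can query it in the unified view.
        es.index(index="openvstorage-logs", document=event)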

InfluxDB & Grafana

The many statistics that are being tracked are stored in InfluxDB, an open-source database specifically designed to handle time series data. On top of InfluxDB, Grafana is used to visualize these statistics. The dashboards give a detailed view of the performance metrics of the cluster as a whole, but also of the individual components. The statistics are provided in an aggregated view, but an OPS member can also drill down to the smallest detail, such as the individual vDisk level. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, and safety of the objects in the backend, to the number of maintenance tasks running across the cluster.
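
As an illustration of the underlying data model, a stats collector could push one time-series point per vDisk with the Python InfluxDB client roughly like this (measurement, tag and field names are invented for the example):

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="monitoring-influx", database="openvstorage")
    client.write_points([{
        "measurement": "vdisk_stats",                      # hypothetical name
        "tags": {"cluster": "prod-east", "vdisk": "vd-0042"},
        "fields": {"write_latency_ms": 1.8, "iops": 5200.0},
    }])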

CheckMK

To detect and escalate issues, the Open vStorage team uses CheckMK, an extension to the open-source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large-scale (storage) clusters. These monitoring rules include general checks such as the CPU and RAM of a host, the services, network performance and disk health, but specific checks for Open vStorage components such as the Volume Driver or Arakoon have of course also been added. The output of the healthcheck also gets parsed by the CheckMK engine. In case of issues, a fine-tuned escalation process is put into motion in order to resolve them quickly.
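
Component-specific rules can take the shape of a simple local check that CheckMK picks up. Below is a minimal sketch of such a check for Arakoon; the config path and the way the master is probed are assumptions for illustration, not the team’s actual check.

    #!/usr/bin/env python3
    """Hypothetical CheckMK local check for an Arakoon cluster.

    Local checks print one line per service: <state> <name> <perfdata> <detail>,
    where state 0 is OK, 1 is WARN and 2 is CRIT.
    """
    import subprocess

    def arakoon_has_master(cluster: str) -> bool:
        # Placeholder probe; a real check would call the Arakoon CLI/client
        # and inspect its output instead of only the exit code.
        cfg = f"/etc/arakoon/{cluster}/{cluster}.ini"
        return subprocess.call(["arakoon", "--who-master", "-config", cfg]) == 0

    state = 0 if arakoon_has_master("ovsdb") else 2
    detail = "master elected" if state == 0 else "no master elected!"
    print(f"{state} Arakoon_ovsdb - {detail}")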

Arakoon, a battle hardened key-value DB

At Open vStorage we just love Arakoon, our in-house developed key-value DB. It is always consistent and hence prefers to die rather than give you wrong data. Trust us, this is a good property if you are building a storage platform. It is also pretty fast, especially in a multi-datacenter topology. And above all, it has been battle-hardened over 7 years and is now ROCK. SOLID.

We use Arakoon almost everywhere in Open vStorage: to store the framework model, to track volume ownership and to keep track of the ALBA backend metadata. So it is time we told you a bit more about that Arakoon beast. Arakoon is developed by the Open vStorage team and its core has been made available as an open-source project on GitHub. It is already battle-proven in several of the Open vStorage solutions and in projects by technology leaders such as Western Digital and iQIYI, a subsidiary of Baidu.

Arakoon aims to be easy to understand and use, whilst at the same time taking the following features into consideration:

  • Consistency: The system as a whole needs to provide a consistent view on the distributed state. This stems from the experience that eventual consistency is too heavy a burden for a user application to manage. A simple example is the retrieval of the value for a key, where you might receive none, one or multiple values depending on the weather conditions. The next question is always: why don’t I get a result? Is it because there is no value, or merely because I currently cannot retrieve it?
  • Conditional and Atomic Updates: We don’t need full-blown transactions (they would be nice to have, though), but we do need updates that abort if the state is not what we expect it to be. So at least an atomic conditional update and an atomic multi-update are needed (see the test-and-set sketch after this list).
  • Robustness: The system must be able to cope with the failure of individual components, without concessions to consistency. However, whenever consistency can no longer be guaranteed, updates must simply fail.
  • Locality Control: When we deploy a system over 2 datacenters, we want guarantees that the entire state is indeed present in both datacenters. This is something we could not get from distributed hash tables using consistent hashing.
  • Healing & Recovery: Whenever a component dies and is subsequently revived or replaced, the system must be able to guide that component towards a situation where that node again fully participates. If this cannot be done fully automatically, then human intervention should be trivial.
  • Explicit Failure: Whenever there is something wrong, failure should propagate quite quickly.
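
To make the conditional-update point concrete, here is a minimal in-memory sketch of test-and-set semantics; it mimics the behaviour, not the Arakoon client API:

    import threading

    class KV:
        """Toy consistent store illustrating Arakoon-style conditional updates."""
        def __init__(self):
            self._data, self._lock = {}, threading.Lock()

        def test_and_set(self, key, expected, new):
            """Atomically set key to `new` iff its current value is `expected`.
            Returns the value that was in place, so callers can detect aborts."""
            with self._lock:
                current = self._data.get(key)
                if current == expected:
                    self._data[key] = new
                return current

    kv = KV()
    assert kv.test_and_set("owner", None, "node-1") is None      # applied
    assert kv.test_and_set("owner", None, "node-2") == "node-1"  # aborted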

Sounds interesting, right? Let’s share some more details on the Arakoon internals. Arakoon is a distributed key-value database. Since it is strongly consistent, it prefers to stop rather than provide outdated, faulty values, even when multiple components fail. To achieve this, Arakoon uses a variation of the Paxos algorithm. An Arakoon cluster consists of a small set of nodes that each contain the full range of key-value pairs in an internal DB. Next to this DB, each node contains the transaction log and a transaction queue. While every node of the cluster carries all the data, only one node is assigned to be the master node. The master node manages all the updates for all the clients. The nodes in the cluster vote to select the master node; as long as there is a majority to select a master, the Arakoon DB remains accessible. To make sure the Arakoon DB can survive a datacenter failure, the nodes of the cluster are spread across multiple datacenters.

The steps to store a key in Arakoon

Whenever a key is to be stored in the database, the following flow is executed:

  1. Updates to the Arakoon database are consistent. An Arakoon client always looks up the master of a cluster and then sends its request to that master. The master node of a cluster keeps a queue of all client requests. The moment a request is queued, the master node sends the request to all its slaves and writes it to its Transaction Log (TLog). When the slaves receive a request, they store it in their own TLog and send an acknowledgement to the master.
  2. The master waits for the acknowledgements of the slaves. When it has received acknowledgements from half the nodes plus one, the master pushes the key-value pair into its database. In a five-node setup (one master and four slaves), the master must receive acknowledgements from two slaves before it writes the data to its database, since it is itself also counted as a node.
  3. After having written the data to its database, the master starts on the next request in its queue. When a slave receives this new request, it first writes the previous request to its own database before handling the new one. This way a slave is always certain that the master has successfully written the previous data to its own database.
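
The essence of these three steps is “log everywhere first, apply only after a majority acknowledges”. Here is a condensed Python sketch of that rule, with elections, networking and retries left out and with invented master/slave objects:

    def handle_update(master, slaves, request):
        """Master-side write path: TLog first, majority ack, then apply."""
        master.tlog.append(request)              # step 1: master logs the request
        acks = 1                                 # the master itself counts as a node
        for slave in slaves:
            if slave.append_to_tlog(request):    # slaves log and acknowledge
                acks += 1
        majority = (1 + len(slaves)) // 2 + 1    # half the nodes plus one
        if acks >= majority:
            master.db.apply(request)             # step 2: safe to hit the database
            return "acknowledged"
        return "failed"                          # consistency beats availability

    # Step 3 happens on the slaves: they apply request N to their own database
    # once they receive request N+1 from the master.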

The benefits of using Arakoon

Scalability

Since the metadata of the ALBA backends gets sharded across different Arakoon clusters, scaling the metadata, capacity- or performance-wise, is as simple as adding more Arakoon nodes. The whole platform has been designed to store gigabytes of metadata without the metadata becoming a performance bottleneck.
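
The idea behind that sharding can be pictured as a deterministic mapping from a namespace to one of several NSM Arakoon clusters. The snippet below only illustrates the spreading principle: the cluster names are invented, and in reality the assignment happens when a namespace is created rather than by hashing on every lookup.

    import hashlib

    nsm_clusters = ["nsm_0", "nsm_1", "nsm_2"]   # grows as the metadata grows

    def cluster_for(namespace: str) -> str:
        """Deterministically spread namespaces over the available NSM clusters."""
        digest = hashlib.md5(namespace.encode()).digest()
        return nsm_clusters[int.from_bytes(digest[:8], "big") % len(nsm_clusters)]

    print(cluster_for("vpool-geo/volume-0042"))  # e.g. 'nsm_1'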

High Availability

It is quite clear that keeping the metadata safe is essential for any storage solution. Arakoon is designed to be used in highly available clusters. By default it stores 3 replicas of the metadata, but for extra resilience 5-way replication or more can also be configured. These replicas can even be stored across locations, allowing for a multi-site block storage cluster which can survive a datacenter loss.

Performance

Arakoon was designed with performance in mind. OCaml was selected as the programming language for its reliability and performance; it provides powerful and succinct concurrency (cooperative multitasking), a must in distributed environments. To further boost performance, a forced-master capability is available which makes sure metadata reads are served by local Arakoon nodes in the case of a multi-site block storage cluster. With Arakoon the master node is local, so reads have sub-millisecond latency. As a comparison, Cassandra, another distributed DB used in many projects, achieves read consistency by reading the data from multiple datacenters, which leads to a latency that is typically higher than 10 milliseconds.

Accelerated ALBA as read cache

With the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post, but I would also like to shed some light on the various reasons why we decided to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache, on every write, old data in the cache had to be overwritten, and to evict data from the cache a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while: the lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also delays read requests, as the lock prevents data from being safely read. Basically, in order to increase performance we had to move towards a lockless read cache where data isn’t updated in place.
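
Summarised in code, the old design pairs a hash table with an LRU list behind one lock, so every cache update stalls concurrent readers. A minimal sketch of that bottleneck, with Python’s OrderedDict standing in for the hash table plus linked list:

    import threading
    from collections import OrderedDict

    class LockedReadCache:
        """Old-style cache: (volume_id, lba) -> data, LRU eviction, one big lock."""
        def __init__(self, capacity: int):
            self._entries = OrderedDict()        # recency order acts as LRU list
            self._lock = threading.Lock()
            self._capacity = capacity

        def put(self, volume_id, lba, data):
            with self._lock:                     # every write blocks all readers
                self._entries[(volume_id, lba)] = data
                self._entries.move_to_end((volume_id, lba))
                if len(self._entries) > self._capacity:
                    self._entries.popitem(last=False)   # evict least recently used

        def get(self, volume_id, lba):
            with self._lock:                     # reads need the same lock
                data = self._entries.get((volume_id, lba))
                if data is not None:
                    self._entries.move_to_end((volume_id, lba))
                return data
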
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size, which greatly improves the write bandwidth to the SSDs. ALBA also makes it possible to align cores with the ASD processes and underlying SSDs. By making the whole all-flash ALBA backend core-aligned, the overhead of process switching is minimised. Basically, all operations on flash are now asynchronous, core-aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.

Lower impact of an SSD failure

By moving the read cache to the ALBA backend, the impact of an SSD failure is much lower. ALBA can perform erasure coding across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed, and the impact of an SSD failure is limited as only a fraction of the cache is lost. So in case a single SSD fails, there is no reason to go to the HDD-based capacity backend, as reads can still be fulfilled from the other fragments of the data which are still cached.

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migration wasn’t always successful due to the cold cache on the new host. With the new distributed cache approach, we now have an always-hot cache, even in case of (live) migrations.

We hope the above reasons prove that we took the right decision by moving the read cache to the ALBA backend. Want to see how to configure the ALBA read cache? Check out this GeoScale demo.

The Edge, a lightweight block device

When I present the new Open vStorage architecture for Fargo, I almost always receive the following question about the Edge:

What is the Edge and why did you develop it?

What is the Edge about?

The Edge is a lightweight software component which can be installed on a Linux host. It exposes a block device API and connects to the Storage Router across the network (TCP/IP or RDMA). Basically, the application believes it talks to a local block device (the Edge), while the volume actually runs on another host (the Storage Router).
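
Conceptually the Edge client boils down to “read/write this many bytes at this offset” sent over the network. The sketch below illustrates that contract with an invented wire format and hypothetical host/port; the real Edge speaks the Volume Driver’s own protocol over TCP/IP or RDMA, not this one.

    import socket
    import struct

    class EdgeClient:
        """Toy block-device client: fixed header + payload per request."""
        def __init__(self, host: str, port: int):
            self._sock = socket.create_connection((host, port))

        def write(self, offset: int, data: bytes):
            # Invented header: opcode (1 = write), byte offset, payload length.
            self._sock.sendall(struct.pack("!BQI", 1, offset, len(data)) + data)

        def read(self, offset: int, length: int) -> bytes:
            self._sock.sendall(struct.pack("!BQI", 0, offset, length))
            buf = b""
            while len(buf) < length:
                buf += self._sock.recv(length - len(buf))
            return buf

    # The application sees a local block device, while the volume actually
    # runs on a Storage Router elsewhere on the network:
    # client = EdgeClient("storage-router-01", 26203)  # hypothetical host/port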

Why did we develop the Edge?

The reason why we have developed the Edge is quite simple: componentization. With Open vStorage we are mainly dealing with large, multi-petabyte deployments and having this Edge component gives additional benefits in large environments:

Scalability

In large environments you want to be able to scale the compute and storage parts independently. In case you run Open vStorage hyper-converged, as advised with earlier versions, this isn’t possible. The consequence was that if you needed more RAM or CPU to run VMs, you also had to invest in more SSDs. With the Edge you can scale compute and storage independently.

Guaranteed performance

With Eugene, the Volume Driver, the high-performance distributed block layer, was running on the compute host together with the VMs. This resulted in the VMs and the Volume Driver fighting for the same CPU and RAM resources, a typical issue with hyper-converged solutions. The Edge component avoids this problem: it runs on the compute hosts (and requires only a small amount of resources) while the Volume Driver runs on dedicated nodes, and hence provides a predictable and consistent amount of IOPS to the VMs.

Limit the Impact of Updates

Storage software updates are a (storage) administrator’s worst nightmare. In previous Open vStorage versions, an update of the Volume Driver required all VMs on that node to be migrated or brought down. With the Edge, the Volume Driver can be updated in the background, as each Edge/compute host has HA features and can automatically connect to another Volume Driver on request, without the need for a VM migration.

Fargo RC3

We released Fargo RC3. This release focuses on bug fixing (13 bugs fixed) and stability.

Some items were also added to improve the supportability of an Open vStorage cluster:

  • Improved the speed of the non-cached API and GUI queries by a factor of 10 to 30.
  • It is now possible to add more NSM clusters to store the data for a backend through an API instead of doing it manually.
  • Blocked setting a clone as a template.
  • Hardened the remove-node procedure.
  • Removed ETCD support for the config management as it was no longer maintained.
  • Added an indicator in the GUI which displays when a domain is set as a recovery domain but not as a primary domain anywhere in the cluster.
  • Support for the removal of the ASD manager.
  • Added a call to list the manually started jobs (e.g. verify namespace) on ALBA.
  • Added a timestamp to list-asds so it can be tracked how long an ASD has already been part of the backend.
  • Removed the Volume Driver test which created a new volume in the Health Check, as it created too many false positives to be used reliably.

The different ALBA Backends explained

With the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the geoscale functionality. In a geoscale cluster, the data is spread over multiple datacenters. If one of the datacenters goes offline, the geoscale cluster stays up and running and continues to serve data.

The geoscale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collection of physical disks, devices or even other backends. Next to grouping disks or backends, it also defines how data is stored on its constituents: parameters such as erasure coding/replication factor, compression and encryption need to be defined. Ordinarily a geoscale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, only had 1 type of backend, there are now 2 types: local and global backends.

  • A local backend allows grouping physical devices. This type is typically used to group disks within the same datacenter.
  • A global backend allows combining multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters.

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation happens based upon the performance of the devices in the datacenter: an SSD backend will be created with devices which are fast and low-latency, and an HDD backend will be created with slow(er) devices which are optimised for capacity. In some cases the SSD or HDD backend will be split into more backends if they contain many devices, for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend, as the toy example below shows.
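
As a toy example of that every-x-th-disk split (disk and backend names invented):

    # Split a 24-disk node over 3 backends by taking every 3rd disk.
    disks = [f"node01-disk{i:02d}" for i in range(24)]
    backends = {f"backend-{b}": disks[b::3] for b in range(3)}
    # Losing node01 now costs each backend 8 disks instead of one backend 24.
    assert all(len(members) == 8 for members in backends.values())
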
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. Of course, this doesn’t make sense for an SSD backend, as the latency between the datacenters will be a performance-limiting factor.
Another reason to create multiple backends is if you want to offer each customer its own set of physical disks for security or compliance reasons. In that case a backend is created per customer.

vPool

A vPool is a configuration template for vDisks, the volumes being served by Open vStorage. This template contains a whole range of parameters, such as the block size to be used, the SCO size on the backend, the default write buffer size, the preset to be used for data protection, the hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. The last 2 are particularly interesting as they express how different ALBA backends are tied together. When you create a vPool, you select a backend to store the volume data. This can be a local backend, SSD for an all-flash experience, or a global backend in case you want to spread data over multiple datacenters. This backend is used by every Storage Router serving the vPool. If you use a global backend across multiple datacenters, you will want some sort of caching in the local datacenter where the volume is running, in order to keep the read latency as low as possible. To achieve this, assign a local SSD backend when extending a vPool to a certain Storage Router. All volumes served by that Storage Router will, on a read, first check if the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use different cache backends. This approach keeps hot data in the local SSD cache and stores cold data on the capacity backend which is distributed across datacenters. By using this approach, Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.
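
Pulling the above together, you can picture a vPool as a parameter set along these lines (field names and values are illustrative, not the literal Open vStorage configuration schema):

    vpool_template = {
        "name": "vpool-geo",
        "block_size": 4096,            # bytes per block served to the vDisk
        "sco_size": 64 * 2**20,        # 64 MiB SCOs written to the backend
        "write_buffer": 512 * 2**20,   # default write buffer per vDisk
        "backend": "global-hdd",       # capacity backend spanning datacenters
        "preset": "ec-4-2",            # erasure-coding preset for data protection
        "cache_per_storage_router": {  # local SSD backend used as read cache
            "dc-east": "ssd-east",
            "dc-west": "ssd-west",
        },
    }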

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, groupings of vDisks which share the same configuration, are the glue between these different backends.

Interview With Bob Griswold, Chairman, Open vStorage

The website Storage Newsletter did an interview with our own Bob Griswold, Chairman of Open vStorage. In this Q&A, Bob answered various questions about what kind of storage product Open vStorage is, our vision, our open-source involvement, our business model and many more interesting topics.

Read the complete interview here.

Fargo RC2

We released Fargo RC2. The biggest new items in this release:

  • Multiple performance improvements, such as multiple proxies per Volume Driver (the default amount is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (trying to read from ASDs in the same datacenter instead of going over the network to another datacenter).
  • API to limit the amount of data that gets loaded into the memory of the Volume Driver host. Instead of loading all metadata of a vDisk into RAM, you can now specify the percentage of RAM it can take.
  • Counter which keeps track of the number of invalid checksums per ASD, so we can flag bad ASDs faster.
  • Configured the scrub proxy to be cache-on-write.
  • Implemented timeouts for the Volume Driver calls.

The team also solved 110 issues between RC1 and RC2. An overview of the complete content can be found here: Added Features | Added Improvements | Solved Bugs

File Storage blogpost: impressive and probably one of the most comprehensive

File Storage, one of the leading blogs about storage, is featuring Open vStorage in one of its latest blog posts. You can read the full blog post here. We believe they are 100% correct in their conclusion:

The Open vStorage solution is really impressive and is probably one of the most comprehensive in its category.

Just a small note: we are not confidential; rather, we are conservative and hence not well known yet. It takes years to build and stabilize a storage system of the scale we’ve built with Open vStorage!