Fargo GA

After 3 Release Candidates and extensive testing, the Open vStorage team is proud to announce the GA (General Availability) release of Fargo. This release is packed with new features. Allow us to give a brief overview:

NC-ECC presets (global and local policies)

NC-ECC (Network Connected-Error Correction Code) is an algorithm to store Storage Container Objects (SCOs) safely across multiple data centers. It consists of a global preset, which spans data centers, and multiple local presets, each within a single data center. The NC-ECC algorithm is based on forward error correction codes and is further optimized for a multi-data-center approach. When there is a disk or node failure, additional chunks are created using only data from within the same data center. This ensures the bandwidth between data centers isn't stressed by a simple disk failure.
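
To make the global/local split concrete, below is a minimal sketch of what such a layered preset could look like as a data structure. The field names (k, m, data_centers) are hypothetical illustrations, not the actual ALBA preset schema.

    # Hypothetical sketch of an NC-ECC style preset hierarchy; "k" = data
    # chunks, "m" = parity chunks. Names are illustrative only.

    global_preset = {
        "name": "global",
        "k": 2,                      # data chunks spread across data centers
        "m": 1,                      # survive the loss of one data center
        "data_centers": ["dc1", "dc2", "dc3"],
    }

    local_presets = {
        dc: {
            "name": f"local-{dc}",
            "k": 4,                  # data chunks spread over local disks
            "m": 2,                  # survive two local disk/node failures
        }
        for dc in global_preset["data_centers"]
    }

    # On a disk failure, repair uses the local preset only, so no chunks
    # cross the inter-data-center links.
    print(local_presets["dc1"])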

Multi-level ALBA

The ALBA backend now supports different levels. An all-SSD ALBA backend can be used as a performance layer in front of the capacity tier. Data is evicted from the cache layer using a random eviction or Least Recently Used (LRU) strategy.
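
The two eviction strategies trade bookkeeping cost against hit rate; the sketch below (plain Python, not ALBA code) illustrates the difference.

    import random
    from collections import OrderedDict

    class LRUCache:
        """Evicts the entry that was used least recently."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def get(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)     # mark as most recently used
                return self.entries[key]
            return None

        def put(self, key, value):
            if key in self.entries:
                self.entries.move_to_end(key)
            elif len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[key] = value

    class RandomEvictionCache:
        """Evicts a random entry; cheaper bookkeeping, no access ordering."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = {}

        def put(self, key, value):
            if key not in self.entries and len(self.entries) >= self.capacity:
                del self.entries[random.choice(list(self.entries))]
            self.entries[key] = value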

Open vStorage Edge

The Open vStorage Edge is a lightweight block driver which can be installed on Linux hosts and connects to the Volume Driver over the network (TCP/IP). By splitting the Volume Driver and the Edge into separate components, compute and storage can scale independently.

Performance optimized Volume Driver

By limiting the size of a volume's metadata, the metadata now fits completely in RAM. To keep the metadata at an absolute minimum, deduplication was removed. You can read more about why we removed deduplication here. Other optimizations include multiple proxies per Volume Driver (the default is 2), bypassing the proxy and going straight from the Volume Driver to the ASD for partial reads, and local read preference for global backends (trying to read from ASDs in the same data center instead of going over the network to another data center).
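
A rough sketch of that read-path decision, with hypothetical names (Asd, read_fragment, proxy_read) since the actual Volume Driver internals are not shown here:

    from dataclasses import dataclass

    @dataclass
    class Asd:
        """Stand-in for an ASD holding fragments; names are hypothetical."""
        datacenter: str
        fragments: dict   # fragment_id -> bytes

        def read(self, fragment_id):
            return self.fragments[fragment_id]

    def read_fragment(fragment_id, local_dc, asds, proxy_read):
        """Prefer a direct read from an ASD in the local data center
        (the partial-read bypass); otherwise fall back to the proxy."""
        for asd in asds:
            if asd.datacenter == local_dc and fragment_id in asd.fragments:
                return asd.read(fragment_id)
        return proxy_read(fragment_id)

    # Usage: one local and one remote ASD; the local one wins.
    asds = [Asd("dc1", {"f1": b"data"}), Asd("dc2", {"f1": b"data"})]
    print(read_fragment("f1", "dc1", asds, lambda f: b"via-proxy"))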

Multiple ASDs per device

For low-latency devices, adding multiple ASDs per device provides higher bandwidth to the device.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge. With Fargo all config files are now stored in a distributed config management system on top of our distributed database, Arakoon. More info can be found here.
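
Conceptually, every node reads its settings from the same key-value tree instead of from local files. A minimal sketch of that idea follows, with a plain dict standing in for the Arakoon-backed store and hypothetical key names:

    import json

    # A plain dict standing in for the Arakoon-backed distributed config store.
    config_store = {}

    def set_config(key: str, value):
        config_store[key] = json.dumps(value)

    def get_config(key: str):
        return json.loads(config_store[key])

    # Every node reads the same tree, so updating one key updates the cluster.
    set_config("/ovs/framework/logging", {"level": "INFO"})   # hypothetical key
    print(get_config("/ovs/framework/logging"))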

Ubuntu 16.04

Open vStorage is now supported on Ubuntu 16.04, the latest Long Term Support (LTS) version of Ubuntu.

Smaller features in Fargo:

  • Improved the speed of non-cached API and GUI queries by a factor of 10 to 30.
  • Hardened the remove node procedure.
  • The GUI is adjusted to better highlight clusters which are spread across multiple sites.
  • The failure domain concept has been replaced by tag based domains. ASD nodes and storage routers can now be tagged with one or more tags. Tags can be used to identify a rack, site, power feed, etc.
  • 64TB volumes.
  • Browsable API with Swagger.
  • ‘asd-manager collect logs’, identical to ‘ovs collect logs’.
  • Support for the removal of the asd-manager packages.

Since this Fargo release introduces a completely new architecture (you can read more about it here), there is no upgrade path between Eugene and Fargo. The full release notes can be found here.

Hybrid cloud, the phoenix of cloud computing

Introduction


Hybrid cloud, an integration between on-site private clouds and public clouds, has been declared dead many times over the past few years, but like a phoenix it keeps resurrecting in the yearly IT technology and industry forecasts.

Limitations, hurdles and issues

Let’s first have a look at the numerous reasons why the hybrid cloud computing trend hasn’t taken off (yet):

  • Network limitations: connecting to a public cloud was often cumbersome as it required all traffic to go over slow, high-latency public internet links.
  • Storage hurdles: implementing a hybrid cloud approach means storing data multiple times and keeping these multiple copies in sync.
  • Integration complexity: each cloud, whether private or public, has its own interface and standards, which makes integration unnecessarily difficult and complex.
  • Legacy IT: existing on-premises infrastructure is a reality and holds back a move to the public cloud. Next to the infrastructure component, applications were not built or designed in such a way that you can scale them up and down, nor are they designed to store their data in an object store.

Taking the above into account, it shouldn't come as a surprise that many enterprises saw public cloud computing as a check-in at Hotel California. The technical difficulties and the cost and risk of moving back and forth between clouds were just too big. But times are changing. According to McKinsey & Company, a leading management consulting firm, enterprises are planning to transition IT workloads to a hybrid cloud infrastructure at a significant rate and pace over the next 3 years.

Hybrid cloud (finally) taking off

I see a couple of reasons why the hybrid cloud approach is finally taking off:

Edge computing use case
Smart ‘devices’ such as self-driving cars produce such large amounts of data that they can’t rely on public clouds to process it all. The data sometimes even drives real-time decisions where latency might be the difference between life and death. Consequently, computing power has to shift to the edges of the network. This Edge or Fog Computing concept is a textbook example of a hybrid cloud where on-site, or should we call it on-board, computing and centralized computing are grouped together into a single solution.

The network limitations are removed
The network limitations have been removed by services like AWS Direct Connect, which gives you a dedicated network connection from your premises to the Amazon cloud. All big cloud providers now offer the option of a dedicated network into their cloud. Pricing for dedicated 10GbE links in metropolitan regions like New York has also dropped significantly. For under $1,000 a month you can now get a sub-millisecond fibre connection from most buildings in New York to one of the many data centers in the area.

Recovery realisation
More and more enterprises with a private cloud realise the need for a disaster recovery plan.
In the past this meant getting a second private cloud. That approach multiplies the TCO by at least a factor of 2, as twice the amount of hardware needs to be purchased, and keeping both private clouds in sync makes disaster recovery plans only more complex. Enterprises are now turning disaster recovery from a cost into an asset. They use cheap public cloud storage to store their off-site backups and copies, and by adding compute capacity in peak periods or when disaster strikes, they can bring these off-site copies online when needed. On top of that, additional business analytics can use these off-site copies without impacting production workloads.

Standardization
Over the past years, standards in cloud computing have crystallized. In the public cloud, Amazon has set the standard for storing unstructured data. On the private infrastructure side, the OpenStack ecosystem has made significant progress in streamlining and standardizing how complete clouds are deployed. Enterprises such as Cisco are now focusing on new services to manage and orchestrate clouds in order to smooth out the last bumps in the migration between different clouds.

Storage & legacy hardware: the problem children

Based upon the previous paragraphs, one might conclude that all obstacles to moving to the hybrid model have been cleared. This isn't the case, as two issues still stick out:

The legacy hardware problem
All current public cloud computing solutions ignore the reality that enterprises have a hardware legacy. While starting from scratch is the easiest solution, it is definitely not the cheapest. In order for the hybrid cloud to be successful, existing hardware must in some shape or form be able to be integrated into the hybrid cloud.

Storage roadblocks remain
If you want to make use of multiple cloud solutions, the only option is to store a copy of each bit of data in each cloud. This x-way replication scheme solves the issue of data being available in all cloud locations, but at a high cost. Next to the storage cost, replication also adds significant latency, as writes can only be acknowledged once all locations are up to date. This means that, with replication, hybrid clouds spanning the US east and west coasts are not workable.

Open vStorage removes those last obstacles

Open vStorage, a software-based storage solution, allows multi-datacenter block storage in a much more nimble and cost-effective way than any traditional solution. This way it removes the last roadblocks towards hybrid cloud adoption.
Solving the storage puzzle
Instead of x-way replication, Open vStorage uses a different approach, which can be compared to solving a Sudoku puzzle. All data is chopped up into chunks, and some additional parity chunks are added. All these chunks, data and parity alike, are distributed across all the nodes, data centers and clouds in the cluster. The number of parity chunks is configurable and allows, for example, recovery from a multi-node failure or a complete data center loss. A failure, whether of a disk, a node or a data center, will cross out some numbers from the Sudoku puzzle, but as long as you have enough numbers left, you can still solve it. The same goes for data stored with Open vStorage: as long as enough chunks (disks, nodes, data centers or clouds) are left, you can always recover the data.
Unlike x-way replication, where data is only acknowledged once all copies are stored safely, Open vStorage allows data to be stored sub-optimally. The big advantage is that writes can be acknowledged before all data chunks have been written to disk, so a single slow disk, data center or cloud doesn't hold back applications and incoming writes. This approach lowers the write latency while keeping data safety at a high level.
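A toy illustration of the chunk-and-parity idea: the minimal sketch below uses a single XOR parity chunk, whereas the actual Open vStorage encoding adds several parity chunks to tolerate multiple failures.

    from functools import reduce

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(data: bytes, k: int):
        """Split data into k equal chunks and append one XOR parity chunk.
        (Real schemes add several parity chunks to survive multiple failures.)"""
        size = -(-len(data) // k)    # ceiling division
        chunks = [data[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
        parity = reduce(xor_bytes, chunks)
        return chunks + [parity]

    def recover(chunks, lost_index):
        """Rebuild the single missing chunk from the survivors, like filling
        in the last number of a Sudoku row."""
        survivors = [c for i, c in enumerate(chunks) if i != lost_index]
        return reduce(xor_bytes, survivors)

    # Usage: spread 4 data chunks + parity over 5 locations, lose one, recover.
    stored = encode(b"hybrid cloud data", k=4)
    lost = 2
    stored[lost] = None
    print(recover(stored, lost))   # == the original chunk 2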

Legacy hardware
Open vStorage also allows legacy storage hardware to be included. As Open vStorage is a software-based storage solution, it can turn any x86 hardware into a piece of the hybrid storage cloud.
Open vStorage leverages the capabilities of new media technologies like SSDs and PCI-e flash, but also those of older technologies like large-capacity traditional SATA drives. For applications that need above-par performance, additional SSDs and PCI-e flash cards can be added.

Summary

Hybrid cloud has long been a model chased by many enterprises without any luck. Issues such as network and storage limitations and integration complexity have been major roadblocks on the hybrid cloud path. Over the last few years a lot of these roadblocks have been removed, but issues with storage and legacy hardware remained. Open vStorage overcomes these last obstacles and paves the path towards hybrid cloud adoption.

Keeping an eye on an Open vStorage cluster

As part of the commercial package, Open vStorage offers 2 options to monitor an Open vStorage cluster: either the OPS team acts as a second set of eyes, or the OPS team has the keys, is in the driving seat and has full control. In both cases these large scale (5+ PB) Open vStorage clusters send their logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let's have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.

Elasticsearch & Kibana

To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearch, Logstash, Kibana) stack to gather logging information and centralise all this information into a single viewing pane.

The ELK-stack consists of 3 open source components:

  • Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
  • Logstash: a log pipeline tool which accepts various inputs and outputs. In our case, it reads log messages from a Redis queue and stores them in Elasticsearch.
  • Kibana: a visualisation tool on top of Elasticsearch.

Next to the ELK stack, Journalbeat is used to fetch the logging from all nodes of the cluster and push it onto Redis. Logstash consumes the Redis queue and stores the log messages in Elasticsearch. By aggregating all logs from a cluster into a single, unified view, detecting anomalies or finding correlations between issues becomes easier.
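
For illustration, here is a minimal sketch of the Redis-to-Elasticsearch leg of that pipeline in Python, using the redis and elasticsearch client libraries. The queue and index names are made up, and in the real setup Logstash does this job.

    import json
    import redis
    from elasticsearch import Elasticsearch

    r = redis.Redis(host="localhost", port=6379)
    es = Elasticsearch("http://localhost:9200")

    QUEUE = "ovs-logs"    # hypothetical queue name fed by Journalbeat

    def ship_one_message():
        """Pop one log message from the Redis queue and index it,
        mimicking what Logstash does in the real pipeline."""
        _, raw = r.blpop(QUEUE)           # blocks until a message arrives
        event = json.loads(raw)
        es.index(index="ovs-logs", document=event)

    # A producer (Journalbeat in the real setup) would push messages like:
    # r.rpush(QUEUE, json.dumps({"host": "node1", "message": "volume created"}))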

InfluxDB & Grafana

The many statistics that are tracked are stored in InfluxDB, an open source database specifically designed to handle time series data. On top of InfluxDB, Grafana is used to visualize these statistics. The dashboards give a detailed view of the performance metrics of the cluster as a whole, but also of the individual components. The statistics are provided in an aggregated view, but an OPS member can also drill down to the smallest detail, such as the individual vDisk level. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, and safety of the objects in the backend, to the number of maintenance tasks running across the cluster.
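
As an illustration, writing such a metric point with the Python influxdb client (1.x API) could look like the sketch below; the measurement and tag names are hypothetical, not the actual schema.

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="ovs_metrics")

    # One time-series point: vDisk latency, tagged so Grafana can drill down
    # from cluster-wide aggregates to an individual vDisk.
    point = {
        "measurement": "vdisk_latency",    # hypothetical measurement name
        "tags": {"cluster": "prod", "vdisk": "vd-042"},
        "fields": {"read_ms": 1.8, "write_ms": 2.6},
    }
    client.write_points([point])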

CheckMK

To detect and escalate issues, the Open vStorage team uses CheckMK, an extension to the open source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large scale (storage) clusters. These monitoring rules include general checks such as the CPU and RAM of a host, the services, network performance and disk health, but of course specific checks for Open vStorage components such as the Volume Driver or Arakoon have also been added. The output of the healthcheck also gets parsed by the CheckMK engine. In case of issues, a fine-tuned escalation process is put into motion in order to resolve them quickly.
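
A custom check like the ones described above can be added as a CheckMK "local check": a script that prints one status line per service. A minimal sketch follows; the service name and thresholds are made up, not shipped monitoring rules.

    #!/usr/bin/env python3
    # Minimal CheckMK local check sketch: prints one line per service in the
    # "<state> <service> <perfdata> <detail>" format (0=OK, 1=WARN, 2=CRIT).

    def check_arakoon_lag(lag_ms: float):
        if lag_ms < 100:
            state, text = 0, "OK"
        elif lag_ms < 500:
            state, text = 1, "WARN"
        else:
            state, text = 2, "CRIT"
        print(f"{state} Arakoon_Replication lag_ms={lag_ms} {text} - lag {lag_ms} ms")

    check_arakoon_lag(42.0)   # -> "0 Arakoon_Replication lag_ms=42.0 OK - lag 42.0 ms"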