NSM and ABM, Arakoon teamwork

In an earlier post we shed some light on Arakoon, our own always-consistent distributed key-value database. Arakoon is used in many parts of the Open vStorage platform. One of its use cases is storing the metadata of the native ALBA object store. Do note that ALBA is NOT a general purpose object store but is specifically crafted and optimized for Open vStorage. ALBA uses a collection of Arakoon databases to store where and how objects are stored on the disks in the backend. Typically the SCOs and TLogs of each vDisk end up in a separate bucket, a namespace, on the backend. For each object in the namespace there is a manifest that describes where and how the object is stored on the backend. To glue the namespaces, the manifests and the disks in the backend together, ALBA uses 2 types of Arakoon databases: the ALBA Backend Manager (ABM) and one or more NameSpace Managers (NSMs).

ALBA Manager

The ALBA Manager (ABM) is the entry point for all ALBA clients which want to store or retrieve something from the backend. The ALBA Manager DB knows which physical disks belong to the backend, which namespaces exist and on which NSM hosts they can be found.
To optimize the Arakoon DB it is loaded with the albamgr plugin, a collection of ABM-specific user functions. Typically there is only a single ABM per cluster.
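
As a rough mental model, the ABM can be seen as a set of mappings kept consistent by Arakoon: the disks in the backend, the namespaces and the NSM hosts serving them. The Python sketch below is purely illustrative; the class and field names are made up for this post and do not reflect the actual albamgr schema.

```python
from dataclasses import dataclass, field

@dataclass
class NsmHost:
    cluster_id: str            # Arakoon cluster id of the NSM
    endpoints: list[str]       # addresses of the nodes in that cluster

@dataclass
class AlbaManagerModel:
    """Illustrative view of the bookkeeping the ABM is responsible for."""
    osds: dict[str, dict] = field(default_factory=dict)          # disk id -> disk info
    nsm_hosts: dict[str, NsmHost] = field(default_factory=dict)  # nsm id -> NSM cluster
    namespaces: dict[str, str] = field(default_factory=dict)     # namespace -> nsm id

    def locate_namespace(self, name: str) -> NsmHost:
        """Which NSM cluster holds the manifests for this namespace?"""
        return self.nsm_hosts[self.namespaces[name]]
```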

NSM

A NameSpace Manager (NSM) is an Arakoon cluster which holds the manifests for the namespaces assigned to it. Which NSM manages which namespaces is registered with the ALBA Manager. The NSM host also offers the remote API that is used to manipulate most of the object metadata during normal operation. Its coordinates can be retrieved from the ALBA Manager by (proxy) clients and maintenance agents.

To optimize the Arakoon DB it is loaded with the nsm_host plugin, a collection of NSM-host-specific user functions. Typically there are multiple NSM clusters for a single ALBA backend, which allows the backend to scale both in capacity and in performance.
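
The scaling idea is simply that newly created namespaces can be spread over the available NSM clusters. A naive illustration of such a placement decision is shown below; the real albamgr logic is more involved (it can take load and capacity into account), so treat this as a sketch only.

```python
def pick_nsm_for_namespace(namespace: str, nsm_ids: list[str]) -> str:
    """Naive placement: spread namespaces over the available NSM clusters.
    Illustrative only; not the actual albamgr placement algorithm."""
    return nsm_ids[hash(namespace) % len(nsm_ids)]

# Example: the SCO/TLog namespaces of three vDisks land on different NSM clusters
nsm_ids = ["nsm_0", "nsm_1", "nsm_2"]
for ns in ["vdisk-1", "vdisk-2", "vdisk-3"]:
    print(ns, "->", pick_nsm_for_namespace(ns, nsm_ids))
```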

IO requests

Let’s have a look at the IO path. Whenever the Volume Driver needs to store an object on the backend, a SCO or a TLog, it hands the object to one of the ALBA proxies on the same host. The ALBA proxy contains an ALBA client which communicates with the ABM to know on which NSM and disks it can store the object. Once the object is stored on the disks, the manifest with the metadata is registered in the NSM. For performance reasons the fragments of the object and the manifest can be cached by the ALBA proxy.
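
In pseudo-Python the write path looks roughly as follows. All object and method names (encode, store_fragment, register_manifest, ...) are assumptions made for the sake of the example, not the actual proxy API.

```python
def store_object(proxy, abm, namespace: str, object_name: str, data: bytes):
    """Illustrative write path of the ALBA proxy (not the real implementation)."""
    nsm = abm.locate_namespace(namespace)        # which NSM manages this namespace?
    osds = abm.pick_osds(namespace)              # which disks may receive fragments?
    fragments = proxy.encode(data)               # erasure code the SCO/TLog into fragments
    locations = [proxy.store_fragment(osd, frag)
                 for osd, frag in zip(osds, fragments)]
    manifest = {"object": object_name, "fragments": locations}
    nsm.register_manifest(namespace, manifest)   # the metadata ends up in the NSM
    proxy.cache(object_name, fragments, manifest)  # fragments and manifest may be cached
```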

In case the Volume Driver needs data from the backend, because it is no longer in the write buffer, it asks the proxy to fetch the exact data by requesting a SCO location and offset. If the right fragments are in the fragment cache, the proxy returns the data immediately to the Volume Driver. Otherwise the proxy uses the manifest from its cache or, if the manifest isn’t cached, contacts the ABM to find the right NSM and retrieves the manifest from there. Based upon the manifest the ALBA client fetches the data it needs from the physical disks and provides it to the Volume Driver.
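
The read path can be summarised in the same style; again, the method names are illustrative assumptions rather than the real client interface.

```python
def read(proxy, abm, namespace: str, sco_name: str, offset: int, length: int) -> bytes:
    """Illustrative read path for a 'SCO location + offset' request."""
    data = proxy.fragment_cache_lookup(sco_name, offset, length)
    if data is not None:
        return data                                   # served from the fragment cache
    manifest = proxy.manifest_cache_lookup(sco_name)
    if manifest is None:
        nsm = abm.locate_namespace(namespace)         # ask the ABM for the right NSM
        manifest = nsm.get_manifest(namespace, sco_name)
    fragments = proxy.fetch_fragments(manifest, offset, length)  # read from the ASDs
    return proxy.decode(fragments, offset, length)
```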

Fargo GA

After 3 Release Candidates and extensive testing, the Open vStorage team is proud to announce the GA (General Availability) release of Fargo. This release is packed with new features. Allow us to give a small overview:

NC-ECC presets (global and local policies)

NC-ECC (Network Connected-Error Correction Code) is an algorithm to store Storage Container Objects (SCOs) safely in multiple data centers. It consists of a global preset (across data centers) and multiple local presets (within a single data center). The NC-ECC algorithm is based on forward error correction codes and is further optimized for use across multiple data centers. When there is a disk or node failure, additional chunks will be created using only data from within the same data center. This ensures the bandwidth between data centers isn’t stressed in case of a simple disk failure.
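
To make the global/local split concrete, the snippet below shows what such a pair of presets could conceptually look like. The field names are invented for this example and are not the actual ALBA preset schema.

```python
# Illustrative only: not the real ALBA preset format.
global_preset = {
    "name": "across-datacenters",
    "policy": {"data_fragments": 3, "parity_fragments": 2},  # spread over the data centers
    "compression": "snappy",
    "encryption": "none",
}

local_preset = {
    "name": "within-datacenter",
    "policy": {"data_fragments": 8, "parity_fragments": 4},  # spread over local disks
    "compression": "snappy",
    "encryption": "none",
}

# On a single disk failure, the missing fragments are rebuilt from the local 8+4
# policy, so the links between the data centers are not stressed.
```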

Multi-level ALBA

The ALBA backend now supports different levels. An all-SSD ALBA backend can be used as a performance layer in front of the capacity tier. Data is removed from the cache layer using a random eviction or Least Recently Used (LRU) strategy.
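
As an illustration of the two eviction strategies, here is a toy cache in Python. It only demonstrates the random versus LRU behaviour; the real cache layer works on fragments stored on SSD, not on an in-memory dict.

```python
import random
from collections import OrderedDict

class CacheEvictor:
    """Toy eviction strategies for a fixed-size cache layer (illustrative only)."""
    def __init__(self, capacity: int, strategy: str = "lru"):
        self.capacity = capacity
        self.strategy = strategy
        self.entries = OrderedDict()   # key -> value, ordered by recency

    def get(self, key):
        if key in self.entries and self.strategy == "lru":
            self.entries.move_to_end(key)          # mark as most recently used
        return self.entries.get(key)

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.capacity:
            if self.strategy == "lru":
                self.entries.popitem(last=False)   # evict the least recently used entry
            else:
                victim = random.choice(list(self.entries))
                del self.entries[victim]           # random eviction
        self.entries[key] = value
        if self.strategy == "lru":
            self.entries.move_to_end(key)
```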

Open vStorage Edge

The Open vStorage Edge is a lightweight block driver which can be installed on Linux hosts and connects with the Volume Driver over the network (TCP/IP). By splitting the Volume Driver and the Edge into separate components, compute and storage can scale independently.

Performance optimized Volume Driver

By limiting the size of a volume’s metadata, the metadata now fits completely in RAM. To keep the metadata at an absolute minimum, deduplication was removed. You can read more about why we removed deduplication here. Other optimizations are multiple proxies per Volume Driver (the default is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same data center instead of going over the network to another data center).
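
The local read preference boils down to ordering the candidate fragment locations so that ASDs in the volume's own data center are tried first. A minimal sketch, with an assumed 'datacenter' field on each location:

```python
def order_fragment_locations(locations: list[dict], local_dc: str) -> list[dict]:
    """Prefer fragment copies stored on ASDs in the same data center.
    Illustrative only; the 'datacenter' field name is an assumption."""
    return sorted(locations, key=lambda loc: loc["datacenter"] != local_dc)

# Example: the dc1 location is tried before the remote dc2 one
locations = [
    {"asd": "asd-7", "datacenter": "dc2"},
    {"asd": "asd-3", "datacenter": "dc1"},
]
print(order_fragment_locations(locations, local_dc="dc1"))
```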

Multiple ASDs per device

For low-latency devices, adding multiple ASDs per device provides higher bandwidth to the device.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge. With Fargo all config files are now stored in a distributed config management system on top of our distributed database, Arakoon. More info can be found here.
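
Conceptually, a configuration file simply becomes a key in the distributed store. The helper below is a hedged sketch of that idea; the key layout and the client interface (get/set on string keys) are assumptions for the example, not the actual Open vStorage configuration API.

```python
import json

CONFIG_KEY = "/ovs/framework/hosts/node1/storagedriver.json"  # hypothetical key layout

def load_config(kv_client, key: str) -> dict:
    """Fetch a JSON config document from the distributed key-value store."""
    raw = kv_client.get(key)          # assumed: the client exposes get/set on string keys
    return json.loads(raw)

def save_config(kv_client, key: str, config: dict) -> None:
    """Store an updated config document so every node sees the same version."""
    kv_client.set(key, json.dumps(config, indent=2))
```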

Ubuntu 16.04

Open vStorage is now supported on Ubuntu 16.04, the latest Long Term Support (LTS) version of Ubuntu.

Smaller features in Fargo:

  • Improved the speed of the non-cached API and GUI queries by a factor of 10 to 30.
  • Hardened the remove node procedure.
  • The GUI is adjusted to better highlight clusters which are spread across multiple sites.
  • The failure domain concept has been replaced by tag based domains. ASD nodes and storage routers can now be tagged with one or more tags. Tags can be used to identify a rack, site, power feed, etc.
  • 64TB volumes.
  • Browsable API with Swagger.
  • ‘asd-manager collect logs’, identical to ‘ovs collect logs’.
  • Support for the removal of the asd-manager packages.

Since this Fargo release introduces a completely new architecture (you can read more about it here) there is no upgrade possible between Eugene and Fargo. The full release notes can be found here.

Accelerated ALBA as read cache

With the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post but I would also like to shed some light on the various reasons why we took the decision to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache – on every write – old data in the cache had to be overwritten. In order to evict data from the cache a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while. The lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also causes delays for read requests as the lock prevents data from being read safely. Basically, in order to increase the performance we had to move towards a lockless read cache where data isn’t updated in place.
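
For clarity, the sketch below mimics the old design: a single lock protects both the (volume ID, LBA) hash table and the LRU bookkeeping, so reads queue up behind writes. It is a simplified in-memory model, not the actual Volume Driver code.

```python
import threading
from collections import OrderedDict

class LockedReadCache:
    """Sketch of the old design: one lock guards both the (volume_id, lba) hash
    table and the LRU ordering, so readers wait on writers. Illustrative only."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lock = threading.Lock()
        self.index = OrderedDict()     # (volume_id, lba) -> SSD location, LRU-ordered

    def read(self, volume_id: str, lba: int):
        with self.lock:                         # readers block while a writer holds the lock
            key = (volume_id, lba)
            if key in self.index:
                self.index.move_to_end(key)     # touching an entry updates the LRU order
                return self.index[key]
            return None

    def write(self, volume_id: str, lba: int, location: int):
        with self.lock:                         # every write takes the same global lock
            if len(self.index) >= self.capacity:
                self.index.popitem(last=False)  # evict the least recently used entry
            self.index[(volume_id, lba)] = location
```
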
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size. This greatly improves the write bandwidth to the SSDs. ALBA also allows aligning CPU cores with the ASD processes and the underlying SSDs. By making the whole all-flash ALBA backend core aligned, the overhead of process switching can be minimised. Basically all operations on flash are now asynchronous, core aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.
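
The contrast with a log-structured approach is easy to see in a toy model: data is only ever appended, so writers never overwrite what readers might be looking at. The class below illustrates the idea only; it is not the ASD on-disk format.

```python
import os

class AppendOnlyFragmentStore:
    """Toy log-structured store: data is never updated in place, only appended.
    Illustrates the idea, not the actual ASD on-disk format."""
    def __init__(self, path: str):
        self.path = path
        self.log = open(path, "ab")          # append-only data file
        self.log.seek(0, os.SEEK_END)        # make tell() report the end of the log
        self.offsets = {}                    # fragment id -> (offset, length)

    def append(self, fragment_id: str, data: bytes) -> None:
        offset = self.log.tell()
        self.log.write(data)                 # large, sequential write; no in-place update
        self.log.flush()
        self.offsets[fragment_id] = (offset, len(data))

    def read(self, fragment_id: str) -> bytes:
        offset, length = self.offsets[fragment_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```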

Lower impact of an SSD failure

By moving the read cache to the ALBA backend the impact of an SSD failure is much lower. ALBA can perform erasure coding across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed and the impact of an SSD failure is limited as only a fraction of the cache is lost. So in case a single SSD fails, there is no reason to go to the HDD-based capacity backend as the reads can still be fulfilled based upon other fragments of the data which are still cached.

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migration wasn’t always successful due to the cold cache on the new host. By using the new distributed cache approach, we now have an always hot cache, even in case of (live) migrations.

We hope the above reasons prove that we took the right decision by moving the read cache to the ALBA backend. Want to see how to configure the ALBA read cache? Check out this GeoScale demo.

The different ALBA Backends explained

With the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the geoscale functionality. In a geoscale cluster, the data is spread over multiple datacenters. If one of the datacenters goes offline, the geoscale cluster stays up and running and continues to serve data.

The geoscale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collection of physical disks, devices or even other backends. Next to grouping disks or backends, it also defines how data is stored on its constituents: parameters such as the erasure coding/replication factor, compression and encryption need to be defined. Ordinarily a geoscale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, only had 1 type of backend, there are now 2 types: a local and a global backend.

  • A local backend allows grouping physical devices. This type is typically used to group disks within the same datacenter.
  • A global backend allows combining multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters (see the sketch below).
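
To make the distinction concrete, here is a minimal sketch of the two backend types; the class and field names are invented for this example and do not correspond to the actual Open vStorage model.

```python
from dataclasses import dataclass, field

@dataclass
class LocalBackend:
    """Groups physical devices, typically within one datacenter."""
    name: str
    datacenter: str
    devices: list[str] = field(default_factory=list)

@dataclass
class GlobalBackend:
    """Combines local backends, typically spanning multiple datacenters."""
    name: str
    backends: list[LocalBackend] = field(default_factory=list)

# Example: one capacity backend spread over three datacenters
global_hdd = GlobalBackend("capacity", [
    LocalBackend("hdd-dc1", "dc1", ["/dev/sdb", "/dev/sdc"]),
    LocalBackend("hdd-dc2", "dc2", ["/dev/sdb"]),
    LocalBackend("hdd-dc3", "dc3", ["/dev/sdb"]),
])
```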

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation happens based upon the performance of the devices in the datacenter. An SSD backend will be created with devices which are fast and have low latency, and an HDD backend will be created with slow(er) devices which are optimised for capacity. In some cases the SSD or HDD backend will be split into more backends if they contain many devices, for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend.
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. This doesn’t make sense of course for an SSD backend as the latency between the datacenters will be a performance limiting factor.
Another reason to create multiple backends is if you want to offer each customer his own set of physical disks for security or compliance reasons. In that case a backend is created per customer.

vPool

A vPool is a configuration template for vDisks, the volumes being served by Open vStorage. This template contains a whole range of parameters such as the blocksize to be used, the SCO size on the backend, the default write buffer size, the preset to be used for data protection, the hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. The last 2 are particularly interesting as they express how different ALBA backends are tied together. When you create a vPool you select a backend to store the volume data. This can be a local backend, SSD for an all-flash experience, or a global backend in case you want to spread data over multiple datacenters. This backend is used by every Storage Router serving the vPool. If you use a global backend across multiple datacenters, you will want to use some sort of caching in the local datacenter where the volume is running in order to keep the read latency as low as possible. To achieve this, assign a local SSD backend when extending the vPool to a certain Storage Router. All volumes being served by that Storage Router will, on a read, first check if the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use a different cache backend. This approach allows keeping hot data in the local SSD cache and storing cold data on the capacity backend which is distributed across datacenters. By using this approach Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.
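
Schematically, a vPool is little more than a named bundle of these settings. The dictionary below is a hedged illustration; the field names are made up for this post and do not match the actual vPool model.

```python
# Illustrative only: not the real Open vStorage vPool schema.
vpool = {
    "name": "vpool-geo",
    "block_size": 4096,               # bytes per LBA
    "sco_size": 64 * 1024 * 1024,     # size of a SCO on the backend
    "write_buffer": 512 * 1024 * 1024,
    "preset": "across-datacenters",   # data protection preset on the backend
    "backend": "capacity",            # global backend holding the volume data
    "fragment_cache": {               # per Storage Router: a local SSD backend as read cache
        "storage-router-dc1": "ssd-dc1",
        "storage-router-dc2": "ssd-dc2",
    },
}
```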

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, a grouping of vDisks which share the same config, are the glue between these different backends.

Seagate Kinetic Open Storage Project Plugfest

Open vStorage was invited to host a session during the Seagate Kinetic plugfest on Tuesday, September 20 to demo and discuss advances in Ethernet-connected storage. Kinetic is a drive architecture in which the drive is a key/value server with Ethernet connectivity. With Open vStorage we have created ALBA ASD software that mimics this key/value behaviour for normal SATA drives. Kinetic drives can of course also be used as archiving backend for an Open vStorage cluster.

Read more about the Kinetic Open Storage Project here.

I like to move it, move it

The vibe at the Open vStorage office is these days best explained by a song of the early nineties:

I like to move it, move it ~ Reel 2 Real

While summer is a quieter time in most companies, the Open vStorage office is buzzing like a beehive. Allow me to give you a short overview of what is happening:

  • We are moving into our new, larger and stylish offices. The address remains the same but we are moving into a completely remodeled floor of the Idola business center.
  • Next to physically moving desks at the Open vStorage HQ, we are also moving our code from BitBucket to GitHub. We have centralized all our code under https://github.com/openvstorage. To list a few of the projects: Arakoon (our consistent distributed key-value store), ALBA (the Open vStorage default ALternate BAckend) and of course Open vStorage itself. Go check it out!
  • Finishing up our Open vStorage 2.2 GA release.
  • Adding support for Red Hat and CentOS by merging in the CentOS branch. There is still some work to do around packaging, testing and upgrades so feel free to give a hand. As this was really a community effort, we owe everyone a big thank you.
  • Working on some very cool features (RDMA anyone?) but let’s keep those for a separate post.
  • Preparation for VMworld (San Francisco) and the OpenStack summit in Tokyo.

As you can see, many things are going on at once, so prepare for a hot Open vStorage fall!

Open vStorage 2.2 alpha 4

We released Open vStorage 2.2 Alpha 4 which contains the following bugfixes:

  • Update of the About section under Administration.
  • Open vStorage Backend detail page hangs in some cases.
  • Various bugfixes for the case where a vPool is added with a name which was previously used.
  • Hardening the vPool removal.
  • Fix daily scrubbing not running.
  • No log output from the scrubber.
  • Failing to create a vDisk from a snapshot tries to delete the snapshot.
  • ALBA discovery starts spinning if network is not available.
  • ASD is no longer used by the proxy even after it has been requalified.
  • Type checking through Descriptor doesn’t work consistently.

Open vStorage 2.2 alpha 3

Today we released Open vStorage 2.2 alpha 3. The only new features are on the Open vStorage Backend (ALBA) front:

  • Metadata is now stored with a higher protection level.
  • The protocol of the ASD is now more flexible in the light of future changes.

Bugfixes:

  • Make it mandatory to configure both read- and writecache during the ovs setup partitioner.
  • During add_vpool on devstack, the cinder.conf is updated with notification_driver which is incorrectly set as “nova.openstack.common.notifier.rpc_notifier” for Juno.
  • Added support for more physical disk configuration layouts.
  • ClusterNotReachableException during vPool changes.
  • Cannot extend vPool with volumes running.
  • Update button clickable when an update is ongoing.
  • Already configured storage nodes are now removed from the discovered ones.
  • Fix for ASDs which don’t start.
  • Issue where a slow long-running task could fail because of a timeout.
  • Message delivery from albamgr to nsm_host can get stuck.
  • Fix for an ‘ALBA namespace doesn’t exist’ error being raised while the namespace does exist.