SCOs, chunks & fragments

For frequent readers it is stating the obvious to say that ALBA is a complex piece of software. One of the darkest caves of the ALBA OCaml code is the one where SCOs, the objects coming from the Volume Driver, are split into smaller fragments. These fragments are subsequently stored on the ASDs in the ALBA backend. It is time to clear up the mist around policies, SCOs, chunks and fragments, as careless configuration of these values might result in performance loss or an explosion of the backend metadata.

The fragment basics

Open vStorage uses an append-only strategy for data written to a volume. Once enough data is accumulated, the Volume Driver hands the log-file, a SCO (Storage Container Object), over to the ALBA proxy. This ALBA proxy is responsible for encrypting, compressing and erasure coding or replicating the SCOs based upon the selected preset. One important part of the preset is the policy (k, m, c, x). These 4 numbers can have a great influence on the performance of your Open vStorage cluster. But for starters, let’s first recap the meaning of these 4 numbers:

  • k: the number of data fragments
  • m: the number of parity fragments
  • c: the minimum number of fragments that must be written before the write is acknowledged
  • x: the maximum number of fragments per storage node

When c is lower than k+m, one or more slow responding ASDs won’t have an impact on the write performance to the backend. The fragments which should have been stored on the slow ASD(s) will simply be rewritten at a later point in time by the maintenance process.
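As a purely illustrative sketch of this acknowledgement behaviour (the class and field names below are assumptions for readability, not the actual Open vStorage preset format):

    # Hypothetical model of an ALBA policy (k, m, c, x); not the real preset schema.
    from dataclasses import dataclass

    @dataclass
    class Policy:
        k: int  # data fragments
        m: int  # parity fragments
        c: int  # fragments that must be on disk before the write is acknowledged
        x: int  # maximum fragments per storage node

        def can_ack(self, fragments_written: int) -> bool:
            # A write is acknowledged as soon as c fragments are safely stored,
            # even if some of the remaining k + m - c fragments are still pending.
            return fragments_written >= self.c

    policy = Policy(k=8, m=2, c=8, x=2)
    print(policy.can_ack(8))  # True: slow ASDs holding the last 2 fragments don't block the ack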

This was the easy part of how these numbers can influence the performance. Now comes the hard part. When you have a SCO of, let’s say, 64MB, it is, according to the policy, split into k data fragments and m parity fragments. Assume k is set to 8; hence we should end up with 8 data fragments of 8MB each. There is however another (hidden) value which plays a role: the maximum fragment size. The fragment size does have an impact on the write performance as larger fragments tend to provide higher write bandwidth to the underlying hard disk. It is not a secret that traditional SATA disks love large pieces of consecutive data to write. But on the other hand, the bigger the fragments are, the less suitable they are for the fragment cache and the longer it takes to read them from the backend in case of cache misses. To summarize, the size of the fragments should be big, but not too big.

So to make sure fragments are not too big you can set a maximum fragment size. The default maximum fragment size is 4MB. As the fragment size in the example above was 8MB and the maximum fragment size for the backend is only 4MB, something needs to happen: chunking. Chunking splits large SCOs into smaller chunks so that the fragments of these chunks are smaller than the maximum fragment size. So in our example above the SCO will be split into smaller chunks. To calculate the number of chunks needed, a simple formula can be used:

Number of chunks = ROUNDUP(SCO size / min(k * maximum fragment size, SCO size))

In our example we end up with 2 chunks: roundup(64 / min(8*4, 64)) = 2. These 2 chunks are then erasure coded using the (k, m, c, x) policy. Basically you end up with 2 chunks of 8 data fragments of 4MB each, plus m parity fragments per chunk.
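A small Python sketch makes the arithmetic concrete (a simplified illustration of the formula above, not the actual ALBA code):

    import math

    MB = 1024 * 1024

    def number_of_chunks(sco_size: int, k: int, max_fragment_size: int) -> int:
        # A SCO is chunked so that each chunk, split into k data fragments,
        # yields fragments no larger than the maximum fragment size.
        return math.ceil(sco_size / min(k * max_fragment_size, sco_size))

    sco_size = 64 * MB
    k, m = 8, 2
    max_fragment_size = 4 * MB

    chunks = number_of_chunks(sco_size, k, max_fragment_size)   # 2 chunks
    fragment_size = sco_size / chunks / k                       # 4MB data fragments
    total_fragments = chunks * (k + m)                          # 2 x (8 + 2) = 20 fragments
    print(chunks, fragment_size / MB, total_fragments)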

Global Backends

So far we only covered the fragment basics, so let’s make it a bit more complex by introducing stacked backends. Open vStorage allows multiple local backends to be combined into a global backend. This means there are 2 sets of fragments: the fragments at the global level and the fragments at the local level. Let’s continue with our previous example where we had 64MB SCOs and a 4MB maximum fragment size. This means that the fragments which serve as input for the local backends are only 4MB. Assume that we also configure erasure coding with policy (k’, m’, c’, x’) at the local backend level. In that case each 4MB fragment will be split into another k’ data fragments and m’ parity fragments. If k’ is for example set to 8, you will end up with 512KB fragments.

There are 2 issues with this relatively small fragment size. The first issue was already outlined above: traditional SATA drives are optimized for large chunks of consecutive data, and 512KB is probably too small to reach the hard disk’s write bandwidth limit. This means we get suboptimal write performance. The second issue is related to the metadata size. Each object in the ALBA backend is referenced by metadata, and in order to optimize performance all metadata should be kept in RAM. Hence it is essential to keep the data/metadata ratio as high as possible in order to keep the RAM required to address the whole backend under control.

In the above example, with an (8, 2, c, x) policy for both the global and the local backend, we would end up with around 10KB of metadata for every 64MB SCO. With a better choice of global policy, (4, 1, c, x), and a maximum fragment size of 16MB on the global backend, the metadata for the same SCO is only 5KB. This means that with the same amount of RAM reserved for the metadata, twice the amount of backend storage can be addressed. Next to storing the metadata in RAM, the metadata is also persistently stored on disk (NVMe, SSD) in an Arakoon cluster. By default Arakoon uses a 3-way replication scheme, so with the optimized settings the metadata will occupy 6 times less disk space. The optimal global policy of (4, 1, c, x) will, next to a lower memory footprint for the metadata, also provide better performance as 4MB fragments are written to the SATA drives instead of the smaller 512KB fragments.
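To make the fragment arithmetic of stacked backends explicit, here is an illustrative Python sketch. It assumes the local backend uses the same (k, m) shape as the global one in each scenario, which matches the fragment sizes quoted above; it is not the actual ALBA implementation:

    import math

    MB = 1024 * 1024

    def fragment_size(object_size: int, k: int, max_fragment_size: int) -> float:
        # Size of the data fragments after chunking an object with policy parameter k.
        chunks = math.ceil(object_size / min(k * max_fragment_size, object_size))
        return object_size / chunks / k

    sco = 64 * MB

    # (8, 2, c, x) at both levels with a 4MB maximum fragment size:
    global_frag = fragment_size(sco, k=8, max_fragment_size=4 * MB)                   # 4MB
    local_frag = fragment_size(int(global_frag), k=8, max_fragment_size=4 * MB)       # 512KB hits the SATA drives

    # (4, 1, c, x) at the global level with a 16MB maximum fragment size:
    global_frag2 = fragment_size(sco, k=4, max_fragment_size=16 * MB)                 # 16MB
    local_frag2 = fragment_size(int(global_frag2), k=4, max_fragment_size=16 * MB)    # 4MB hits the SATA drives

    print(local_frag / 1024, local_frag2 / MB)  # 512.0 (KB) vs 4.0 (MB)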

Conclusion

Whatever you decide as ALBA backend policy, SCO size and maximum fragment size, choose wisely as these values have an impact on various aspects of the Open vStorage cluster, ranging from performance to Total Cost of Ownership (TCO).

Open vStorage High Availability (HA)

Last week I received an interesting question from a customer:

What about High-Availability (HA)? How does Open vStorage protect against failures?

This customer was right to ask that question. In case you run a large-scale, multi-petabyte storage cluster, HA should be one of your key concerns. Downtime in such a cluster doesn’t only lead to production loss but might be a real PR disaster or even lead to foreclosure. When end-customers start leaving your service, it can become a slippery slope and before you know it there are no customers left on your cluster. Hence, asking the HA question beforehand is a best practice for every storage engineer tasked with the due diligence of a new storage technology. Over the past few years we have already devoted a lot of words to Open vStorage HA, so I thought it was time for a summary.

In this blog post I will discuss the different HA scenarios starting from top (the edge) to bottom (the ASD).

The Edge

To start an Edge block device, you need to pass the IP and port of a Storage Router together with the vPool of the vDisk. On the initial connection the Storage Router returns a list of fail-over Storage Routers to the Edge. The Edge caches this information and automatically switches to another Storage Router in case it can’t communicate with its current Storage Router for 15 seconds.
Periodically the Edge also asks the Storage Router to which Storage Router it should connect. This way the Storage Router can instruct the Edge to connect to another Storage Router, for example because the original Storage Router will be shut down.
For more details, check the following blog post about Edge HA.
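As a rough, hypothetical sketch of this failover behaviour (class and method names are invented for illustration; this is not the actual Edge code):

    import time

    FAILOVER_TIMEOUT = 15  # seconds without contact before switching Storage Routers

    class EdgeClient:
        def __init__(self, initial_router, vpool):
            self.vpool = vpool
            self.current = initial_router
            # On the initial connection the Storage Router hands back a fail-over list.
            self.failover_routers = self.current.get_failover_routers(vpool)
            self.last_contact = time.monotonic()

        def send_io(self, request):
            try:
                reply = self.current.handle(request)
                self.last_contact = time.monotonic()
                return reply
            except ConnectionError:
                if time.monotonic() - self.last_contact > FAILOVER_TIMEOUT:
                    self.switch_router()
                raise

        def switch_router(self):
            # Fall back to the next cached fail-over Storage Router.
            self.current = self.failover_routers.pop(0)
            self.last_contact = time.monotonic()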

The Storage Router

The Storage Router also has multiple HA features for the data path. As a vDisk can only be active on and owned by a single Volume Driver (the block-to-object conversion process of the Storage Router), a mechanism is in place to make sure the ownership of a vDisk can be handed over to another Storage Router (happy path) or stolen by it (unhappy path). Once the ownership is transferred, the volume is started on the new Storage Router and IO requests can be processed. In case the old Storage Router would still try to write to the backend, fencing kicks in and prevents data from being stored on the backend.
The ALBA proxy is responsible for encrypting, compressing and erasure coding the Storage Container Objects (SCOs) coming from the Volume Driver and sending the fragments to the ASD processes on the SSD/SATA disks. Each Storage Router also has multiple proxies and can switch between these proxies in case of issues or timeouts.

The ALBA Backend

An ALBA backend typically consists of multiple physical disks across multiple servers. The proxies generate redundant parity fragments via erasure coding which are stored across all devices of the backend. As a result, a device or even a complete server failure doesn’t lead to data loss. On top of that, backends can be composed recursively. Let’s take as an example the case where you have 3 data centers. One could create a (local) backend containing the disks of each data center and create a (global) backend on top of these (local) backends. Data could for example be replicated 3 times, one copy in each data center, and erasure coded within the data center for storage efficiency. Using this approach a data center outage wouldn’t cause any data loss.
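Purely as an illustration of that 3-data-center example (the names and the policy tuples are assumptions, not the real backend schema):

    # One local backend per data center, each erasure coding across its own disks.
    local_backends = {
        "dc1-hdd": {"disks": ["dc1-disk-%d" % i for i in range(12)], "policy": (8, 2, 8, 2)},
        "dc2-hdd": {"disks": ["dc2-disk-%d" % i for i in range(12)], "policy": (8, 2, 8, 2)},
        "dc3-hdd": {"disks": ["dc3-disk-%d" % i for i in range(12)], "policy": (8, 2, 8, 2)},
    }

    # The global backend treats each local backend as one constituent and keeps a
    # full copy in every data center (replication factor 3), so losing an entire
    # data center still leaves two intact copies.
    global_backend = {
        "constituents": list(local_backends),
        "replication_factor": 3,
    }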

The management path HA

The previous sections of this blog post discussed the HA features of the data path. The management path is also highly available. The GUI and API can be reached from all master nodes in the cluster. The metadata is also stored redundantly and is spread across multiple nodes or even data centers. Open vStorage has 2 types of metadata: the volume metadata and the backend metadata. The volume metadata is stored in a networked RocksDB using a master-slave concept. More information about that can be found here and in a video here.
The backend metadata is stored in our own, in-house developed, always consistent key-value store named Arakoon. More info on Arakoon can be found here.

That’s in a nutshell how Open vStorage makes sure a disk, server or data center disaster doesn’t lead to storage downtime.

NSM and ABM, Arakoon teamwork

In an earlier post we shed some light on Arakoon, our own always consistent distributed key-value database. Arakoon is used in many parts of the Open vStorage platform. One of the use cases is to store the metadata of the native ALBA object store. Do note that ALBA is NOT a general purpose object store but specifically crafted and optimized for Open vStorage. ALBA uses a collection of Arakoon databases to store where and how objects are stored on the disks in the backend. Typically the SCOs and TLogs of each vDisk end up in a separate bucket, a namespace, on the backend. For each object in the namespace there is a manifest that describes where and how the object is stored on the backend. To glue the namespaces, the manifests and the disks in the backend together, ALBA uses 2 types of Arakoon databases: the ALBA Backend Manager (ABM) and one or more NameSpace Managers (NSM).

ALBA Manager

The ALBA Manager (ABM) is the entry point for all ALBA clients which want to store or retrieve something from the backend. The ALBA Manager DB knows which physical disks belong to the backend, which namespaces exist and on which NSM hosts they can be found.
To optimize the Arakoon DB it is loaded with the albamgr plugin, a collection of ABM-specific user functions. Typically there is only a single ABM in a cluster.

NSM

A NameSpace Manager (NSM) is an Arakoon cluster which holds the manifests for the namespaces assigned to it. Which NSM manages which namespaces is registered with the ALBA Manager. The NSM also provides the remote API, offered by the NSM host, to manipulate most of the object metadata during normal operation. Its coordinates can be retrieved from the ALBA Manager by (proxy) clients and maintenance agents.

To optimize the Arakoon DB it is loaded with the nsm_host plugin, a collection of NSM host-specific user functions. Typically there are multiple NSM clusters for a single ALBA backend. This allows the backend to scale both in capacity and in performance.

IO requests

Let’s have a look at the IO path. Whenever the Volume Driver needs to store an object on the backend, a SCO or a TLog, it hands the object to one of the ALBA proxies on the same host. The ALBA proxy contains an ALBA client which communicates with the ABM to learn on which NSM and disks it can store the object. Once the object is stored on the disks, the manifest with the metadata is registered in the NSM. For performance reasons the different fragments of the object and the manifest can be cached by the ALBA proxy.

In case the Volume Driver needs data from the backend, because it is no longer in the write buffer, it requests the proxy to fetch the exact data by asking for a SCO location and offset. In case the right fragments are in the fragment cache, the proxy returns the data immediately to the Volume Driver. Otherwise it uses the manifest from its cache or, if the manifest isn’t cached, it contacts the ABM to find the right NSM and retrieves the manifest from it. Based upon the manifest the ALBA client fetches the data it needs from the physical disks and provides it to the Volume Driver.
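The read path can be summarised in a short, purely illustrative Python sketch (all names are hypothetical; the real proxy is written in OCaml and far more involved):

    def read(proxy, sco_name, offset, length):
        # 1. Fast path: the requested fragments are already in the fragment cache.
        data = proxy.fragment_cache.get(sco_name, offset, length)
        if data is not None:
            return data

        # 2. Find the manifest: use the cached one, or ask the ABM which NSM owns
        #    the namespace and fetch the manifest from that NSM.
        manifest = proxy.manifest_cache.get(sco_name)
        if manifest is None:
            nsm = proxy.abm.lookup_nsm(proxy.namespace)
            manifest = nsm.get_manifest(sco_name)
            proxy.manifest_cache.put(sco_name, manifest)

        # 3. The manifest tells us on which ASDs the fragments live; fetch, decode
        #    (erasure coding) and return the requested range to the Volume Driver.
        fragments = [asd.fetch(frag_id) for asd, frag_id in manifest.locate(offset, length)]
        return proxy.decode(fragments, offset, length)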

Fargo GA

After 3 Release Candidates and extensive testing, the Open vStorage team is proud to announce the GA (General Availability) release of Fargo. This release is packed with new features. Allow us to give a small overview:

NC-ECC presets (global and local policies)

NC-ECC (Network Connected-Error Correction Code) is an algorithm to store Storage Container Objects (SCOs) safely in multiple data centers. It consists of a global preset (across data centers) and multiple local presets (within a single data center). The NC-ECC algorithm is based on forward error correction codes and is further optimized for use in a multi-data-center approach. When there is a disk or node failure, additional chunks will be created using only data from within the same data center. This ensures the bandwidth between data centers isn’t stressed in case of a simple disk failure.

Multi-level ALBA

The ALBA backend now supports different levels. An all-SSD ALBA backend can be used as a performance layer in front of the capacity tier. Data is removed from the cache layer using a random eviction or Least Recently Used (LRU) strategy.

Open vStorage Edge

The Open vStorage Edge is a lightweight block driver which can be installed on Linux hosts and connects with the Volume Driver over the network (TCP/IP). By creating separate components for the Volume Driver and the Edge, compute and storage can scale independently.

Performance optimized Volume Driver

By limiting the size of a volume’s metadata, the metadata now fits completely in RAM. To keep the metadata at an absolute minimum, deduplication was removed. You can read more about why we removed deduplication here. Other optimizations are multiple proxies per Volume Driver (the default amount is 2), bypassing the proxy and going straight from the Volume Driver to the ASD in case of partial reads, and local read preference in case of global backends (try to read from ASDs in the same data center instead of going over the network to another data center).

Multiple ASDs per device

For low latency devices adding multiple ASDs per device provides a higher bandwidth to the device.

Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge. With Fargo all config files are now stored in a distributed config management system on top of our distributed database, Arakoon. More info can be found here.

Ubuntu 16.04

Open vStorage is now supported on Ubuntu 16.04, the latest Long Term Support (LTS) version of Ubuntu.

Smaller features in Fargo:

  • Improved the speed of the non-cached API and GUI queries by a factor of 10 to 30.
  • Hardening the remove node procedure.
  • The GUI is adjusted to better highlight clusters which are spread across multiple sites.
  • The failure domain concept has been replaced by tag based domains. ASD nodes and storage routers can now be tagged with one or more tags. Tags can be used to identify a rack, site, power feed, etc.
  • 64TB volumes.
  • Browsable API with Swagger.
  • ‘asd-manager collect logs’, identical to ‘ovs collect logs’.
  • Support for the removal of the asd-manager packages.

Since this Fargo release introduces a completely new architecture (you can read more about it here) there is no upgrade possible between Eugene and Fargo. The full release notes can be found here.

Accelerated ALBA as read cache

With the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post, but I would also like to shed some light on the various reasons why we took the decision to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache (on every write), old data in the cache had to be overwritten. In order to evict data from the cache, a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while. The lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also causes delays for read requests as the lock prevents data from being safely read. Basically, in order to increase the performance we had to move towards a lockless read cache where data isn’t updated in place.
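To illustrate why the old design serialised everything, here is a deliberately naive sketch of a lock-protected hash + LRU read cache (illustrative Python, not the actual Volume Driver code):

    import threading
    from collections import OrderedDict

    class LockedReadCache:
        """Naive read cache keyed on (volume_id, lba) with LRU eviction.

        A single lock protects both the hash table and the LRU order, so every
        write (and every eviction) briefly blocks all readers - exactly the
        contention an append-only, lockless design avoids.
        """

        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()      # (volume_id, lba) -> cached block
            self.lock = threading.Lock()      # one lock for hash table + LRU list

        def get(self, volume_id, lba):
            with self.lock:                   # readers wait while a writer holds the lock
                key = (volume_id, lba)
                if key not in self.entries:
                    return None
                self.entries.move_to_end(key) # maintain LRU order
                return self.entries[key]

        def put(self, volume_id, lba, block):
            with self.lock:                   # hash table and LRU list updated together
                key = (volume_id, lba)
                self.entries[key] = block
                self.entries.move_to_end(key)
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)  # evict least recently used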
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size. This greatly improves the write bandwidth to the SSDs. ALBA also allows cores to be aligned with the ASD processes and underlying SSDs. By making the whole all-flash ALBA backend core aligned, the overhead of process switching can be minimised. Basically all operations on flash are now asynchronous, core aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.

Lower impact of an SSD failure

By moving the read cache to the ALBA backend, the impact of an SSD failure is much lower. ALBA allows erasure coding to be performed across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed and the impact of an SSD failure is limited, as only a fraction of the cache is lost. So in case a single SSD fails, there is no reason to go to the HDD-based capacity backend, as the reads can still be fulfilled based upon the other fragments of the data which are still cached.

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migration wasn’t always successful due to the cold cache on the new host. By using the new distributed cache approach, we now have an always hot cache, even in case of (live) migrations.

We hope the above reasons prove that we took the right decision by moving the read cache to the ALBA backend. Want to see how to configure the ALBA read cache? Check out this GeoScale demo.

The different ALBA Backends explained

With the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the geoscale functionality. In a geoscale cluster, the data is spread over multiple datacenters. If one of the datacenters goes offline, the geoscale cluster stays up and running and continues to serve data.

The geoscale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collection of physical disks, devices or even other backends. Next to grouping disks or backends, it also defines how data is stored on its constituents: parameters such as erasure coding/replication factor, compression and encryption need to be defined. Ordinarily a geoscale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, only had 1 type of backend, there are now 2 types: a local and a global backend.

  • A local backend allows grouping physical devices. This type is typically used to group disks within the same datacenter.
  • A global backend allows combining multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters.

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation happens based upon the performance of the devices in the datacenter. An SSD backend will be created with devices which are fast and low latency, and an HDD backend will be created with slow(er) devices which are optimised for capacity. In some cases the SSD or HDD backend will be split into more backends if they contain many devices, for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend.
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. This of course doesn’t make sense for an SSD backend, as the latency between the datacenters will be a performance-limiting factor.
Another reason to create multiple backends is if you want to offer each customer his own set of physical disks for security or compliance reasons. In that case a backend is created per customer.

vPool

A vPool is a configuration template for vDisks, the volumes being served by Open vStorage. This template contains a whole range of parameters such as the blocksize to be used, the SCO size on the backend, the default write buffer size, the preset to be used for data protection, the hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. These last 2 are particularly interesting as they express how different ALBA backends are tied together. When you create a vPool you select a backend to store the volume data. This can be a local backend, for example an SSD backend for an all-flash experience, or a global backend in case you want to spread data over multiple datacenters. This backend is used by every Storage Router serving the vPool.

If you use a global backend across multiple datacenters, you will want to use some sort of caching in the local datacenter where the volume is running, in order to keep the read latency as low as possible. You achieve this by assigning a local SSD backend when extending a vPool to a certain Storage Router. All volumes being served by that Storage Router will, on a read, first check if the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use a different cache backend. This approach allows hot data to be kept in the local SSD cache while cold data is stored on the capacity backend which is distributed across datacenters. By using this approach Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.
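As a rough illustration of how these pieces fit together (the parameter names are invented for readability and do not match the actual API):

    # Hypothetical vPool template; the real parameter names differ.
    vpool = {
        "name": "vpool-geo",
        "block_size": 4096,                  # bytes per block
        "sco_size": 64 * 1024 * 1024,        # SCO size on the backend
        "write_buffer": 512 * 1024 * 1024,   # default write buffer per vDisk
        "preset": "nc-ecc-default",          # data protection preset (policies, compression, encryption)
        "backend": "global-hdd",             # capacity backend, spread across datacenters
    }

    # When extending the vPool to a Storage Router, a datacenter-local SSD backend
    # is assigned as cache so reads are served close to where the volume runs.
    extend_parameters = {
        "storage_router": "storagerouter-dc1-01",
        "cache_backend": "dc1-ssd",
    }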

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, a grouping of vDisks which share the same config, are the glue between these different backends.

Seagate Kinetic Open Storage Project Plugfest

Open vStorage was invited to host a session during the Seagate Kinetic plugfest on Tuesday, September 20 to demo and discuss advances in Ethernet-connected storage. Kinetic is a drive architecture in which the drive is a key/value server with Ethernet connectivity. With Open vStorage we have created ALBA ASD software that mimics this key/value behaviour for normal SATA drives. Kinetic drives can of course also be used as archiving backend for an Open vStorage cluster.

Read more about the Kinetic Open Storage Project here.

I like to move it, move it

The vibe at the Open vStorage office these days is best explained by a song from the early nineties:

I like to move it, move it ~ Reel 2 Real

While summer is in most companies a quieter time, the Open vStorage office is buzzing like a beehive. Allow me to give you a short overview of what is happening:

  • We are moving into our new, larger and stylish offices. The address remains the same but we are moving into a completely remodeled floor of the Idola business center.
  • Next to physically moving desks at the Open vStorage HQ, we are also moving our code from BitBucket to GitHub. We have centralized all our code under https://github.com/openvstorage. To list a few of the projects: Arakoon (our consistent distributed key-value store), ALBA (the Open vStorage default ALternate BAckend) and of course Open vStorage itself. Go check it out!
  • Finishing up our Open vStorage 2.2 GA release.
  • Adding support for RedHat and CentOS by merging in the CentOS branch. There is still some work to do around packaging, testing and upgrades, so feel free to give a hand. As this was really a community effort, we owe everyone a big thank you.
  • Working on some very cool features (RDMA anyone?) but let’s keep those for a separate post.
  • Preparation for VMworld (San Francisco) and the OpenStack summit in Tokyo.

As you can see, many things going on at once so prepare for a hot Open vStorage fall!

Open vStorage 2.2 alpha 4

We released Open vStorage 2.2 Alpha 4 which contains the following bugfixes:

  • Update of the About section under Administration.
  • Open vStorage Backend detail page hangs in some cases.
  • Various bugfixes for the case where a vPool is added with a name that was previously used.
  • Hardening the vPool removal.
  • Fix daily scrubbing not running.
  • No log output from the scrubber.
  • Failing to create a vDisk from a snapshot tries to delete the snapshot.
  • ALBA discovery starts spinning if network is not available.
  • ASD is no longer used by the proxy even after it has been requalified.
  • Type checking through Descriptor doesn’t work consistently.