SCOs , chunks & fragments

For frequent readers it is stating the obvious to say that ALBA is a complex piece of software. One of the most dark caves of the ALBA OCaml code is the one where SCOs, the objects coming from the Volume Driver, are split into objects. These objects are subsequently stored on the ASDs in the ALBA backend. It is time to clear up the mist around policies, SCOs, chunks and fragments as uncareful setting of these values might result in performance loss or an explosion of the backend metadata.

The fragment basics

Open vStorage uses an append-only strategy for data written to a volume. Once enough data is accumulated, the Volume Driver hands the log-file, a SCO (Storage Container Object), over to the ALBA proxy. This ALBA proxy is responsible for encrypting, compressing and erasure coding or replicating the SCOs based upon the selected preset. One important part of the preset is the policy (k, m, c, x). These 4 numbers can have a great influence on the performance of your Open vStorage cluster. But for starters, let’s first recap the meaning of these 4 numbers:

  • k: the amount of data fragments
  • m: the amount of parity fragments
  • c: the minimum number of fragments been written before the write is acknowledged
  • x: the maximum number of fragments per storage node

When c is lower than k+m, one or more slow responding ASDs won’t have impact on the write performance to the backend. The fragments which should have been stored on the slow ASD(s) will simply be rewritten at a later point in time by the maintenance process.

This was the easy part of how these numbers can influence the performance. Now comes the hard part. When you have a SCO of let’s say 64MB it is according to the policy split into k data objects and m parity objects. Assume k is set to 8 and hence we should end up with 8 objects of 8MB. There is however another (hidden) value which plays a role: the maximum fragment size. The fragment size does have an impact on the write performance as larger fragments tend to provide higher write bandwidth to the underlying hard disk. It is not a secret that traditional SATA disks love large pieces of consecutive data to write. But on the other hand, the bigger the fragments are, the less relevant they are to cache in the fragment cache and the longer it takes to read them from the backend in case of cache misses. To summarize, the size of the fragments should be big but not too big.

So to make sure fragments are not too big you can set a maximum fragment size. The default maximum fragment size is 4MB. As the fragment size in the example above was 8MB and the maximum fragment size for the backend is only 4MB something will need to happen: chunking. Chunking splits large SCOs into smaller chunks so the fragments of these chunks are smaller than the maximum fragment size. So in our example above the SCO will be split in smaller chunks. To calculate the amount of chunks needed, a simple formula can be used:

Amount of chunks = ROUNDUP(SCO size/min(k*maximum fragment size,SCO size))

In the our example we end up with 2 chunks – roundup(64/min(8*4,64). These 2 chunks are next erasure coded using the (k, m, c, x) policy. Basically you end up with 2 chunks of 8 4MB fragments and per chunk an additional m parity fragments.

Global Backends

So far we only covered the fragment basics so let’s make it a bit more complex by introducing stacked backends. Open vStorage allows multiple local backends to be combined into a global backend. This means there are 2 sets of fragments: the fragments at the global level and at the local level. Let’s continue with our previous example where we had 64MB SCOs and a 4MB fragment size. This means that the fragments which serve as input for the local backends are only 4MB. Assume that we also configure erasure coding with policy (k’, m’, c’, x’) at the local backend level. In that case each 4MB fragment will be split into another k’ fragments and m’ parity fragments. If k’ is for example set to 8, you will end up with 512KB fragments. There are 2 issues with this relatively small size of the fragments. The first issue was already outlined above. Traditional SATA drives are optimized for large chunks of consecutive data and 512KB is probably too small to reach the hard disks’ write bandwidth limit. This means we have suboptimal write performance. The second issue is related to the metadata size. Each object in the ALBA backend is referenced by metadata and in order to optimize the performance all metadata should be kept in RAM. Hence it is essential to keep the data/metadata ratio as high as possible in order to keep the required RAM to address the whole backend under control. In the above example with an (8, 2, c, x) policy for both the global and local backend we would end up with around 10KB of metadata for every 64MB SCO. With an optimal selection of the global policy (4,1, c, x) and a maximum fragment size of 16MB on the global backend, the metadata for the same SCO is only 5KB. This means that with the same amount of RAM reserved for the metadata, twice the amount of backend storage can be addressed. Next to storing the metadata in RAM, the metadata is also persistently store The write bandwidth to the backend will on top be higher as 4MB fragments are written to the SATA drives instead of the smaller 512KB fragments.d on disk (NVMe, SSD) in an Arakoon cluster. By default Arakoon uses a 3-way replication scheme so with the optimized settings the metadata will occupy 6 time less disk space.

Conclusion

Whatever you decide as ABLA backend policy, SCO size and maximum fragment size, choose wisely as these values have an impact on various aspects of the Open vstorage cluster ranging from performance to Total Cost of Ownership (TCO).

Open vStorage High Availability (HA)

Last week I received an interesting question from a customer:

What about High-Availability (HA)? How does Open vStorage protect against failures?

This customer was right to ask that question. In case you run a large scale, multi-petabyte storage cluster, HA should be one of your key concerns. Downtime in such a cluster doesn’t only lead to production loss but might be a real PR disaster or even lead to foreclosure. When end-customers start leaving your service, it can become a slippery slope and before you are aware there is no customer left on your cluster. Hence, asking the HA question beforehand is a best practice for every storage engineer challenged with doing a due diligence of a new storage technology. Over the past few years we already devoted a lot of words to Open vStorage HA so I thought it was time for a summary.

In this blog post I will discuss the different HA scenarios starting from top (the edge) to bottom (the ASD).

The Edge

To start an Edge block device, you need to pass the IP and port of a Storage Router with the vPool of the vDisk. On initial connection the Storage Router will return to the Edge a list of fail-over Storage Routers. The Edge caches this information and switches automatically to another Storage Router in case it can’t communicate with the Storage Router for 15 seconds.
Periodically the Edge also asks the Storage Router to which Storage Router it should connect. This way the Storage Router can instruct the Edge to connect to another Storage Router, for example because the original Storage Router will be shut down.
For more details, check the following blog post about Edge HA.

The Storage Router

The Storage Router also has multiple HA features for the data path. As a vDisk can only be active and owned by a single Volume Driver, the block to object conversion process of the Storage Router, a mechanism is in place to make sure the ownership of the vDisks can be handed over (happy path) or stolen (unhappy path) by another Storage Router. Once the ownership is transferred the volume is started on the new Storage Router and IO requests can be processed. In case the old Storage Router would still try to write to the backend, fencing will kick in which prevents data to be stored on the backend.
The ALBA proxy is responsible for encrypting, compressing and erasure code the Storage Container Objects (SCOs) coming from the Volume Driver and sending the fragments to the ASD processes on the SSD/SATA disks. Each Storage Router also has multiple proxies and can switch between these proxies in cases of issues and timeouts.

The ALBA Backend

An ALBA backend typically consist out of a multiple physical disks across multiple servers. The proxies generate redundant parity fragments via erasure coding which are stored across all devices of the backend. As a result, a device or even a complete server failure doesn’t lead to data loss. On top, backends can be recursively composed. Let’s take as example the case where you have 3 data centers. One could create a (local) backend containing the disks of each data center and create a (global) backend on top of these these (local) backends. Data could for example be replicated 3 times, one copy in each data center, and erasure coded within the data center for storage efficiency. Using this approach a data center outage wouldn’t cause any data loss.

The management path HA

The previous sections of this blog post discussed the HA features of the data path. The management path is also high available. The GUI and API can be reached from all master nodes in the cluster. The metadata is also stored redundantly and is spread across multiple nodes or even data centers. Open vStorage has 2 types of metadata: the volume metadata and the backend metadata. The volume metadata is stored in a networked RocksDB using a master-slave concept. More information about that can be found here and in a video here.
The backend metadata is stored in our own, in-house developed, always consistent key-value store named Arakoon. More info on Arakoon can be found here.

That’s in a nutshell how Open vStorage makes sure a disk, server or data center disaster doesn’t lead to storage downtime.

Docker and persistent Open vStorage volumes

Docker, the open-source container platform, is currently one of the hottest projects in the IT infrastructure business. With support of some of the world’s leading companies such as PayPal, Ebay, General Electric and many more, it is quickly becoming a cornerstone of any large deployment. Next, it also introduces a paradigm shift in how administrators see servers and applications.

Pets vs. cattle

In the past servers were treated like dogs and cats or any family pet: you give it a cute name, make sure it is in optimal condition, take care of it when it is sick, … With VMs a shift already occurred: names became more general like WebServer012 but keeping the VM healthy was still a priority for administrators. With Docker, VMs are decomposed into a sprawl of individual, clearly, well-defined applications. Sometimes there can even be multiple instances of the same application running at the same time. With thousands of containerized applications running on a single platform, it becomes impossible to treat these applications as pets but instead they are treated as cattle: they get an ID, when having issues they are taken off-line, terminated, and replaced.

Docker Storage

The original idea behind Docker was that containers would be stateless and hence didn’t need persistent storage. But over the years the insight has grown that also some applications and hence containers require persistent storage. Since the Docker platforms at large companies are housing thousands of containers, the required storage is also significant. Typically these platforms also span multiple locations or even clouds. Storage across locations and clouds is the sweet spot of the Open vStorage feature set. In order to offer distributed, persistent storage to containers, the Open vStorage team created a Docker plugin on top of the Open vStorage Edge, our lightweight block device. Note that the Docker plugin is part of the Open vStorage Enterprise Edition.

Open vStorage and Docker

Using Open vStorage to provision volumes for Docker is easy and straightforward thanks to Docker’s volume plugin system. To show how easy it is to create a volume for a container, I will give you the steps to run Minio, a minimal , open-source object store, on top of a vDisk.

First install the Open vStorage Docker plugin and the necessary packages on the compute host running Docker:
apt-get install libovsvolumedriver-ee blktap-openvstorage-ee-utils blktap-dkms volumedriver-ee-docker-plugin

Configure the configuration of the plugin by updating /etc/volumedriver-ee-docker-plugin/config.toml

[volumedriver]
hostname="IP"
port=26203
protocol="tcp"
username="root"
password="rooter"

Change the IP and port to the IP on which the vPool is exposed on the Storage Router you want to connect to (see Storage Router detail page).

Start the plugin service
systemctl start volumedriver-ee-docker-plugin.service

Create the Minio container and attach a disk for the data (minio_export) and one for the config (minio_config)

docker run --volume-driver=ovs-volumedriver-ee -p 9000:9000 --name minio \
-v minio_export:/export \
-v minio_config:/root/.minio \
minio/minio server /export

That is it. You now have a Minio object store running which stores its data on Open vStorage.

PS. Want to see more? Check the “Docker fun across Amazon Google and Packet.net”-video

Cache Policy Management: A Closer Look

Don’t you hate a noisy neighbour? Someone who blasts his preferred music just loud enough so you can hear it when trying to get some sleep or having a relaxing commute. Well the same goes for noisy neighbours in storage. It is not their deafening music that is annoying but the fact that other volumes can’t meet their desired performance as one volume gobbles up all IOPS.

Setting cache quota

This situation typically occurs when a single volume takes up the whole cache. In order to allow every vDisk to get a fair share of the cache, the Open vStorage Enterprise Edition allows to put a quota on the cache usage. When creating a vPool you can set a default quota per vDisk allowing each vDisk to get a fair share of the cache. Do note that the quota system is flexible. It is for example possible to set a larger value than the default for a specific vDisk in case it would benefit from more caching. It is even possible to oversubscribe the cache. This way the cache space can be optimally used.

Block and Fragment cache

One more point about cache management in Open vStorage. There are actually 2 types of cache which can be configured in Open vStorage. The first one caches complete fragments, the result of erasure coding a Storage Container Object (SCO). Hence it is called the fragment cache and it is typically used for newly written data. The stored fragments are typically large in size as to limit the amount of metadata and consequently these aren’t ideal to be used for (read) caching. The cache hit ratio is under normal circumstances inversely proportional to the size of the fragments. For that reason another cache, specifically tuned for read caching, was added. This block cache gets filled on reads and limits the size of the blocks in the cache to a couple of KB (f.e. 32-256KB). This means a more granular approach can be taken during cache eviction, eventually leading to a higher cache hit ratio.

The Open vStorage High Performance Read Mesh (HPRM)

When you are developing a storage solution your biggest worry is data loss. As an Open vStorage platform can lose a server or even a complete data center without actual data loss, we are pretty sure we have that base covered. The next challenge is to make sure that safely stored data can be quickly accessed when needed. In this blog section we already discussed a lot of the performance improvements we made over the past releases. We introduced the Edge component for guaranteed performance, the accelerated ALBA as read cache, multiple proxies per volume driver and various performance tuning options.

Today it is time to introduce the latest performance improvement: High Performance Read Mesh (HPRM). This HPRM is an optimization of the read path and allows the compute host to directly fetch the data from the drives where the data is located. Earlier the read path always had to go through the Volume Driver before the data was fetched from the ASD. This newly introduced short read path can only be taken in case the Edge has the necessary metadata of where (SCO, fragment, disk) each LBA’s data is stored. In case the Edge doesn’t have the needed metadata, for example because the cached metadata is outdated, the slow path is taken through the Volume Driver. For the write path nothing is changed as all writes go through the Volume Driver.

The short read path which bypasses the Volume Driver has 2 direct advantages: lower latency on reads and less network traffic as data only goes once over the network. Next, the introduction of the HPRM also allows for a cost reduction on the hardware front. Since the hosts running the Volume Driver are no longer in the read path in many cases, they are freed up and can focus on processing incoming writes. This means the ratio between compute hosts running the Edge and the Volume Driver can be increased. Since the Volume Driver hosts are typically beefy servers with expensive NVMe devices for the write buffer and the distributed databases, a significant change in the Compute/Volume Driver ratio means a significant reduction of the hardware cost.

HPRM, the technical details

Let’s have a look under the hood on how the HPRM works. First we will have a look at the write path. The application, f.e. the hypervisor, writes to the block device exposed by the Edge client. The Edge client will connect to its server part which in its turn, writes the data to the write buffer of the Volume Driver. Once enough writes are accumulated in the buffer, a SCO (Storage Container Object) is created and dispatched to the ALBA backend through the proxy. The proxy makes sure the data is spread across different ASDs according to the specified ALBA preset. Which ASDs contain the fragments of the SCO is stored in a manifest.
Once a read comes for the LBA, the Edge client will check its local metadata cache for the SCO info and manifest of the SCO. If the info is available the Edge will get the LBA data through the PRACC (Partial Read ACCelerator) client which can directly fetch the data from the ASDs. If the info isn’t available in the cache or if it is outdated, the manifest and SCO info are retrieved by the Edge client from the Volume Driver and stored in the Edge metadata cache.
The Edge also pushes the IO statistics to the Volume Driver so these can be queried by the Framework or the monitoring components. Gathering IO statistics is done by the Edge as it is the only component that has a view on both the fast path, through the PRACC, and the slow path through the Volume Driver.


Note that the High Performance Read Mesh is part of the Open vStorage Enterprise Edition. Contact us for more info on the Open vStorage Enterprise Edition.

Connecting Open vStorage with Amazon

In an earlier blog post we already discussed that Open vStorage is the storage solution to implement a hybrid cloud. In this blog post we will explain the technical details on how Open vStorage can be used in a hybrid cloud context.

The components

For frequent readers of this blog the different Open vStorage components should not hold any secrets anymore. For newcomers we will give a short overview of the different components:

  • The Edge: a lightweight software component which exposes a block device API and connects across the network to the Volume Driver.
  • The Volume Driver: a log structured volume manager which converts blocks into objects.
  • The ALBA Backend: an object store optimized as backend for the Volume Driver.

Let’s see how these components fit together in a hybrid cloud context.

The architecture

The 2 main components of any hybrid cloud are an on-site, private part and a public part. Key in a hybrid cloud is that data and compute can move between the private and the public part as needed. As part of this thought exercise we take the example where we want to store data on premises in our private cloud and burst with compute into the public cloud when needed. To achieve this we need to install the components as follows:

The Private Cloud part
In the private cloud we install the ALBA backend components to create one or more storage pools. All SATA disks are gathered in a capacity backend while the SSD devices are gathered in a performance backend which accelerates the capacity backend. On top of these storage pools we will deploy one or more vPools. To achieve this we run a couple of Volume Driver instances inside our private cloud. On-site compute nodes with the Edge component installed can use these Volume Drivers to store data on the capacity backend.

The Public Cloud part
For the Public Cloud part, let’s assume we use Amazon AWS, there are multiple options depending on the desired performance. In case we don’t require a lot of performance we can use an Amazon EC2 instance with KVM and the Edge installed. To bring a vDisk live in Amazon, a connection is made across the internet With the Volume Driver in the private cloud. Alternatively an AWS Direct Connect link can be used for a lower latency connection. Writes to Vdisk which is exposed in Amazon will be sent by the Edge to the write buffer of the Volume Driver. This means that writes will only be acknowledged to the application using the vDisk once the on premises located write buffer has received the data. Since the Edge and the Volume Driver connect over a rather high latency link, the write performance isn’t optimal in this case.
In case more performance is required we need an additional Storage Optimized EC2 instance with one or more NVMe SSDs. In this second EC2 instance a Volume Driver instance is installed and the vPool is extended from the on-site, private cloud into Amazon. The NVMe devices of the EC2 instance are used to store the write buffer and the metadata DBs. It is of course possible to add some more EBS Provisioned IOPS SSDs to the EC2 instance as read cache. For an even better performance, use dedicated Open vStorage powered cache nodes in Amazon. Since the write buffer is located in Amazon the latency will be substantially lower than in the first setup.

Use cases

As last part of this blog post we want to discuss some use cases which can be deployed on top of this hybrid cloud.

Analytics
Note that based upon the above architecture, a vDisk in the private cloud can be cloned into Amazon. The cloned vDisk can be used for business analytics inside Amazon without impacting the live workloads. When the analytics query is finished, the clone can be removed. The other way around is of course also possible. In that case the application data is stored in Amazon while the business analytics run on on-site compute hardware.

Disaster Recovery
Another use case is disaster recovery. As disaster recovery requires data to be on premises but also in the cloud additional instance need to be added with a large amount of HDD disks. Replication or erasure coding can be used to spread the data across the private and public cloud. In case of a disaster where the private cloud is destroyed, one can just add more compute instances running the Edge to bring the workloads live in the public cloud.

Data Safety
A last use case we want to highlight is for users that want to use public clouds but don’t thrust these public cloud providers with all of their data. In that case you need to get some instances in each public cloud which are optimized for storing data. Erasure coding is used to chop the data in encrypted fragments. These fragments are spread across the public clouds in such a way that non of the public clouds store the complete data set while the Edges and the Volume Drivers still can see the whole data set.

Keeping an eye on an Open vStorage cluster

Open vStorage offers as part of the commercial package 2 options to monitor an Open vStorage cluster. The OPS team acts as a second set of eyes or the OPS team has the keys, is in the driving seat and has full control. In both cases these large scale (+5PB) Open vStorage clusters send the logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let’s have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.

Elasticsearch & Kibana

To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearach, Logstash, Kibana) stack to gather logging information and centralise all this information into a single viewing pane.

The ELK-stack consists of 3 open source components:

  • Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
  • Logstash: a log pipeline tool which accepts various inputs and targets. In our case, it will read logging from a Redis queue and store them into Elasticsearch.
  • Kibana: a visualisation tool on top of Elasticsearch.

Next to the ELK stack, Journalbeat is used to fetch the logging from all nodes of the cluster and put them onto Redis. Logstash consumes the Redis queue and stores the log messages into Elasticsearch. By aggregating all logs from a cluster into a single, unified view, detecting anomalies or finding correlation between issues is easier.

InfluxDB & Grafana

The many statistics that are being tracked are stored into an InfluxDB, an open source database specifically designed to handle time series data. On top of the InfluxDB Grafana is used to visualize these statistics. The dashboards give a detailed view on the performance metrics of the cluster as a whole but also of the individual components. The statistics are provided in an aggregated view but a OPS member can also drill down to the smallest detail such as the individual vDisks level. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, safety of the objects in the backend to the amount of maintenance tasks that are running across the cluster.

CheckMK

To detect and escalate issues the Open vStorage team uses CheckMK, an extension to the open source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large scale (storage) clusters. These monitoring rules includes general checks such as the CPU and RAM of a host, the services, network performance and disk health but of course specific checks for Open vStorage components such as the Volume Driver or Arakoon have also been added. The output of the healthcheck also gets parsed by the CheckMK engine. In case of issues a fine-tuned escalation process is put into motion in order to resolve these issues quickly.

Accelerated ALBA as read cache

read cache performanceWith the Fargo release we introduce a new architecture which moves the read cache from the Volume Driver to the ALBA backend. I already explained the new backend concepts in a previous blog post but I would also like to shed some light on the various reasons why we took the decision to move the read cache to ALBA. An overview:

Performance

Performance is absolutely the main reason why we decided to move the read cache layer to ALBA. It allows us to remove a big performance bottleneck: locks. When the Volume Driver was in charge of the read cache, we used a hash based upon the volume ID and the LBA to find where the data was stored on the SSD of the Storage Router. When new data was added to the cache – on every write – old data in the cache had to be overwritten. In order to evict data from the cache a linked list was used to track the LRU (Least Recently Used) data. Consequently we had to lock the whole SSD for a while. The lock was required as the hash table (volume ID + LBA) and the linked list had to be updated simultaneously. This write lock also causes delay for read requests as the lock prevents data to be safely read. Basically, in order to increase the performance we had to move towards a lockless read cache where data isn’t updated in place.
This is where ALBA comes in. The ALBA backend doesn’t update data in place but uses a log-structured approach where data is always appended. As ALBA stores chunks of the SCOs, writes are consecutive and large in size. This greatly improves the write bandwidth to the SSDs. ALBA also allows to align cores with the ASD processes and underlying SSDs. By making the whole all-flash ALBA backend core aligned, the overhead of process switching can be minimised. Basically all operations on flash are now asynchronous, core aligned and lockless. All these changes allow Open vStorage to be the fastest distributed block store.

Lower impact of an SSD failure

By moving the read cache to the ALBA backend the impact of an SSD failure is much lower. ALBA allows to perform erasure coding across all SSDs of all nodes in the rack or datacenter. This means the read cache is now distributed and the impact of an SSD failure is limited as only a fraction of the cache is lost. So in case a single SSD fails, there is no reason to go the HDD based capacity backend as the reads can still be fulfilled based upon other fragments of the data which are still cached.

Always hot cache

While Open vStorage has always been capable of supporting live migration, we noticed that with previous versions of the architecture the migrate wasn’t always successful due to the cold cache on the new host. By using the new distributed cache approach, we now have have an always hot cache even in case of (live) migrations.

We hope the above reasons proof that we took the right decision by moving the read cache to ALBA backend. Want to see how you configure the ALBA read cache, check out this GeoScale demo.

The Edge, a lightweight block device

edge block storageWhen I present the new Open vStorage architecture for Fargo, I almost always receive the following Edge question:

What is the Edge and why did you develop it?

What is the Edge about?

The Edge is a lightweight software component which can be installed on a Linux host. It exposes a block device API and connects to the Storage Router across the network (TCP/IP or RDMA). Basically the applications believes it talks to a local block device (the Edge) while the volume actually runs on another host (Storage Router).

Why did we develop the Edge?

The reason why we have developed the Edge is quite simple: componentization. With Open vStorage we are mainly dealing with large, multi-petabyte deployments and having this Edge component gives additional benefits in large environments:

Scalability

In large environments you want to be able to scale the compute and storage part independently. In case you run Open vStorage hyper-converged, as advised with earlier versions, this isn’t possible. This has as consequence that if you need more RAM or CPU to run VMs, you had to also invest in more SSDs. With the Edge you can scale compute and storage independent.

Guaranteed performance

With Eugene the Volume Driver, the high performance distributed block layer, was running on the compute host together with the VMs. This results in the VMs and the Volume Driver fighting for the same CPU and RAM resources. This is a typical issue with hyper-converged solutions. The Edge component avoids this problem as it runs on the compute hosts (and requires only a small amount of resources) and the Volume Drivers runs on dedicated nodes and hence provides a predictable and consistent amount of IOPS to the VMs.

Limit the Impact of Updates

Storage software updates are a (storage) administrator’s worst nightmare. In previous Open vStorage versions an update of the Volume Driver required all VMs on that node to be migrated or brought down.With the Edge the Volume Driver can be updated in the background as each Edge/compute host has HA features and can automatically connect to another Volume Driver on request without the need of a VM migration.

The different ALBA Backends explained

open vstorage alba backendsWith the latest release of Open vStorage, Fargo, the backend implementation received a complete revamp in order to better support the geoscale functionality. In a geoscale cluster, the data is spread over multiple datacenters. If one of the datacenters would go offline, the geoscale cluster stays up and running and continues to serve data.

The geoscale functionality is based upon 2 concepts: Backends and vPools. These are probably the 2 most important concepts of the Open vStorage architecture. Allow me to explain in detail what the difference is between a vPool and a Backend.

Backend

A backend is a collections of physical disks, devices or even backends. Next to grouping disks or backends it also defines how data is stored on its constituents. Parameters such as erasure coding/replication factor, compression, encryption need to be defined. Ordinarily a geoscale cluster will have multiple backends. While Eugene, the predecessor release of Fargo, only had 1 type of backend, there are now 2 types: a local and a global backend.

  • A local backend allows to group physical devices. This type is typically used to group disks within the same datacenter.
  • A Global backend allows to combine multiple (local) backends into a single (global) backend. This type of backend typically spans multiple datacenters.

Backends in practice

In each datacenter of an Open vStorage cluster there are multiple local backends. A typical segregation happens based upon the performance of the devices in the datacenter. An SSD backends will be created with devices which are fast and low latency and an HDD backend will be created with slow(er) devices which are optimised for capacity. In some cases the SSD or HDD backend will be split in more backends if they contain many devices for example by selecting every x-th disk of a node. This approach limits the impact of a node failure on a backend.
Note that there is no restriction for a local backend to only use disks within the same datacenter. It is perfectly possible to select disks from different datacenters and add them to the same backend. This doesn’t make sense of course for an SSD backend as the latency between the datacenters will be a performance limiting factor.
Another reason to create multiple backends is if you want to offer each customer his own set of physical disks for security or compliance reasons. In that case a backend is created per customer.

vPool

A vPool is a configuration template for vDisks, volumes being served by Open vStorage. This template contains a whole range of parameters such as blocksize to be used, SCO size on the backend, default write buffer size, preset to be used for data protection, hosts on which the volume can live, the backend where the data needs to be stored and whether data needs to be cached. These last 2 are particularly interesting as they express how different ALBA backends are tied together. When you create a vPool you select a backend to store the volume data. This can be a local backend, SSD for an all-flash experience or a global backend in case you want to spread data over multiple datacenters. This backend is used for every Storage Router serving the vPool. If you use a global backend across multiple datacenters, you will want to use some sort of caching in the local datacenter where the volume is running. Do this in order to keep the read latency as low as possible. To achieve this by assign a local SSD backend when extending a vPool to a certain Storage Router. All volumes being served by that Storage Router will on a read first check if the requested data is in the SSD backend. This means that Storage Routers in different datacenters will use a different cache backend. This approach allows to keep hot data in the local SSD cache and store cold data on the capacity backend which is distributed across datacenters. By using this approach Open vStorage can offer stunning performance while distributing the data across multiple datacenters for safety.

A final note

To summarise, an Open vStorage cluster can have multiple and different ALBA backends: local vs. global backends, SSD and HDD backends. vPools, a grouping of vDisks which share the same config, are the glue between these different backends.