Open vStorage High Availability (HA)

Last week I received an interesting question from a customer:

What about High-Availability (HA)? How does Open vStorage protect against failures?

This customer was right to ask that question. In case you run a large-scale, multi-petabyte storage cluster, HA should be one of your key concerns. Downtime in such a cluster doesn’t just mean production loss; it can turn into a real PR disaster or even threaten the business itself. When end-customers start leaving your service, it can become a slippery slope and before you know it there are no customers left on your cluster. Hence, asking the HA question beforehand is a best practice for every storage engineer tasked with the due diligence of a new storage technology. Over the past few years we have already devoted a lot of words to Open vStorage HA, so I thought it was time for a summary.

In this blog post I will discuss the different HA scenarios, going from the top (the Edge) down to the bottom (the ASD).

The Edge

To start an Edge block device, you need to pass the IP and port of a Storage Router along with the vPool of the vDisk. On the initial connection the Storage Router returns a list of fail-over Storage Routers to the Edge. The Edge caches this information and automatically switches to another Storage Router in case it can’t communicate with the current Storage Router for 15 seconds.
Periodically the Edge also asks the Storage Router to which Storage Router it should connect. This way the Storage Router can instruct the Edge to connect to another Storage Router, for example because the original Storage Router will be shut down.
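To make this behaviour a bit more tangible, here is a minimal Python sketch of the fail-over logic described above. It is purely illustrative: the class, the method names and the way the fail-over list is fetched are assumptions, not the actual Edge client code; only the 15-second timeout comes from the behaviour above.

```python
FAILOVER_TIMEOUT = 15  # seconds of silence before the Edge switches Storage Router


class EdgeClientSketch:
    """Illustrative sketch only; names and structure are assumptions."""

    def __init__(self, ip, port, vpool):
        self.vpool = vpool
        self.current = (ip, port)
        # The initial handshake returns a list of fail-over Storage Routers,
        # which the Edge caches for later use.
        self.failover_routers = self._handshake(self.current)

    def _handshake(self, router):
        """Placeholder: connect to the Storage Router and fetch the fail-over list."""
        return []

    def _send(self, router, request, timeout):
        """Placeholder: ship one block request to the given Storage Router."""
        raise TimeoutError("no response from %s:%s" % router)

    def submit(self, request):
        try:
            return self._send(self.current, request, timeout=FAILOVER_TIMEOUT)
        except TimeoutError:
            # The current Storage Router did not answer within 15 seconds:
            # switch to the next cached fail-over Storage Router and retry.
            if not self.failover_routers:
                raise
            self.current = self.failover_routers.pop(0)
            return self._send(self.current, request, timeout=FAILOVER_TIMEOUT)
```

The periodic “which Storage Router should I connect to?” check mentioned above would simply update the current router (and the cached list) in the background.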
For more details, check the following blog post about Edge HA.

The Storage Router

The Storage Router also has multiple HA features for the data path. As a vDisk can only be active on and owned by a single Volume Driver, the block-to-object conversion process of the Storage Router, a mechanism is in place to make sure the ownership of a vDisk can be handed over (happy path) or stolen (unhappy path) by another Storage Router. Once the ownership is transferred, the volume is started on the new Storage Router and IO requests can be processed. In case the old Storage Router would still try to write to the backend, fencing kicks in and prevents that data from being stored on the backend.
The ALBA proxy is responsible for encrypting, compressing and erasure coding the Storage Container Objects (SCOs) coming from the Volume Driver and for sending the fragments to the ASD processes on the SSD/SATA disks. Each Storage Router has multiple proxies and can switch between them in case of issues or timeouts.
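A hedged sketch of that proxy switch-over could look like the snippet below. The proxy list, the timeout value and the function names are assumptions made for illustration; this is not the real Volume Driver code.

```python
PROXY_TIMEOUT = 5.0  # assumed timeout per proxy attempt, not the real value


def store_sco(sco, proxies):
    """Try each local ALBA proxy in turn until one accepts the SCO.

    The proxy compresses, encrypts and erasure-codes the SCO and spreads the
    resulting fragments over the ASDs; if a proxy misbehaves or times out,
    the caller simply moves on to the next proxy on the same Storage Router.
    """
    last_error = None
    for proxy in proxies:
        try:
            return proxy.upload(sco, timeout=PROXY_TIMEOUT)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc  # remember the failure and try the next proxy
    raise RuntimeError("no ALBA proxy available") from last_error
```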

The ALBA Backend

An ALBA backend typically consists of multiple physical disks across multiple servers. The proxies generate redundant parity fragments via erasure coding, which are stored across all devices of the backend. As a result, a device or even a complete server failure doesn’t lead to data loss. On top of that, backends can be recursively composed. Take as an example the case where you have 3 data centers. One could create a (local) backend containing the disks of each data center and create a (global) backend on top of these (local) backends. Data could for example be replicated 3 times, one copy in each data center, and erasure coded within the data center for storage efficiency. Using this approach a data center outage wouldn’t cause any data loss.
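To make the 3-data-center example concrete, here is a hedged sketch of how such a nested layout could be described, together with the resulting storage overhead. The dictionary layout and the k/m numbers are purely illustrative and do not match the real ALBA preset format.

```python
# Illustrative only: each data center runs a local backend that erasure-codes
# its copy into k=4 data fragments plus m=2 parity fragments, and a global
# backend on top keeps one copy per data center.
local_backend = {
    "policy": "erasure_coding",
    "k": 4,  # data fragments
    "m": 2,  # parity fragments: any 2 disks per data center may fail
}

global_backend = {
    "policy": "replication",
    "copies": 3,  # one copy per data center
    "children": ["dc1-local", "dc2-local", "dc3-local"],
}

# Raw-to-usable overhead of this layout: 3 copies x (4+2)/4 = 4.5x raw capacity,
# but a complete data center outage doesn't cause any data loss.
overhead = global_backend["copies"] * (local_backend["k"] + local_backend["m"]) / local_backend["k"]
print(overhead)  # 4.5
```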

The management path HA

The previous sections of this blog post discussed the HA features of the data path. The management path is also highly available. The GUI and API can be reached from all master nodes in the cluster. The metadata is also stored redundantly and is spread across multiple nodes or even data centers. Open vStorage has 2 types of metadata: the volume metadata and the backend metadata. The volume metadata is stored in a networked RocksDB using a master-slave concept. More information about that can be found here and in a video here.
The backend metadata is stored in our own, in-house developed, always consistent key-value store named Arakoon. More info on Arakoon can be found here.

That’s in a nutshell how Open vStorage makes sure a disk, server or data center disaster doesn’t lead to storage downtime.

Edge: HA, failure and the moving of volumes explained

Open vStorage is designed to be rock solid and survive failures. These failures can come in many forms and shapes: nodes might die, network connections might get interrupted, … Let’s give an overview of the different tactics that are used by Open vStorage when disaster strikes by going over some possible use cases where the new edge plays a role.

Use case 1: A hypervisor fails

In case the hypervisor fails, the hypervisor management (OpenStack, vCenter, …) will detect the failure and restart the VM on another hypervisor. Since the VM is started on another hypervisor, the VM will talk to the edge client on the new hypervisor. The edge client connects to a volume driver in the vPool and inquires which volume driver owns the disks of the VM. The volume driver responds who the owner is and the edge connects to the volume driver owning the volume. This all happens almost instantaneously and in the background, so the IO of the VM isn’t affected.

Use case 2: A Storage Router fails

In case a Storage Router, and hence the volume driver on it, dies, the edge client automatically detects that the connection to the volume driver is lost. Luckily the edge keeps a list of volume drivers which also serve the vPool, and it connects to one of the remaining volume drivers in the vPool. It is clear that the edge prefers to fail over to a volume driver which is close by, f.e. within the same datacenter. The new volume driver to which the edge connects detects that it isn’t the owner of the volume. As the old volume driver is no longer online, the new volume driver steals the ownership of the VM’s volume. Stealing is allowed in this case as the old volume driver is down. Once the new volume driver becomes the owner of the volumes, the edge client can start serving IO. This whole process happens in the background and halts the IO of the VM for only a fraction of a second.
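The hand-over versus steal decision can be summarized in a short sketch. All names below (the registry, the owner objects and their methods) are illustrative assumptions, not the actual Volume Driver API.

```python
def take_over_volume(volume, new_owner, registry):
    """Hedged sketch of the hand-over (happy path) vs. steal (unhappy path) logic."""
    old_owner = registry.get_owner(volume)
    if old_owner is not None and old_owner.is_reachable():
        # Happy path: ask the old owner to hand the volume over gracefully.
        old_owner.request_handover(volume, to=new_owner)
    else:
        # Unhappy path: the old volume driver is down, so ownership is stolen
        # by updating the distributed registry directly.
        registry.set_owner(volume, new_owner)
    new_owner.start_volume(volume)  # the edge can resume IO once this returns
```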

Use case 3: Network issues

In some exceptional cases it isn’t the hypervisor or the storage router that fails but the network in between. This is an administrator’s worst nightmare as it might lead to split-brain scenarios. Even in this case the edge is able to outlive the disaster. As the network connection between the edge and the volume driver is lost, the edge will assume the volume driver is dead. Hence, as in use case 2, the edge connects to another volume driver in the same vPool. The new volume driver first tries to contact the old volume driver.

Now there are 2 options:

  • The new volume driver can contact the old volume driver. After some IO is exchanged the new volume driver asks the old volume driver to hand over the volume. This handover doesn’t impact the edge.
  • The new volume driver cannot contact the old volume driver. In that case the new volume driver steals the volume from the old volume driver. It does this by updating the ownership of the volume in the distributed DB and by uploading a new key to the backend. As the ALBA backend uses a conditional write approach and only writes the IO to the disks of the backend if the accompanying key is valid, it can ensure that only the new volume driver is allowed to write to the backend. If the old volume driver would still be online (split brain) and try to update the backend, the write would fail as it is using an outdated key. This fencing mechanism is sketched below.
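The conditional write that makes this fencing possible can be illustrated with a small, self-contained toy model. The class and field names below are assumptions; ALBA’s real implementation is more involved, but the idea of rejecting writes that carry an outdated ownership key is the same.

```python
class BackendNamespace:
    """Toy model of fencing: the backend only accepts writes carrying the
    currently registered ownership key."""

    def __init__(self):
        self.current_key = None
        self.objects = {}

    def set_owner_key(self, key):
        # Uploaded by the volume driver that (now) owns the volume.
        self.current_key = key

    def conditional_write(self, key, name, data):
        if key != self.current_key:
            # An old volume driver (split brain) is still writing with an
            # outdated key: the write is refused and no data hits the disks.
            raise PermissionError("stale ownership key, write rejected")
        self.objects[name] = data


ns = BackendNamespace()
ns.set_owner_key("owner-2")                              # new owner uploads a fresh key
ns.conditional_write("owner-2", "sco_0001", b"...")      # accepted
try:
    ns.conditional_write("owner-1", "sco_0002", b"...")  # old owner: fenced off
except PermissionError as exc:
    print(exc)
```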

The Distributed Transaction Log explained

During my 1-on-1 sessions I quite often get the question how Open vStorage makes sure there is no data loss when a host crashes. As you probably already know, Open vStorage uses SSDs and PCIe flash cards inside the host where the VM is running to store incoming writes. All incoming writes for a volume get appended to a log file (SCO, Storage Container Object) and once enough writes are accumulated the SCO gets stored on the backend. Once the SCO is on the backend, Open vStorage relies on the functionality (erasure coding, 3-way replication, …) of the backend to make sure that data is stored safely.

This means there is a window where data is vulnerable: while the SCO is being constructed and not yet stored on the backend. To ensure the vulnerable data isn’t lost when a host crashes, incoming writes are also stored in the Distributed Transaction Log (DTL) on another host in the Open vStorage cluster. Note that the volume can even be restarted on another host than the one where the DTL was stored.

For the DTL of a volume you can select one of the following options as modus operandi; a small sketch of the three modes follows the list:

  • No DTL: when this option is selected incoming data doesn’t get stored in the DTL on another node. This option can be used when performance is key and some data loss is acceptable when the host or storage router goes down. Test VMs or VMs which are running batch or distributed applications (f.e. transcoding of files to another file) can use this option.
  • Asynchronous: when this option is selected the incoming writes are added to a queue on the host and replicated to the DTL on the other host once the queue reaches a certain size or if a certain time is exceeded. To ensure consistency, all outstanding data is synced to the DTL in case a sync is executed within the file system of the VM. Virtual Machines running on KVM can use this option. This mode balances data safety and performance.

  • Synchronous: when this option is selected, every write request gets synchronized to the DTL on the other host. This option should be selected when absolutely no data loss is acceptable (distributed NFS, HA iSCSI disks). Since this option synchronizes on every write, it is the slowest mode of the DTL. Note that in case the DTL can’t be reached (f.e. because the host is being rebooted), the incoming I/O isn’t blocked and doesn’t return an I/O error to the VM, but an out-of-band event is generated to restart the DTL on another host.
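The three modes could be sketched as follows. This is a simplified Python illustration; the queue size and flush interval are made-up numbers, and the real DTL protocol obviously doesn’t work on Python lists.

```python
import time

ASYNC_QUEUE_SIZE = 256      # illustrative threshold, configurable in reality
ASYNC_FLUSH_INTERVAL = 1.0  # seconds, equally illustrative


class DTLClientSketch:
    """Hedged sketch of the three DTL modes described above."""

    def __init__(self, mode, remote_dtl=None):
        self.mode = mode          # "no_dtl", "async" or "sync"
        self.remote = remote_dtl  # stand-in for the DTL process on another host
        self.queue = []
        self.last_flush = time.time()

    def on_write(self, data):
        if self.mode == "no_dtl":
            return                    # data only lives in the local write buffer
        if self.mode == "sync":
            self.remote.append(data)  # every single write is shipped immediately
            return
        # async: queue locally, flush once the size or time threshold is hit
        self.queue.append(data)
        if len(self.queue) >= ASYNC_QUEUE_SIZE or time.time() - self.last_flush >= ASYNC_FLUSH_INTERVAL:
            self.flush()

    def flush(self):
        # Also called when the guest file system issues a sync, to stay consistent.
        if self.remote is not None and self.queue:
            self.remote.extend(self.queue)
        self.queue.clear()
        self.last_flush = time.time()
```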

vMotion, Storage Router Teamwork

Important note: this blog post talks about vMotion, a VMware feature. KVM fans should not be disappointed, as Live Migration, the KVM version of vMotion, is also supported by Open vStorage. We use the term vMotion as it is the term most used for this feature by the general IT public.

In a previous blog post we explained why Open vStorage is different. One thing we do differently is not implementing a distributed file system. This sparked the interest of a lot of people but also raised questions for more clarification. Especially how we pulled off vMotion without a distributed file system or an expensive SAN drew a lot of fascination. Time for a blog post to explain how it all works under the hood.

Normal behavior

Under normal circumstances a volume, a disk of a Virtual Machine, can be seen by all hosts in the Open vStorage cluster as it is a file on the vPool (a datastore in VMware), but the underlying, internal object (Volume Driver volume) is owned by a single host and can only be accessed by this single host. Each host can see the whole content of the datastore as each NFS and FUSE instance shows all the files on the datastore. This means the hosts believe they are using shared storage, but in reality only the metadata of the datastore is shared between all hosts; the actual data is not shared at all. To share the metadata across hosts a distributed database is used. To keep track of which host is ‘owning’ the volume and hence can access the data, we use an Object Registry which is implemented on top of a distributed database. The technology which tricks hosts into believing they are using shared storage while only one host really has access to the data is the core Open vStorage technology. This core technology consists of 3 components which are available on all hosts with a Storage Router:
* The Object Router
* The Volume Driver
* The File Driver

The Object Router
The Object Router is the component underneath the NFS (VMware) and the FUSE (KVM) layer and dispatches requests for data to the correct core component. For each write the Object Router will check if it is the owner of the file on the datastore. In case the Object Router is the owner of the file, it will hand off the data to the underlying File or Volume Driver on the same Storage Router. Otherwise the Object Router will check in the Object Registry, stored in the distributed database, which Object Router owns the file and forwards the data to that Object Router. The same process is followed for read requests.
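A hedged sketch of that dispatch decision is shown below. The attribute and method names (`owns`, `object_registry`, `forward_write`, …) are invented for illustration and don’t match the real Object Router code.

```python
def dispatch_write(router, path, offset, data):
    """Illustrative sketch of the Object Router write path."""
    if router.owns(path):
        # This Object Router owns the file: hand the data to the local
        # Volume Driver (volumes) or File Driver (all other files).
        driver = router.volume_driver if router.is_volume(path) else router.file_driver
        return driver.write(path, offset, data)
    # Another Object Router owns the file: look it up in the Object Registry
    # (kept in the distributed database) and forward the request there.
    owner = router.object_registry.lookup_owner(path)
    return owner.forward_write(path, offset, data)
```

Read requests follow exactly the same decision tree.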

The Volume Driver
All the read and write requests for an actual volume (a flat-VMDK or raw file) are handled by the Volume Driver. This component is responsible for turning a Storage Backend into a block device. This is also the component which takes care of all the caching. Data which is no longer needed is sent to the backend to make room for new data in the cache. In case data is not in the cache but requested by the Virtual Machine, the Volume Driver will get the needed data from the backend. Note that a single volume is represented by a single bucket on the Storage Backend. It is important to see that only 1 Volume Driver will do the communication with the Storage Backend for a single volume.

The File Driver
The File Driver is responsible for all non-volume files (VM config files, …). The File Driver stores the actual content of these files on the Storage Backend. Each small file is represented by a single file or key/value pair on the Storage Backend. In case a file is bigger than 1MB, it is split into smaller pieces to improve performance. All the non-volume files for a single datastore end up in a single, shared bucket. It is important to see that only 1 File Driver will do the communication with the Storage Backend for a file in the datastore.
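The 1MB split can be sketched in a few lines. The key naming scheme below is an assumption made for illustration; only the 1MB threshold comes from the text above.

```python
CHUNK_SIZE = 1024 * 1024  # 1MB, the split size mentioned above


def store_small_file(backend, name, content):
    """Store a non-volume file: one object if small, 1MB pieces otherwise."""
    if len(content) <= CHUNK_SIZE:
        backend.put(name, content)  # single key/value pair in the shared bucket
        return
    for index in range(0, len(content), CHUNK_SIZE):
        chunk = content[index:index + CHUNK_SIZE]
        # Hypothetical naming scheme: <file>.part_<n>
        backend.put("%s.part_%d" % (name, index // CHUNK_SIZE), chunk)
```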

Open vStorage - normal

vMotion Step 1

When a Virtual Machine is moved between hosts, vMotioned, vCenter calls the shots. In a first step vCenter will kick off the vMotion process; none of the hosts involved will complain as they believe they are using shared storage. As under normal vMotion behavior, the memory of the Virtual Machine is copied to the destination host while the source VM continues to run (so no interruption for end-users there). Once the memory is almost completely copied, the Virtual Machine is quiesced, the Virtual Machine state is transferred, the missing pieces of the memory are copied and the Virtual Machine is resumed on the destination. As both hosts have access to the VMDK files, there is no special action needed on the storage level for vMotion. But with Open vStorage the volumes of the Virtual Machine are not really shared between the hosts; remember, the Object Router of the source host is the owner of the volumes. Open vStorage must tackle this when read or write requests happen. In case a write happens to the volume of the moved Virtual Machine, the Object Router on the destination host will see that it is not the owner of the volume. The destination Object Router will check in the Object Registry which Object Router owns the volumes and will forward the write requests to that Object Router. The Object Router on the source forwards the write to the Volume Driver on the source as under normal behavior. The same happens for read requests. To summarize, in a first step only the Virtual Machine is moved to the destination while the volumes of the Virtual Machine are still being served by the source Storage Router.

Open vStorage - vMotion 1

vMotion Step 2

After the first step of the vMotion process, the volumes of the Virtual Machine are still owned and served by the Object Router of the source. This is of course a situation which can’t be sustained in case a lot of IO occurs on the volumes of the Virtual Machine. Once an IO threshold is passed, the Object Router of the destination will start negotiating with the Object Router on the source to hand over the volumes. Just as with the memory, the metadata of the volumes gets assembled in the Volume Driver at the destination. Once this process is complete, a point in time is arranged to copy the last metadata. To complete the process, the source Object Router marks the volumes as owned by the destination Object Router and from then on the volumes are served by the destination Object Router.
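Step 2 could be summarized in the following sketch. The IO threshold value and every name in it are assumptions for illustration only.

```python
IO_THRESHOLD = 1000  # illustrative number of forwarded requests, not the real value


def maybe_take_over(destination_router, volume, forwarded_io_count):
    """Hedged sketch of the second vMotion phase."""
    if forwarded_io_count < IO_THRESHOLD:
        return  # keep forwarding IO to the source Object Router for now
    source_router = destination_router.object_registry.lookup_owner(volume)
    # Assemble the volume metadata on the destination while the source keeps
    # serving IO, then copy the last metadata at an agreed point in time.
    destination_router.volume_driver.sync_metadata_from(source_router, volume)
    # Finally the source marks the volume as owned by the destination.
    source_router.mark_owner(volume, destination_router)
```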

Open vStorage - vMotion 2

Summary

vMotion is supported by Open vStorage although a volume can only be written and read by a single host. In a first step vCenter will move the Virtual Machine to the destination host, but the volumes of the Virtual Machine will still be served on the source host. This means that communication between the Object Routers on the 2 hosts is required for all IO traffic to the volumes. In a second phase, after an IO threshold is passed, the Object Routers will negotiate and agree to make the Object Router of the destination the owner of the volumes. Only after this second phase is the whole Virtual Machine, both compute and disks, running on the destination host.

Open vStorage 1.1

The Open vStorage team is on fire. We released a new version of the Open vStorage software (select Test as QualityLevel). The new big features for this release are:

  • Logging and log collecting: we have added logging to Open vStorage. The logs are also centrally gathered and stored in a distributed search engine (Elasticsearch). The logs can be viewed, browsed, searched and analyzed through Kibana, a very nicely designed GUI.
  • HA Support: Open vStorage does with this release not only support vMotion but also the HA functionality of VMware. This means that in case an ESXi Host dies and vCenter starts the VMs on another Host, the volumes will automatically be migrated along.

Some small features that made it into this release:

  • A distributed file system can now be selected as Storage Backend in the GUI. Earlier you could select the file system but you could not extend it to more Grid Storage Routers (GSR). Now you can extend it across more GSRs and do vMotion on top of f.e. GlusterFS.
  • The status of VMs on KVM is now updated quicker in the GUI.
  • Manager.py now has an option to specify the version you want to install (run manager.py -v ‘version number’).
  • Under administration there is now an About Open vStorage page displaying the version of all installed components.

In addition, the team also fixed 35 bugs.