Edge: HA, failure and the moving of volumes explained

Open vStorage is designed to be rock solid and survive failures. These failures can come in many shapes and forms: nodes might die, network connections might get interrupted, … Let’s give an overview of the different tactics Open vStorage uses when disaster strikes by going over some possible use cases where the new edge plays a role.

Use case 1: A hypervisor fails

In case the hypervisor fails, the hypervisor management (OpenStack, vCenter, …) will detect the failure and restart the VM on another hypervisor. Since the VM is started on another hypervisor, it will talk to the edge client on that new hypervisor. The edge client connects to a volume driver in the vPool and asks which volume driver owns the disks of the VM. The volume driver responds with the owner, and the edge connects to the volume driver owning the volume. This all happens almost instantaneously and in the background, so the IO of the VM isn’t affected.
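
To make this flow concrete, here is a minimal, self-contained sketch of the reconnect logic in Python. All names (owners, vpool_drivers, lookup_owner, attach) are hypothetical and only illustrate the idea; the real edge client and volume driver expose a different interface.

```python
# Which volume driver currently owns which volume (normally kept by the vPool).
owners = {"vol-001": "driver-b"}

# Volume drivers serving the vPool, as known by the edge client.
vpool_drivers = ["driver-a", "driver-b", "driver-c"]

def lookup_owner(volume_id):
    # Any volume driver in the vPool can answer the ownership question;
    # here a shared dict stands in for that lookup.
    return owners.get(volume_id)

def attach(volume_id):
    # On restart of the VM, the edge client on the new hypervisor asks a
    # volume driver in the vPool which driver owns the disks of the VM ...
    owner = lookup_owner(volume_id)
    if owner is None:
        raise RuntimeError("no volume driver in the vPool owns %s" % volume_id)
    # ... and connects to that owner, so the IO of the VM is not affected.
    return owner

print(attach("vol-001"))   # -> driver-b
```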

Use case 2: A Storage Router fails

In case a Storage Router, and hence the volume driver on it, dies, the edge client automatically detects that the connection to the volume driver is lost. Luckily the edge keeps a list of volume drivers which also serve the vPool, and it connects to one of the remaining volume drivers in the vPool. The edge prefers to fail over to a volume driver which is close by, e.g. within the same datacenter. The new volume driver to which the edge connects detects that it isn’t the owner of the volume. As the old volume driver is no longer online, the new volume driver steals the ownership of the VM’s volume. Stealing is allowed in this case as the old volume driver is down. Once the new volume driver owns the volume, the edge client can start serving IO again. This whole process happens in the background and halts the IO of the VM for only a fraction of a second.
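
Below is a minimal sketch of this fail-over behaviour. The names (owners, drivers, failover) and the plain dicts standing in for the distributed administration are assumptions for illustration only; the actual volume driver implements this natively.

```python
owners = {"vol-001": "driver-a"}          # volume -> owning volume driver
drivers = {
    "driver-a": {"alive": False, "datacenter": "dc1"},   # failed Storage Router
    "driver-b": {"alive": True,  "datacenter": "dc1"},
    "driver-c": {"alive": True,  "datacenter": "dc2"},
}

def failover(volume_id, edge_datacenter="dc1"):
    # The edge keeps a list of volume drivers serving the vPool and prefers
    # one that is close by, e.g. in the same datacenter as the edge.
    candidates = [name for name, d in drivers.items() if d["alive"]]
    candidates.sort(key=lambda n: drivers[n]["datacenter"] != edge_datacenter)
    new_owner = candidates[0]

    # The new driver notices it does not own the volume; because the old
    # driver is down, it is allowed to steal ownership.
    if not drivers[owners[volume_id]]["alive"]:
        owners[volume_id] = new_owner

    # Once ownership has moved, the edge resumes IO against the new driver.
    return new_owner

print(failover("vol-001"))   # -> driver-b
```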

Use case 3: Network issues

In some exceptional cases it isn’t the hypervisor or the Storage Router that fails but the network in between. This is an administrator’s worst nightmare as it might lead to split-brain scenarios. Even in this case the edge is able to outlive the disaster. As the network connection between the edge and the volume driver is lost, the edge will assume the volume driver is dead. Hence, as in use case 2, the edge connects to another volume driver in the same vPool. The new volume driver first tries to contact the old volume driver.

Now there are 2 options:

  • The new volume driver can contact the old volume driver. After some IO is exchanged the new volume driver asks the old volume driver to hand over the volume. This handover doesn’t impact the edge.
  • The new volume driver cannot contact the old volume driver. In that case the new volume driver steals the volume from the old one. It does this by updating the ownership of the volume in the distributed DB and by uploading a new key to the backend. As the ALBA backend uses a conditional write approach (it only writes the IO to the disks of the backend if the accompanying key is valid), it can ensure that only the new volume driver is allowed to write to the backend. If the old volume driver were still online (split brain) and tried to update the backend, the write would fail as it is using an outdated key. A small sketch of this conditional-write fencing follows this list.
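
The sketch below illustrates the conditional-write fencing with a toy Backend class and a made-up set_key/conditional_write interface; ALBA’s real API differs, but the principle is the same: a write is only accepted when it carries the key currently registered for the volume.

```python
class Backend:
    def __init__(self):
        self.current_key = {}      # volume -> key uploaded by the owning driver
        self.data = {}

    def set_key(self, volume_id, key):
        # Uploaded by the volume driver that (newly) owns the volume.
        self.current_key[volume_id] = key

    def conditional_write(self, volume_id, key, payload):
        # The write only succeeds if the accompanying key is still valid.
        if self.current_key.get(volume_id) != key:
            raise PermissionError("stale key: driver no longer owns %s" % volume_id)
        self.data.setdefault(volume_id, []).append(payload)


backend = Backend()
backend.set_key("vol-001", "key-old")    # old volume driver owned the volume
backend.set_key("vol-001", "key-new")    # new driver steals it and uploads a new key

backend.conditional_write("vol-001", "key-new", b"io from new driver")      # accepted
try:
    backend.conditional_write("vol-001", "key-old", b"io from old driver")  # split brain
except PermissionError as exc:
    print(exc)                            # write rejected, data stays consistent
```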

Domains and Recovery Domains

In the Fargo release we introduced a new concept: Domains. In this blog post you can find a description of what Domains exactly are, why you need them and how you should configure them.

A Domain is a logical grouping of Storage Routers. You can compare a Domain to an availability zone in OpenStack or a region in AWS. A Domain typically groups Storage Routers which can fail for a common reason, e.g. because they are on the same power feed or within the same datacenter.

Open vStorage can survive a node failure without any data loss for the VMs on that node. Even data in the write buffer which isn’t on the backend yet is safeguarded on another node by the Distributed Transaction Log (DTL). The key element in having no data loss is that the node running the volume and the node running the DTL should not be down at the same time. To limit the risk of both being down at the same time, you should make sure the DTL is on a node which is not in the same rack or on the same power feed as the node running the volume. Open vStorage can of course not detect which servers are in the same rack, so it is up to the user to define different Domains and assign Storage Routers to them.
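
A minimal sketch of that placement rule, with a made-up domains mapping and a hypothetical pick_dtl_host helper; in reality the framework performs this selection based on the Domains you configure.

```python
# Which Storage Router belongs to which failure domain (rack, power feed, ...).
domains = {
    "storagerouter-01": "rack-1",
    "storagerouter-02": "rack-1",
    "storagerouter-03": "rack-2",
}

def pick_dtl_host(volume_host):
    # Prefer a Storage Router in a different Domain, so a single failure
    # cannot take out both the volume and its DTL at the same time.
    for host, domain in domains.items():
        if host != volume_host and domain != domains[volume_host]:
            return host
    raise RuntimeError("no Storage Router outside the Domain of %s" % volume_host)

print(pick_dtl_host("storagerouter-01"))   # -> storagerouter-03
```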

As a first step, create the different Domains in the Administration section (Administration > Domains). You are free to select how you want to group the Storage Routers: a few possible examples are per rack, per power feed or even per datacenter. In the example below we have grouped the Storage Routers per datacenter.

domains

Next, go to the detail page of each Storage Router and click the edit button.

storage router

Select the Domain where the actual volumes are hosted, and optionally select a Recovery Domain. In case the Recovery Domain is empty, the DTL will be located in the Domain of the Storage Router. In case a Recovery Domain is selected, it will host the DTL for the volumes being served by that Storage Router. Note that you can only assign a Domain as Recovery Domain if at least one Storage Router is using it as its Domain. To make sure that the latency of the DTL doesn’t become a bottleneck for the write IO, it is strongly advised to have a low-latency network between the Storage Routers in the Domain and those in the Recovery Domain.
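
The resulting DTL placement can be summarised in a few lines, assuming a hypothetical Storage Router configuration dict; the real framework applies this rule automatically.

```python
def dtl_location(storage_router):
    # If a Recovery Domain is configured, the DTL for volumes served by this
    # Storage Router lives there; otherwise it stays in the router's own Domain.
    return storage_router.get("recovery_domain") or storage_router["domain"]

sr = {"name": "storagerouter-01", "domain": "datacenter-1", "recovery_domain": "datacenter-2"}
print(dtl_location(sr))               # -> datacenter-2

sr_plain = {"name": "storagerouter-02", "domain": "datacenter-1", "recovery_domain": None}
print(dtl_location(sr_plain))         # -> datacenter-1
```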

Another area where Domains play a role is the location of the MetaDataServer (MDS). The master and a slave MDS will always be located in the Domain of the Storage Router.
In case you configure a Recovery Domain, an MDS slave will also be located on one of the hosts of the Recovery Domain. This additional slave makes sure that only a limited metadata rebuild is necessary to bring the volume live.
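
A small sketch of this MDS placement rule, with made-up host lists per Domain; the framework handles the placement for you.

```python
hosts_per_domain = {
    "datacenter-1": ["storagerouter-01", "storagerouter-02"],
    "datacenter-2": ["storagerouter-03", "storagerouter-04"],
}

def mds_placement(domain, recovery_domain=None):
    # The master and one slave MDS always live in the Domain of the Storage Router.
    placement = {
        "master": hosts_per_domain[domain][0],
        "slave": hosts_per_domain[domain][1],
    }
    # With a Recovery Domain configured, an extra slave is kept there, so only
    # a limited metadata rebuild is needed to bring the volume live.
    if recovery_domain:
        placement["recovery_slave"] = hosts_per_domain[recovery_domain][0]
    return placement

print(mds_placement("datacenter-1", "datacenter-2"))
```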