Fargo: the updated Open vStorage Architecture

With the Fargo release of Open vStorage we are focussing even more on the Open vStorage sweet spot: multi-petabyte, multi-datacenter storage clusters which offer super-fast block storage.
In order to achieve this we had to significantly change the architecture for the Fargo release. Eugene, the version before Fargo, already had the Shared Memory Server (SHM) in its code base, but it wasn’t activated by default. The Fargo release now primarily uses the SHM approach. To make even more use of it, we created the Open vStorage Edge. The Edge is a lightweight block storage driver which can be installed on Linux servers (hosts running the hypervisor or inside the VM) and talks across the network to the Shared Memory of a remote Volume Driver. Both TCP/IP and the low-latency RDMA protocol can be used to connect the Edge with the Volume Driver. Northbound the Edge has iSCSI, Blktap and QEMU interfaces. Additional interfaces such as iSER and FCoE are planned. Next to the new Edge interface, the slower Virtual Machine interface, which exposes a Virtual File System (NFS, FUSE), is still supported.

Architecture

The Volume Driver has also been optimized for performance. The locks in the write path have been revised in order to minimize their impact. More radical is the decision to remove the deduplication functionality from the Volume Driver in order to keep the size of the metadata of the volumes to a strict minimum. By removing the bytes reserved for the hash, we are capable of keeping all the metadata in RAM and pushing performance beyond 1 million IOPS per host on decent hardware. For those who absolutely need deduplication, a version of the Volume Driver with deduplication support is still available.
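
To get a feel for the numbers behind that decision, here is a rough back-of-the-envelope sketch. The per-entry sizes are assumptions for illustration only, not the Volume Driver’s actual metadata layout:

[code language=”python”]
# Back-of-the-envelope: why dropping the dedupe hash shrinks volume metadata.
# The entry sizes are illustrative assumptions, not the actual layout.

CLUSTER_SIZE = 4 * 1024          # 4 KiB clusters addressed by the metadata
ENTRY_PLAIN = 8                  # assumed bytes per entry without a hash
ENTRY_DEDUPE = ENTRY_PLAIN + 16  # assumed extra bytes for a 128-bit hash

def metadata_gib(provisioned_bytes, entry_size):
    # Metadata needed to map every cluster of the provisioned volumes.
    return (provisioned_bytes // CLUSTER_SIZE) * entry_size / 2**30

provisioned = 10 * 2**40         # 10 TiB of volumes served by one host
print(metadata_gib(provisioned, ENTRY_PLAIN))   # 20.0 GiB: fits in RAM
print(metadata_gib(provisioned, ENTRY_DEDUPE))  # 60.0 GiB: much harder
[/code]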

With the breakthrough of RDMA, the network bottleneck is removed and network latency is brought down to a couple of microseconds. Open vStorage makes use of the possibilities RDMA offers to implement a shared cache layer. To achieve this it is now possible to create an ALBA backend out of NVMe or SSD devices. This layer acts as a local cache layer, within a single datacenter, in front of a SATA ALBA backend, the capacity tier, which is spread across multiple datacenters.
This means all SSDs in a single datacenter form a shared cache for the data of that datacenter. This minimizes the impact of an SSD failure and removes the cold cache effect when moving a volume between hosts. In order to minimize the impact of a single disk failure we introduced the NC-ECC (Network and Clustered Error Correction Codes) algorithm. This algorithm can be compared with solving a Sudoku puzzle. Each SCO, a collection of consecutive writes, is chopped up into chunks. All these chunks are distributed across all the nodes and datacenters in the cluster. The total number of chunks is configurable, allowing you, for example, to recover from a multi-node failure or even the loss of a complete datacenter. A failure, whether it is a disk, a node or a datacenter, will cross out some numbers from the complete Sudoku puzzle, but as long as you have enough numbers left, you can still solve the puzzle. The same goes for data stored with Open vStorage: as long as you have enough chunks (disks, nodes or datacenters) left, you can always recover the data. The NC-ECC algorithm is based on forward error correction codes and is further optimized for usage within a multi-datacenter approach. When there is a disk or node failure, additional chunks will be created using only data from within the same datacenter. This ensures the bandwidth between datacenters isn’t stressed in case of a simple disk failure.
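
To make the Sudoku analogy concrete, here is a small sketch. The policy numbers (16 data chunks plus 8 parity chunks spread over 3 datacenters) are illustrative assumptions, not a recommended preset:

[code language=”python”]
# Sketch: how chunk placement determines what NC-ECC can survive.
# The policy numbers below are illustrative assumptions, not preset advice.

k, m = 16, 8                     # 16 data chunks + 8 parity chunks per SCO
datacenters = 3
per_dc = (k + m) // datacenters  # 8 chunks land in each datacenter

def survives(lost_chunks):
    # Forward error correction: any k of the k+m chunks rebuild the SCO.
    return (k + m) - lost_chunks >= k

print(survives(per_dc))  # True: losing a whole datacenter (8 chunks) is fine
print(survives(m + 1))   # False: losing more chunks than parity is fatal
[/code]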

By splitting up the Edge, the Volume Driver, the cache layer and the capacity tier, you have the ultimate flexibility to build the storage cluster that fits your needs. You can run everything on the same server, hyperconverged, or you can install each component on a dedicated server to maximize scalability and performance.

The first alpha version of Fargo is now available on the repo.

Domains and Recovery Domains

In the Fargo release we introduced a new concept: Domains. In this blog post you can find a description of what Domains exactly are and why and how you should configure them.

A Domain is a logical grouping of Storage Routers. You can compare a Domain to an availability zone in OpenStack or a region in AWS. A Domain typically groups Storage Routers which can fail for a common reason, e.g. because they are on the same power feed or within the same datacenter.

Open vStorage can survive a node failure without any data loss for the VMs on that node. Even data in the write buffer which isn’t on the backend yet is safeguarded on another node by the Distributed Transaction Log (DTL). The key element in having no data loss is that the node running the volume and the node running the DTL should not be down at the same time. To limit the risk of both being down at the same time, you should make sure the DTL is on a node which is not in the same rack or on the same power feed. Open vStorage can of course not detect which servers are in the same rack, so it is up to the user to define different Domains and assign Storage Routers to them.
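
The placement rule that Domains make possible can be sketched in a few lines of Python. The data structures below are hypothetical and only illustrate the constraint; they are not the actual framework code:

[code language=”python”]
# Minimal sketch of the DTL placement rule Domains enforce: never co-locate
# the DTL with the volume's own failure domain. Hypothetical data structures.

storage_routers = {
    "sr-01": "datacenter-a",
    "sr-02": "datacenter-a",
    "sr-03": "datacenter-b",
}

def dtl_candidates(volume_host):
    # Storage Routers eligible to host the DTL for a volume on volume_host.
    own_domain = storage_routers[volume_host]
    return [sr for sr, domain in storage_routers.items() if domain != own_domain]

print(dtl_candidates("sr-01"))  # ['sr-03']: only the other failure domain
[/code]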

As a first step, create the different Domains in the Administration section (Administration > Domains). You are free to select how you want to group the Storage Routers: a few possible examples are per rack, per power feed or even per datacenter. In the example below we have grouped the Storage Routers per datacenter.

domains

Next, go to the detail page of each Storage Router and click the edit button.

storage router

Select the Domain where the actual volumes are hosted, and optionally select a Recovery Domain. In case the Recovery Domain is empty, the DTL will be located in the Domain of the Storage Router. In case a Recovery Domain is selected, it will host the DTL for volumes being served by that Storage Router. Note that you can only assign a Domain as Recovery Domain if at least one Storage Router uses it as its Domain. To make sure that the latency of the DTL doesn’t become a bottleneck for write IO, it is strongly advised to have a low-latency network between the Storage Routers in the Domain and the Recovery Domain.

Another area where Domains play a role is the location of the MetaDataServer (MDS). The master and a slave MDS will always be located in the Domain of the Storage Router.
In case you configure a Recovery Domain, an MDS slave will also be located on one of the hosts of the Recovery Domain. This additional slave ensures that only a limited metadata rebuild is necessary to bring the volume live in case of a failover.

Performance Tuning

At Open vStorage we now have various large clusters which can easily deliver multiple millions of IOPS. For some customers it is even a prestige project to produce the highest number of IOPS on their Open vStorage dashboard. Out of the box Open vStorage will already give you very decent performance, but there are a few nuts and bolts you can tweak to increase the performance of your environment. There is no golden rule to increase the performance, but below we share some tips and pointers:

vDisk Settings
The most obvious way to influence the IO performance of a vDisk is by selecting the appropriate settings in the vDisk detail page. The impact of the DTL setting was already covered in a previous blog post so we will skip it here.
Deduplication also has an impact on the write IO performance of the vDisk. If you know the data isn’t suited for deduplication, don’t turn it on. As we have large read caches, we only enable the dedupe feature for OS disks.
Another setting we typically set at the vPool level is the SCO size. To increase the write performance you typically want to select a large Storage Container Object (SCO) size so as to minimize the overhead of creating and closing SCOs. Also, backends are typically very good at writing large chunks of sequential data, so a big SCO size makes sense. But as usual there is a trade-off. With traditional backends like Swift, Ceph or any other object store for that matter, you need to retrieve the whole SCO from the backend in case of a cache miss. A bigger SCO in that case means more read latency on a cache miss. This is one of the reasons why we designed our own backend, ALBA. With ALBA you can retrieve a part of a SCO from the backend: instead of getting a 64 MiB SCO, we can get the exact 4k we need from it. ALBA is the only object storage that currently supports this functionality. In large clusters with ALBA as backend we typically set 64 MiB as SCO size. In case you don’t use ALBA, use a lower SCO size.
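
A quick illustration of that trade-off for a single 4 KiB read that misses the cache, using the 64 MiB SCO size mentioned above:

[code language=”python”]
# Read amplification for a single 4 KiB read that misses the cache.
SCO_SIZE = 64 * 2**20    # 64 MiB SCO
MISS_READ = 4 * 1024     # the 4 KiB the application actually wants

whole_sco = SCO_SIZE     # classic object store: fetch the complete SCO
partial = MISS_READ      # ALBA partial read: fetch only the needed range
print(whole_sco // partial)  # 16384x more data moved for the same miss
[/code]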

Optimize the Backends
One of the less obvious items which can make a huge difference in performance is the right preset. A preset consists of a set of policies, a compression method (optional) and whether encryption should be activated.
You might ask why tuning the backend influences the performance on the front-end towards the VM. The performance of the backend will for example influence the read performance in case of a cache miss. On writes too, the backend might become the bottleneck for incoming data. All writes go into the write buffer, which is typically sized to contain a couple of SCOs. This is fine as long as your backend is fast enough: once a SCO is full, it is saved on the backend and removed from the write buffer, making room for newly written data. In case the backend is too slow to manage what comes out of the write buffer, Open vStorage will start throttling the ingest of data on the frontend. So it is essential to have a look at your backend performance in case it is the bottleneck for write performance.
Since we typically set the SCO size to 64 MiB and think a fragment size of 4 MiB is a good size for fragments, we change the policy to have 16 data fragments, as computed below. The other parameters depend on the required reliability and the number of hosts used for storage.
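
The arithmetic behind that choice is simple:

[code language=”python”]
# Fragment size follows from the SCO size and the number of data fragments.
sco_size_mib = 64
data_fragments = 16
print(sco_size_mib / data_fragments)  # 4.0 MiB per fragment
[/code]
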
Compression is typically turned on but gives distorted results when running your typical random IO benchmark as random data is hard to compress. Data which is hard to compress will even be bigger in size and hence take more time to store. Basically, if you are running benchmarks with random IO it is best to turn compression off.

In case you need help in tweaking the performance of your environment, feel free to contact us.

The Game of Distributed Systems Programming. Which Level Are You?

(originally published on the incubaid.com blog, 2012/03/28)

Introduction

When programming distributed systems becomes part of your life, you go through a learning curve. This article tries to describe my current level of understanding of the field, and hopefully points out enough mistakes for you to be able to follow the optimal path to enlightenment: learning from the mistakes of others.
For the record: I entered Level 1 in 1995, and I’m currently Level 3. Where do you see yourself?

Level 0: Clueless

Every programmer starts here. I will not comment too much on this level as there isn’t a lot to say. Instead, I quote some conversations I had, and offer some words of advice to developers who never battled distributed systems.

NN1: “Replication in distributed systems is easy, you just let all the machines store the item at the same time.”

Another conversation (from the back of my memory):

NN: “For our first person shooter, we’re going to write our own networking engine”
ME: “Why?”
NN: “There are good commercial engines, but license costs are expensive and we don’t want to pay these.”
ME: “Do you have any experience in distributed systems?”
NN: “Yes, I’ve written a socket server before.”
ME: “How long do you think you will take to write it?”
NN: “I think 2 weeks. Just to be really safe we planned 4.”

Sometimes it’s better to remain silent.

Level 1: RPC

RMI is a very powerful technique for building large systems. The fact that the technique can be described, along with a working example, in just a few pages, speaks volumes of Java. RMI is tremendously exciting and it’s simple to use. You can call to any server you can bind to, and you can build networks of distributed objects. RMI opens the door to software systems that were formerly too complex to build.

Peter van der Linden, Just Java (4th edition, Sun Microsystems)

Let me start by saying I’m not dissing this book. I remember distinctly it was fun to read (especially the anecdotes between the chapters), and I used it for the Java lessons I used to give (in a different universe, a long time ago). In general, I think well of it. His attitude towards RMI, however, is typical of Level 1 distributed application design. People that reside here share the vision of unified objects. In fact, Waldo et al. describe it in detail in their landmark paper “A Note on Distributed Computing” (1994), but I will summarize here:
The advocated strategy to writing distributed applications is a three phase approach. The first phase is to write the application without worrying about where objects are located and how their communication is implemented. The second phase is to tune performance by “concretizing” object locations and communication methods. The final phase is to test with “real bullets” (partitioned networks, machines going down, …).

The idea is that whether a call is local or remote has no impact on the correctness of a program.

The same paper then dissects this further and shows the problems with it. It has thus been known for almost 20 years that this concept is wrong. Anyway, if Java RMI achieved one thing, it’s this: even if you remove transport protocol, naming and binding, and serialization from the equation, it still doesn’t work. People old enough to remember the hell called CORBA will also remember it didn’t work, but they have an excuse: they were still battling all kinds of lower-level problems. Java RMI took all of these away and made the remaining issues stick out. There are two of them. The first is a mere annoyance:

Network Transparency isn’t

Let’s take a look at a simple Java RMI example (taken from the same ‘Just Java’)

[code language=”java”]
public interface WeatherIntf extends java.rmi.Remote {
    public String getWeather() throws java.rmi.RemoteException;
}

[/code]

A client that wants to use the weather service needs to do something like this:

[code language=”java”]
try {
    Remote robj = Naming.lookup("//localhost/WeatherServer");
    WeatherIntf weatherserver = (WeatherIntf) robj;
    String forecast = weatherserver.getWeather();
    System.out.println("The weather will be " + forecast);
} catch (Exception e) {
    System.out.println(e.getMessage());
}
[/code]

The client code needs to take RemoteExceptions into account.
If you want to see what kinds of remote failure you can encounter, take a look at its more than 20 subclasses. Ok, so your code will be a tad less pretty. We can live with that.

Partial Failure

The real problem with RMI is that the call can fail partially. It can fail before the action on the other tier is invoked, or the invocation might succeed but the return value might not make it afterwards, for whatever reason. These failure modes are in fact the very defining property of distributed systems or otherwise stated:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
(Leslie Lamport)

If the method is just the retrieval of a weather forecast, you can simply retry, but if you were trying to increment a counter, retrying can have results ranging from 0 to 2 updates. The solution is supposed to come from idempotent actions, but building those isn’t always possible. Moreover, since you decided on a semantic change of your method call, you basically admit that RMI is different from a local invocation. This amounts to admitting that RMI is a fallacy.
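
A toy simulation makes the point. This is obviously not RMI itself, just a sketch of why a blind retry of a non-idempotent call leaves the client clueless about the final state:

[code language=”python”]
# Toy simulation, not RMI: the request or the reply can be lost
# independently, so one blind retry yields 0, 1 or 2 server-side updates.
import random

counter = 0  # server-side state

def remote_increment():
    global counter
    if random.random() < 0.2:
        raise ConnectionError("request lost before the server saw it")
    counter += 1  # the server-side effect has happened...
    if random.random() < 0.2:
        raise ConnectionError("...but the reply never reached the client")

def naive_client():
    for _ in range(2):  # the original call plus one blind retry
        try:
            remote_increment()
            return
        except ConnectionError:
            pass

naive_client()
print(counter)  # 0, 1 or 2, and the client cannot know which
[/code]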

In any case the paradigm is a failure as both network transparency and architectural abstraction from distribution just never materialise. It also turns out that some software methodologies are more affected than others. Some variations of scrum tend to prototype. Prototypes concentrate on the happy path and the happy path is not the problem. It basically means you will never escape Level 1. (sorry, this was a low blow. I know)

People who do escape Level 1 understand they need to address the problem with the respect it deserves. They abandon the idea of network transparency, and attack the handling of partial failure strategically.

Level 2: Distributed Algorithms + Asynchronous messaging + Language support

<sarcasm>”Just What We Need: Another RPC Package” </sarcasm>
(Steve Vinoski)

Ok, you’ve learned the fallacies of distributed computing. You decided to bite the bullet, and model the message passing explicitly to get a control of failure.
You split your application into 2 layers, the bottom being responsible for networking and message transport, while the upper layer deals with the arrival of messages, and what needs to be done when they do.
The upper layer implements a distributed state machine, and if you ask the designers what it does, they will tell you something like: “It’s a multi-paxos implementation on top of TCP”.
Development-wise, the strategy boils down to this: programmers first develop the application centrally using threads to simulate the different processes. Each thread runs a part of the distributed state machine, and basically is responsible for running a message handling loop. Once the application is locally complete and correct, the threads are taken away to become real processes on remote computers. At this stage, in the absence of network problems, the distributed application is already working correctly. In a second phase, fault tolerance can be straightforwardly achieved by configuring each of the distributed entities to react correctly to failures (I liberally quoted from “A Fault Tolerant Abstraction for Transparent Distributed Programming”).
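
A minimal sketch of that first phase, with threads and in-memory queues standing in for processes and the network (swapping the queues for real sockets is the later phase):

[code language=”python”]
# Sketch of the Level 2 strategy: develop the distributed state machine with
# threads and in-memory queues first; swap the queues for sockets later.
import queue
import threading

inboxes = {name: queue.Queue() for name in ("node_a", "node_b")}

def send(to, msg):
    inboxes[to].put(msg)  # phase two: replace this with a real network send

def node(name, peer):
    # One participant in the distributed state machine: a message loop.
    while True:
        msg = inboxes[name].get()
        if msg == "stop":
            break
        print(f"{name} received {msg!r}")
        if msg == "ping":
            send(peer, "pong")
            send(peer, "stop")

threads = [threading.Thread(target=node, args=(n, p))
           for n, p in (("node_a", "node_b"), ("node_b", "node_a"))]
for t in threads:
    t.start()
send("node_a", "ping")
send("node_a", "stop")
for t in threads:
    t.join()
[/code]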

Partial failure is handled by design, because of the distributed state machine. With regards to threads, there are a lot of options, but you prefer coroutines (they are called fibers, lightweight threads, microthreads, protothreads or just threads in various programming languages, causing a Babylonian confusion) as they allow for fine-grained concurrency control.

Combined with the insight that “C ain’t gonna make my network any faster”, you move to programming languages that support this kind of fine grained concurrency.
There are several popular choices; note how they tend to be functional in nature.

As an example, let’s see what such code looks like in Erlang (taken from Erlang concurrent programming)

[code language=”erlang”]
-module(tut15).

-export([start/0, ping/2, pong/0]).

ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("ping finished~n", []);

ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

pong() ->
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

start() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).
[/code]

This definitely looks like a major improvement over plain old RPC. You can start reasoning over what would happen if a message doesn’t arrive.
Erlang gets bonus points for having timeout messages and a built-in after Timeout construct that lets you model and react to timeouts in an elegant manner.
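
For those who don’t read Erlang, here is an analogue (not a translation) in Python’s asyncio: a timeout is just another event the message loop can react to:

[code language=”python”]
# Analogue of Erlang's `receive ... after Timeout`: waiting on a mailbox
# with a deadline, so a timeout becomes an event like any other message.
import asyncio

async def ponger(inbox: asyncio.Queue):
    while True:
        try:
            msg = await asyncio.wait_for(inbox.get(), timeout=1.0)
        except asyncio.TimeoutError:
            print("no message within a second, reacting to the timeout")
            return
        print("received", msg)

async def main():
    inbox = asyncio.Queue()
    task = asyncio.create_task(ponger(inbox))
    await inbox.put("ping")
    await task

asyncio.run(main())
[/code]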

So, you picked your strategy, your distributed algorithm, your programming language and start the work. You’re confident you will slay this monster once and for all, as you ain’t no Level 1 wuss anymore.

Alas, somewhere down the road, some time after your first releases, you enter troubled waters. People tell you your distributed application has issues. The reports are all variations on a theme. They start with a frequency indicator like “sometimes” or “once”, and then describe a situation where the system is stuck in an undesirable state. If you’re lucky, you had adequate logging in place and start inspecting the logs. A little later, you discover an unfortunate sequence of events that produced the reported situation. Indeed, it was a new case. You never took this into consideration, and it never appeared during the extensive testing and simulation you did. So you change the code to take this case into account too.

Since you try to think ahead, you decide to build a monkey that pseudo randomly lets your distributed system do silly things. The monkey rattles its cage and quickly you discover a multitude of scenarios that all lead to undesirable situations like being stuck (never reaching consensus) or even worse: reaching an inconsistent state that should never occur.

Having a monkey was a great idea, and it certainly reduces the chance of encountering something you’ve never seen before in the field. Since you believe that a bugfix goes hand in hand with a testcase that first produced the bug, and now proves its demise, you set out to build just that test. Your problem however is that reproducing the failure scenario is difficult, if not impossible. You listen to the gods as they hinted: when in doubt, use brute force. So you produce a test that runs a zillion times to compensate for the small probability of the failure. This makes your bug fixing process slow and your test suites bulky. You compensate again by doing divide and conquer on your volume of testsets. Anyway, after a heavy investment of effort and time, you somehow manage to get a rather stable system and ditto process.

You’re maxed out on Level 2. Without new insights, you’ll be stuck here forever.

Level 3: Distributed Algorithms + Asynchronous messaging + Purity

It takes a while to realise that a combination of long running monkeys to discover evil scenarios and brute force to reproduce them ain’t making it. Using brute force just demonstrates ignorance. One of the key insights you need is that if you could only remove indeterminism from the equation, you would have perfect reproducibility of every scenario. A major side effect of Level 2 distributed programming is that your concurrency model tends to go viral on your codebase. You desired fine grained concurrency control… well you got it. It’s everywhere. So concurrency causes indeterminism and indeterminism causes trouble. So concurrency must go. You can’t abandon it: you need it. You just have to ban it from mingling with your distributed state machine. In other words, your distributed state machine has to become a pure function. No IO, No Concurrency, no nothing. Your state machine signature will look something like this

[code language=”fsharp”]
module type SM = sig
  type state
  type action
  type msg
  val step: msg -> state -> action * state
end
[/code]

You pass in a message and a state, and you get an action and a resulting state. An action is basically anything that tries to change the outside world, needs time to do so, and might fail while trying. Typical actions are

  • send a message
  • schedule a timeout
  • store something in persistent storage

The important thing to realise here is that you can only get to a new state via a new message, nothing else. The benefits of such a strict regime are legion. Perfect control, perfect reproducibility and perfect traceability. The costs are there too. You’re forced to reify all your actions, which basically is an extra level of indirection to reduce your complexity. You also have to model every change of the outside world that needs your attention into a message.
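
As an illustration, here is a minimal sketch of what such a machine can look like outside OCaml; the state, message and action types are made up for the example:

[code language=”python”]
# A rendering of the pure state machine signature above: step takes a message
# and a state and returns (actions, new_state); all effects are reified as
# action values that a separate, impure driver executes.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    count: int

# Reified actions: descriptions of effects, not the effects themselves.
@dataclass(frozen=True)
class Send:
    to: str
    payload: object

@dataclass(frozen=True)
class SetTimeout:
    seconds: float

def step(msg, state):
    # Pure: no IO, no clocks, no randomness. Same inputs, same outputs.
    if msg == "tick":
        return [SetTimeout(1.0)], replace(state, count=state.count + 1)
    if msg == "query":
        return [Send(to="client", payload=state.count)], state
    return [], state

# The driver feeds in messages and executes the returned actions; replaying
# the same message log reproduces every run exactly.
state = State(count=0)
for msg in ["tick", "tick", "query"]:
    actions, state = step(msg, state)
    print(msg, "->", actions)
[/code]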

Another change from Level 2 is the change in control flow. At Level 2, a client will try to force an update and set the machinery in motion. Here, the distributed state machine assumes full control, and will only consider a client’s request when it is ready and able to do something useful with it. So these must be detached.

If you explain this to a Level 2 architect, (s)he will more or less accept this as an alternative. It, however, takes a sufficient amount of pain (let’s call it experience or XP) to realize it’s the only feasible alternative.

Level 4: Solid domination of distributed systems: happiness, peace of mind and a good night’s rest

To be honest, as I’m a mere Level 3 myself, I don’t know what’s up here. I am convinced that both functional programming and asynchronous message passing are parts of the puzzle, but it’s not enough.
Allow me to reiterate what I’m struggling against. First, I want my distributed algorithm implementation to fully cover all possible cases.
This is a big deal to me as I’ve lost lots of sleep being called in on issues in deployed systems (most of these turned out to be PEBKAC, but some were genuine, and caused frustration). It would be great to know your implementation is robust. Should I try theorem provers, should I do exhaustive testing? I don’t know.
As an aside, for an append-only btree-ish library called baardskeerder, we know we covered all cases by exhaustively generating insert/delete permutations and asserting their correctness. Here, it’s not that simple, and I’m a bit hesitant to Coqify the codebase.
Second, for reasons of clarity and simplicity, I decided not to touch other, orthogonal requirements like service discovery, authentication, authorization, privacy and performance.
With regard to performance, we might be lucky as the asynchronous message passing at least doesn’t seem to contradict performance considerations.
Security however is a real bitch as it crosscuts almost everything else you do. Some people think security is a sauce that you can pour over your application to make it secure.
Alas, I never succeeded in this, and currently think it also needs to be addressed strategically during the very first stages of design.

Closing words

Developing robust distributed systems is a difficult problem that is practically unsolved, or at least not solved to my satisfaction.
I’m sure its importance will increase significantly as the latency between processors and everything else increases too. This results in an ever-growing area of application for this type of application development.

As far as Level 4 goes, maybe I should ask Peter Van Roy. Over the years, I’ve read a lot of his papers, and they offered me a lot of insight in my own mistakes. The downside of insight is that you see others repeating your mistakes and most of the time, I fail to convince people they should do it differently.
Probably, this is because I cannot offer the panacea they want. They want RPC and they want it to work. It’s perverse… almost religious.