Performance Tuning

At Open vStorage we now have various large clusters which can easily deliver multiple millions of IOPS. For some customers it is even a prestige project to produce the highest number of IOPS on their Open vStorage dashboard. Out of the box Open vStorage will already give you very decent performance, but there are a few nuts and bolts you can tweak to increase the performance of your environment. There is no golden rule for increasing performance, but below we share some tips and pointers:

vDisk Settings
The most obvious way to influence the IO performance of a vDisk is by selecting the appropriate settings in the vDisk detail page. The impact of the DTL setting was already covered in a previous blog post so we will skip it here.
Deduplication also has an impact on the write IO performance of the vDisk. In case you know the data isn't suited for deduplication, don't turn it on. As we have large read caches, we only enable the dedupe feature for OS disks.
Another setting we typically set at the vPool level is the SCO size. To increase the write performance you typically want to select a large Storage Container Object (SCO) size so as to minimize the overhead of creating and closing a SCO. Also, backends are typically very good at writing large chunks of sequential data, so a big SCO size makes sense. But as usual there is a trade-off. With traditional backends like Swift, Ceph or any other object store for that matter, you need to retrieve the whole SCO from the backend in case of a cache miss. A bigger SCO in that case means more read latency on a cache miss. This is one of the reasons why we designed our own backend, ALBA. With ALBA you can retrieve a part of a SCO from the backend. Instead of getting a 64MiB SCO, we can get the exact 4k we need from the SCO. ALBA is the only object storage that currently supports this functionality. In large clusters with ALBA as backend we typically set 64 MiB as SCO size. In case you don't use ALBA, use a lower SCO size.
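To get a feel for the trade-off, here is a small back-of-the-envelope sketch in Python (illustrative numbers only, not measurements): on a 4k cache miss, a backend that must fetch the whole SCO reads thousands of times more data than one that supports partial reads.

```python
# Read amplification on a cache miss: bytes fetched from the backend per
# byte the application actually asked for. "partial_reads" models an
# ALBA-style backend that can fetch just the needed range of a SCO.

KIB = 1024
MIB = 1024 * KIB

def read_amplification(sco_size, request_size, partial_reads):
    """Bytes fetched from the backend divided by bytes requested."""
    fetched = request_size if partial_reads else sco_size
    return fetched / request_size

# Whole-SCO fetch: a 4 KiB miss pulls in the full 64 MiB SCO.
print(read_amplification(64 * MIB, 4 * KIB, partial_reads=False))  # 16384.0
# Partial read: only the 4 KiB that was asked for.
print(read_amplification(64 * MIB, 4 * KIB, partial_reads=True))   # 1.0
```

This is why a smaller SCO size is the safer choice on backends without partial reads: it directly caps the amplification factor on every miss.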

Optimize the Backends
One of the less obvious items which can make a huge difference in performance is the right preset. A preset consists of a set of policies, a compression method (optional) and whether encryption should be activated.
You might ask why tuning the backend influences the performance at the front-end, towards the VM. The performance of the backend will for example influence the read performance in case of a cache miss. On writes, the backend might also become the bottleneck for incoming data. All writes go into the write buffer, which is typically sized to contain a couple of SCOs. This is fine if your backend is fast enough: once a SCO is full, it is ready to be saved to the backend and removed from the write buffer, making room for newly written data. In case the backend is too slow to keep up with what comes out of the write buffer, Open vStorage will start throttling the ingest of data at the frontend. So it is essential to have a look at your backend performance in case it is the bottleneck for write performance.
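A rough sketch of that throttling condition (hypothetical buffer sizes and rates, purely illustrative):

```python
# How long a write buffer can absorb a burst before the backend becomes
# the bottleneck and the frontend must be throttled. All numbers below
# are made up for illustration.

def seconds_until_throttle(buffer_mib, ingest_mib_s, backend_mib_s):
    """Time until the buffer fills when ingest outpaces the backend."""
    if ingest_mib_s <= backend_mib_s:
        return float("inf")  # backend keeps up; no throttling ever
    return buffer_mib / (ingest_mib_s - backend_mib_s)

# A 4-SCO buffer (4 x 64 MiB) with 500 MiB/s ingest and a 300 MiB/s backend:
print(seconds_until_throttle(4 * 64, 500, 300))  # 1.28 seconds of headroom
```

The takeaway: a bigger buffer only delays throttling; sustained write performance is ultimately bounded by the backend's drain rate.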
Since we typically set the SCO size to 64 MiB and think 4 MiB is a good size for fragments, we change the policy to have 16 data fragments. The other parameters depend on the required reliability and the number of hosts used for storage.
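The arithmetic behind that choice is simply the SCO size divided by the fragment size:

```python
# Number of data fragments per SCO, given the SCO size and the desired
# fragment size (both in MiB here for simplicity).

def data_fragments(sco_size_mib, fragment_size_mib):
    return sco_size_mib // fragment_size_mib

print(data_fragments(64, 4))  # 16
```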
Compression is typically turned on but gives distorted results when running a typical random IO benchmark, as random data is hard to compress. Data which is hard to compress will even grow in size and hence take more time to store. Basically, if you are running benchmarks with random IO it is best to turn compression off.
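You can see this effect with any general-purpose compressor; a quick sketch with Python's zlib (zlib standing in for whatever compression method the preset uses):

```python
# Random data does not compress; it even grows slightly because of the
# compressor's framing overhead. Repetitive data shrinks dramatically.
import os
import zlib

def compressed_ratio(data):
    """Size of the compressed output relative to the input."""
    return len(zlib.compress(data)) / len(data)

random_block = os.urandom(1 << 20)      # what a random-IO benchmark writes
repetitive_block = b"\x00" * (1 << 20)  # easily compressible data

print(compressed_ratio(random_block) > 1.0)       # True: random data grows
print(compressed_ratio(repetitive_block) < 0.01)  # True: zeros collapse
```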

In case you need help in tweaking the performance of your environment, feel free to contact us.

The Game of Distributed Systems Programming. Which Level Are You?

(originally published on the blog, 2012/03/28)


When programming distributed systems becomes part of your life, you go through a learning curve. This article tries to describe my current level of understanding of the field, and hopefully points out enough mistakes for you to be able to follow the most optimal path to enlightenment: learning from the mistakes of others.
For the record: I entered Level 1 in 1995, and I’m currently Level 3. Where do you see yourself?

Level 0: Clueless

Every programmer starts here. I will not comment too much, as there isn't a lot to say. Instead, I quote some conversations I had, and offer some words of advice to developers who have never battled distributed systems.

NN1: "replication in distributed systems is easy, you just let all the machines store the item at the same time"

Another conversation (from the back of my memory):

NN: “For our first person shooter, we’re going to write our own networking engine”
ME: “Why?”
NN: “There are good commercial engines, but license costs are expensive and we don’t want to pay these.”
ME: “Do you have any experience in distributed systems?”
NN: “Yes, I’ve written a socket server before.”
ME: “How long do you think you will take to write it?”
NN: “I think 2 weeks. Just to be really safe we planned 4.”

Sometimes it’s better to remain silent.

Level 1: RPC

RMI is a very powerful technique for building large systems. The fact that the technique can be described, along with a working example, in just a few pages, speaks volumes of Java. RMI is tremendously exciting and it’s simple to use. You can call to any server you can bind to, and you can build networks of distributed objects. RMI opens the door to software systems that were formerly too complex to build.

Peter van der Linden, Just Java (4th edition, Sun Microsystems)

Let me start by saying I'm not dissing this book. I remember distinctly it was fun to read (especially the anecdotes between the chapters), and I used it for the Java lessons I used to give (in a different universe, a long time ago). In general, I think well of it. His attitude towards RMI, however, is typical of Level 1 distributed application design. People that reside here share the vision of unified objects. In fact, Waldo et al. describe it in detail in their landmark paper "A Note on Distributed Computing" (1994), but I will summarize here:
The advocated strategy to writing distributed applications is a three phase approach. The first phase is to write the application without worrying about where objects are located and how their communication is implemented. The second phase is to tune performance by “concretizing” object locations and communication methods. The final phase is to test with “real bullets” (partitioned networks, machines going down, …).

The idea is that whether a call is local or remote has no impact on the correctness of a program.

The same paper then dissects this further and shows the problems with it. It has thus been known for almost 20 years that this concept is wrong. Anyway, if Java RMI achieved one thing, it's this: even if you remove transport protocol, naming and binding, and serialization from the equation, it still doesn't work. People old enough to remember the hell called CORBA will also remember it didn't work, but they have an excuse: they were still battling all kinds of lower level problems. Java RMI took all of these away and made the remaining issues stick out. There are two of them. The first is a mere annoyance:

Network Transparency isn’t

Let’s take a look at a simple Java RMI example (taken from the same ‘Just Java’)

[code language=”java”]
public interface WeatherIntf extends java.rmi.Remote{
    public String getWeather() throws java.rmi.RemoteException;
}
[/code]


A client that wants to use the weather service needs to do something like this:

[code language=”java”]
try{
    Remote robj = Naming.lookup("//localhost/WeatherServer");
    WeatherIntf weatherserver = (WeatherIntf) robj;
    String forecast = weatherserver.getWeather();
    System.out.println("The weather will be " + forecast);
}catch(Exception e){
    System.out.println(e.getMessage());
}
[/code]

The client code needs to take RemoteExceptions into account.
If you want to see what kinds of remote failure you can encounter, take a look at RemoteException's more than 20 subclasses. OK, so your code will be a tad less pretty. We can live with that.

Partial Failure

The real problem with RMI is that the call can fail partially. It can fail before the action on the other tier is invoked, or the invocation might succeed but the return value might not make it afterwards, for whatever reason. These failure modes are in fact the very defining property of distributed systems or otherwise stated:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
(Leslie Lamport)

If the method is just the retrieval of a weather forecast, you can simply retry, but if you were trying to increment a counter, retrying can have results ranging from 0 to 2 updates. The solution is supposed to come from idempotent actions, but building those isn’t always possible. Moreover, since you decided on a semantic change of your method call, you basically admit RMI is different from a local invocation. This is an admission of RMI being a fallacy.
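The counter example above can be sketched in a few lines of Python (a toy model, not RMI: the "lost reply" is simulated with an exception raised after the server-side update has already been applied):

```python
# Toy model of partial failure: the reply is lost AFTER the server has
# applied the update, so the caller cannot tell whether the call took effect.
import random

class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        """Non-idempotent: every retry that reaches the server counts again."""
        self.value += 1
        if random.random() < 0.5:          # reply lost after the update
            raise ConnectionError("reply lost")

    def set_value(self, v):
        """Idempotent: retrying is harmless."""
        self.value = v
        if random.random() < 0.5:
            raise ConnectionError("reply lost")

def call_with_retry(fn, *args, attempts=5):
    for _ in range(attempts):
        try:
            fn(*args)
            return
        except ConnectionError:
            continue                       # blind retry

random.seed(1)                             # fixed seed for a reproducible run
c = Counter()
call_with_retry(c.increment)
print(c.value)   # somewhere between 1 and 5: every retry incremented again

c2 = Counter()
call_with_retry(c2.set_value, 42)
print(c2.value)  # 42, no matter how many retries it took
```

Exactly as argued above: the idempotent call survives blind retry, the non-idempotent one silently over-counts.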

In any case the paradigm is a failure as both network transparency and architectural abstraction from distribution just never materialise. It also turns out that some software methodologies are more affected than others. Some variations of scrum tend to prototype. Prototypes concentrate on the happy path and the happy path is not the problem. It basically means you will never escape Level 1. (sorry, this was a low blow. I know)

People who do escape Level 1 understand they need to address the problem with the respect it deserves. They abandon the idea of network transparency, and attack the handling of partial failure strategically.

Level 2: Distributed Algorithms + Asynchronous messaging + Language support

<sarcasm>”Just What We Need: Another RPC Package” </sarcasm>
(Steve Vinoski)

Ok, you’ve learned the fallacies of distributed computing. You decided to bite the bullet, and model the message passing explicitly to get a control of failure.
You split your application into 2 layers, the bottom being responsible for networking and message transport, while the upper layer deals with the arrival of messages, and what needs to be done when they do.
The upper layer implements a distributed state machine, and if you ask the designers what it does, they will tell you something like : “It’s a multi-paxos implementation on top of TCP”.
Development-wise, the strategy boils down to this: programmers first develop the application centrally, using threads to simulate the different processes. Each thread runs a part of the distributed state machine, and basically is responsible for running a message handling loop. Once the application is locally complete and correct, the threads are taken away to become real processes on remote computers. At this stage, in the absence of network problems, the distributed application is already working correctly. In a second phase, fault tolerance can be straightforwardly achieved by configuring each of the distributed entities to react correctly to failures (I liberally quoted from "A Fault Tolerant Abstraction for Transparent Distributed Programming").

Partial failure is handled by design, because of the distributed state machine. With regards to threads, there are a lot of options, but you prefer coroutines (they are called fibers, lightweight threads, microthreads, protothreads or just threads in various programming languages, causing Babylonian confusion) as they allow for fine-grained concurrency control.

Combined with the insight that “C ain’t gonna make my network any faster”, you move to programming languages that support this kind of fine grained concurrency.
Popular choices are (in arbitrary order) languages with first-class support for this kind of lightweight concurrency. (Note how they tend to be functional in nature.)

As an example, let’s see what such code looks like in Erlang (taken from Erlang concurrent programming)

[code language=”erlang”]

-module(tut15).

-export([start/0, ping/2, pong/0]).

ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("ping finished~n", []);

ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

pong() ->
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

start() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).
[/code]

This definitely looks like a major improvement over plain old RPC. You can start reasoning over what would happen if a message doesn’t arrive.
Erlang gets bonus points for its built-in receive ... after Timeout construct, which lets you model and react to timeouts in an elegant manner.

So, you picked your strategy, your distributed algorithm, your programming language and start the work. You’re confident you will slay this monster once and for all, as you ain’t no Level 1 wuss anymore.

Alas, somewhere down the road, some time after your first releases, you enter troubled waters. People tell you your distributed application has issues. The reports are all variations on a theme. They start with a frequency indicator like “sometimes” or “once”, and then describe a situation where the system is stuck in an undesirable state. If you’re lucky, you had adequate logging in place and start inspecting the logs. A little later, you discover an unfortunate sequence of events that produced the reported situation. Indeed, it was a new case. You never took this into consideration, and it never appeared during the extensive testing and simulation you did. So you change the code to take this case into account too.

Since you try to think ahead, you decide to build a monkey that pseudo randomly lets your distributed system do silly things. The monkey rattles its cage and quickly you discover a multitude of scenarios that all lead to undesirable situations like being stuck (never reaching consensus) or even worse: reaching an inconsistent state that should never occur.

Having a monkey was a great idea, and it certainly reduces the chance of encountering something you've never seen before in the field. Since you believe that a bugfix goes hand in hand with a testcase that first produced the bug, and now proves its demise, you set out to build just that test. Your problem however is that reproducing the failure scenario is difficult, if not impossible. You listen to the gods as they hinted: when in doubt, use brute force. So you produce a test that runs a zillion times to compensate for the small probability of the failure. This makes your bug fixing process slow and your test suites bulky. You compensate again by doing divide and conquer on your volume of test sets. Anyway, after a heavy investment of effort and time, you somehow manage to get a rather stable system and ditto process.

You’re maxed out on Level 2. Without new insights, you’ll be stuck here forever.

Level 3: Distributed Algorithms + Asynchronous messaging + Purity

It takes a while to realise that a combination of long running monkeys to discover evil scenarios and brute force to reproduce them ain’t making it. Using brute force just demonstrates ignorance. One of the key insights you need is that if you could only remove indeterminism from the equation, you would have perfect reproducibility of every scenario. A major side effect of Level 2 distributed programming is that your concurrency model tends to go viral on your codebase. You desired fine grained concurrency control… well you got it. It’s everywhere. So concurrency causes indeterminism and indeterminism causes trouble. So concurrency must go. You can’t abandon it: you need it. You just have to ban it from mingling with your distributed state machine. In other words, your distributed state machine has to become a pure function. No IO, No Concurrency, no nothing. Your state machine signature will look something like this

[code language=”fsharp”]
module type SM = sig
  type state
  type action
  type msg
  val step: msg -> state -> action * state
end
[/code]

You pass in a message and a state, and you get an action and a resulting state. An action is basically anything that tries to change the outside world, needs time to do so, and might fail while trying. Typical actions are

  • send a message
  • schedule a timeout
  • store something in persistent storage

The important thing to realise here is that you can only get to a new state via a new message, nothing else. The benefits of such a strict regime are legion: perfect control, perfect reproducibility and perfect traceability. The costs are there too. You're forced to reify all your actions, which basically is an extra level of indirection to reduce your complexity. You also have to model every change of the outside world that needs your attention as a message.
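A minimal Python sketch of such a pure step function (hypothetical message and action names, not taken from any real implementation) shows why reproducibility falls out for free:

```python
# A pure distributed-state-machine step: no IO, no clock, no randomness.
# Actions are reified as data; some outer loop is assumed to execute them.

def step(msg, state):
    """Map (msg, state) to (actions, new_state)."""
    kind, payload = msg
    if kind == "client_request":
        # Don't touch the world: *describe* the IO as actions.
        actions = [("store", payload), ("schedule_timeout", 1.0)]
        return actions, {**state, "pending": payload}
    if kind == "stored":
        actions = [("send", ("client", "ack"))]
        return actions, {**state, "pending": None,
                         "committed": state["pending"]}
    return [], state

def replay(msgs, state):
    """Feed a recorded message sequence through step; fully deterministic."""
    trace = []
    for msg in msgs:
        actions, state = step(msg, state)
        trace.append(actions)
    return trace, state

msgs = [("client_request", "x=1"), ("stored", None)]
initial = {"pending": None, "committed": None}
# Replaying the same messages yields the same trace and state, every time.
print(replay(msgs, initial) == replay(msgs, initial))  # True
```

Because `step` never performs IO itself, any scenario the monkey finds can be reproduced exactly by replaying the logged message sequence: no brute force needed.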

Another change from Level 2 is the change in control flow. At Level 2, a client will try to force an update and set the machinery in motion. Here, the distributed state machine assumes full control, and will only consider a client’s request when it is ready and able to do something useful with it. So these must be detached.

If you explain this to a Level 2 architect, (s)he will more or less accept this as an alternative. It, however, takes a sufficient amount of pain (let’s call it experience or XP) to realize it’s the only feasible alternative.

Level 4: Solid domination of distributed systems: happiness, peace of mind and a good night's rest

To be honest, as I’m a mere Level 3 myself, I don’t know what’s up here. I am convinced that both functional programming and asynchronous message passing are parts of the puzzle, but it’s not enough.
Allow me to reiterate what I’m struggling against. First, I want my distributed algorithm implementation to fully cover all possible cases.
This is a big deal to me, as I've lost lots of sleep being called in on issues in deployed systems (most of these turn out to be PEBKAC, but some were genuine, and cause frustration). It would be great to know your implementation is robust. Should I try theorem provers, should I do exhaustive testing? I don't know.
As an aside, for an append-only btree-ish library called baardskeerder, we know we covered all cases by exhaustively generating insert/delete permutations and asserting their correctness. Here, it's not that simple, and I'm a bit hesitant to Coqify the codebase.
Second, for reasons of clarity and simplicity, I decided not to touch other, orthogonal requirements like service discovery, authentication, authorization, privacy and performance.
With regard to performance, we might be lucky, as the asynchronous message passing at least doesn't seem to contradict performance considerations.
Security however is a real bitch as it crosscuts almost everything else you do. Some people think security is a sauce that you can pour over your application to make it secure.
Alas, I never succeeded in this, and currently think it also needs to be addressed strategically during the very first stages of design.

Closing words

Developing robust distributed systems is a difficult problem that is practically unsolved, or at least not solved to my satisfaction.
I’m sure its importance will increase significantly as latency between processors and everything else increases too. This results in an ever growing area of application for this type of application development.

As far as Level 4 goes, maybe I should ask Peter Van Roy. Over the years, I’ve read a lot of his papers, and they offered me a lot of insight in my own mistakes. The downside of insight is that you see others repeating your mistakes and most of the time, I fail to convince people they should do it differently.
Probably, this is because I cannot offer the panacea they want. They want RPC and they want it to work. It's perverse… almost religious.

Tech Field Day Interview at OpenStack Summit

At the OpenStack Summit in Austin, Wim Provoost got interviewed by Stephen Foskett. Stephen is the organizer of the Tech Field Day events, an independent IT influencer event. In this interview they discuss Open vStorage and talk about the benefits it can bring to enterprises that are looking to run OpenStack in production.

Playing with Open vStorage and Docker

I was looking for a way to play with Open vStorage on my laptop, with the ultimate goal of letting people easily experience Open vStorage without having to rack a whole cluster. The idea of running Open vStorage inside Docker, the open container platform, sounded pretty cool, so I accepted the challenge to create a hyperconverged setup with Docker images. In this blog post I will show you how to build a cluster of 2 nodes running Open vStorage on a single server or laptop. You could also use this dockerised approach if you want to deploy Open vStorage without having to reinstall the server or laptop after playing around with Open vStorage.

As I’m running Windows on my laptop, I started with creating a virtual machine in VirtualBox. The Virtual Machine I created has 4 dynamically allocated disks (OS – 8GB, Cache/DB/scrubber – 100GB and 2 backend disks of 100GB). The VM has 1 network card which is bridged. The exact details of the VM can be found below.

The steps to run Open vStorage in a Docker container:
Install Ubuntu 14.04 in the VM.
Next install Docker:

sudo apt-key adv --keyserver hkp:// --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo 'deb ubuntu-trusty main' | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update -qq
sudo apt-get purge lxc-docker
sudo apt-get install -y docker-engine

As I wanted to build a multi-container setup, the Open vStorage containers must be able to communicate with each other over the network. To achieve this I decided to use Weave, the open-source multi-host Docker networking project.

sudo curl -L -o /usr/local/bin/weave
sudo chmod a+x /usr/local/bin/weave

As a next step, download the Open vStorage Docker template and turn it into a Docker image. The Docker image has the latest Open vStorage packages already preloaded for your convenience. Currently the docker image is only hosted on GitHub but it will be pushed to the Docker Hub later on.

gzip -dc ovshc_unstable_img.tar.gz | sudo docker load

To make things easier we have an ovscluster setup script. This script uses the Docker CLI to create an Open vStorage container based upon the Docker image, joins the container to the Open vStorage cluster and configures the default settings of the container. You can download the cluster setup script from GitHub:

chmod +x

Get the Weave network up and running. Naturally, a free IP range for the Weave network is required; in this case I'm using:

sudo weave launch --ipalloc-range

Now use the cluster setup script to create the cluster and add the first Open vStorage container. I'm using ovshc1 as the hostname for the container, together with an IP from the Weave range.

sudo ./ create ovshc1

Once the first host of the Open vStorage cluster is created, you will get a bash shell inside the container. Do not exit the shell as otherwise the container will be removed.
That is all it takes to get an Open vStorage container up and running!

Time to create a Backend, vPool and vDisk
Since the Open vStorage container is now fully functional, it is time to do something with it: create a vDisk on it. But before the container can handle vDisks, it needs a backend to store the actual data. As backend I'll create an ALBA backend, the native Open vStorage backend, on top of 2 additional Virtual Machine disks.

Surf with your browser to the Open vStorage GUI over https and log in with the default login and password (admin/admin).

As a first step it is required to assign a read, write, DB and scrubber role to at least one of the disks of the container. In production always use an SSD for the DB and cache roles. The scrubber can be on a SATA drive. Select Storage Routers from the top menu. Select ovshc1.weave.local to see the details of the Storage Router and select the Physical disk tab. Select the gear icon of the first 100GB disk.
Assign the read, write, DB and scrubbing role and click Finish.
Wait until the partitions are configured (this could take up to 30 seconds).

Once these basic roles are assigned, it is time to create a new backend and assign the 2 additional virtual machine disks as ASDs (ALBA Storage Daemon) to that backend.

Select Backends from the top menu and click Add Backend. Give the new backend a name and click Finish.
Wait until the newly created backend becomes available (status will be green) and you see the disks of ovshc1.weave.local.
Initialize the first 2 disks by clicking the arrow and select Initialize.
Wait until the status of both disks turns dark blue and claim them for the backend.
Select the presets tab and select Add preset.
Give the new preset a name, select advanced settings and indicate you understand the risks of specifying a custom policy. Add a policy with k=1, m=0, c=1 and x=1 (store data on a single disk without any parity fragments on other disks; NEVER use this policy in production!).
Click Next and Finish.
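The trade-off behind a preset's policy can be sketched with simple arithmetic (general erasure-coding reasoning over k data and m parity fragments; a sketch, not ALBA's actual internals):

```python
# A (k, m) policy stores k data fragments plus m parity fragments:
# it survives the loss of up to m fragments at a storage overhead of (k+m)/k.

def policy_overhead(k, m):
    """Raw bytes stored per byte of user data."""
    return (k + m) / k

def fragments_lost_survivable(k, m):
    """How many fragments may be lost before data becomes unreadable."""
    return m

# The demo preset above: a single copy, no parity; cheap but fragile.
print(policy_overhead(1, 0), fragments_lost_survivable(1, 0))    # 1.0 0
# A production-style policy, e.g. 16 data + 4 parity fragments:
print(policy_overhead(16, 4), fragments_lost_survivable(16, 4))  # 1.25 4
```

This makes concrete why k=1, m=0 is for playing only: zero overhead, but a single disk failure loses data.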

On top of that backend a vpool is required (like a datastore in VMware) before a vDisk can be created. Select vPools from the top menu and click Add new vPool. Give the vpool a name and click the Reload button to load the backend. Select the newly created preset and click Next.

Leave the settings on the second screen of the wizard to the default ones and click Next. Set the read cache to 10GB and the write cache to 10GB. Click Next, Next and Finish to create the vPool.

If all goes well, on ovshc1, you will have one disk assigned for the DB/Cache/Scrubber roles for internal use, two disks assigned to ALBA, and one vPool exported for consumption by the cluster.

root@ovshc1:~# df -h
/dev/sdb1 63G 1.1G 59G 2% /mnt/hdd1
/dev/sdc1 64G 42M 64G 1% /mnt/alba-asd/IoeSU0gwIFk591fX9MqZ15CuIx8uMWV2
/dev/sdd1 64G 42M 64G 1% /mnt/alba-asd/Acxydl4BszLlppGKwvWCL2c2Jw7dTWW0
601a3b34-9426-4dc5-9c35-84fac81b42b6 64T 0 64T 0% /exports/vpool

The same vpool can be seen on the VM as follows:

$> df -h
601a3b34-9426-4dc5-9c35-84fac81b42b6 64T 0 64T 0% /mnt/ovs/vpool

Time to create that vDisk. Open a new session to the VM and go to /mnt/ovs/vpool (replace vpool by the name of your vPool). To see if the vPool is fully functioning, create a .raw disk, the raw disk format used by KVM, and put some load on the disk with fio. Check the Open vStorage GUI to see its performance!

sudo truncate -s 123G /mnt/ovs/vpool/diskname.raw
sudo apt-get install fio
fio --name=temp-fio --bs=4k --ioengine=libaio --iodepth=64 --size=1G --direct=1 --rw=randread --numjobs=12 --time_based --runtime=60 --group_reporting --filename=/mnt/ovs/vpool/diskname.raw

Adding a second host to the cluster
To add a second Open vStorage container to the cluster, create another VM with the same specs. Install Ubuntu, Docker and Weave. Just like for the first container download the Docker image and cluster script.

sudo apt-key adv --keyserver hkp:// --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo 'deb ubuntu-trusty main' | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update -qq
sudo apt-get purge lxc-docker
sudo apt-get install -y docker-engine
sudo curl -L -o /usr/local/bin/weave
sudo chmod a+x /usr/local/bin/weave
gzip -dc ovshc_unstable_img.tar.gz | docker load
chmod +x

To launch and join a second Open vStorage container execute:

sudo weave launch --ipalloc-range
sudo ./ join ovshc2

From A(pp) to B(ackend) – no compromise

While giving presentations I often get the question how Open vStorage is different compared to other block or scalable storage solutions in the market. My answer to that question is the following:

It is the only no-compromise storage platform as it combines the best of block, file and object storage into one storage platform.

Allow me to explain in more detail why I’m confident that Open vStorage fits that description. For many readers the first part (Block and Object) will be well known but for the sake of clarity I’d like to start with it.

Block and object:

Today there are 2 types of storage solutions that matter in the storage market: block storage and object storage.

  • Block storage, typically used for Virtual Machines and IO-intensive applications such as databases, is best known for its performance. It provides high-bandwidth, low-latency storage, and its value is typically expressed in IOPS/$. It also offers advanced data management features such as zero-copy snapshots, linked clones etc. The drawback of these block storage solutions is that they have limited scalability and are constrained to a single location. SANs, the most common block storage solution these days, are not only vulnerable to site failures, but even a 2-disk failure can cause major data loss. Traditional names selling block storage are EMC, NetApp and HP (3PAR), and almost all big-name vendors have a flagship SAN.
  • Object storage, typically used to store files and backups, is designed to be extremely scalable. To make sure data is stored safely against every possible disaster, data gets distributed across multiple locations. This distributed approach comes at the cost of high latency and low bandwidth compared to block storage. Object storage solutions also only offer a simple interface (get/put) without the advanced data management features. SwiftStack (Swift), Amplidata, Cleversafe and Scality are well-known names selling object storage solutions.

If you analyse the pros and cons of both solutions, it is easy to see that these are 2 completely different but complementary solutions.

     | Block Storage                                            | Object Storage
  +  | High performance, low latency, advanced data management  | Highly distributed, fault tolerant, highly scalable
  -  | Limited scalability, single site                         | Slow performance, no snapshots and clones

Data Flow:

If you look how Open vStorage takes care of the data flow from an application to the backend and back, it is easy to see that Open vStorage is no-compromise storage. It basically takes the best of both the block and the object world and combines it into a single solution. Allow me to explain by means of the different layers Open vStorage is built upon:

Open vStorage Data Flow

Open vStorage offers applications a wide set of access protocols: Block (QEMU, native/iSCSI), File (NFS/SMB), Object (S3 & Swift), HDFS and many more. Underneath this pass-through interface layer which offers all these different protocols, all IO requests get a performance boost from the Acceleration Layer. This layer exposes itself as a block storage layer and uses SSDs, PCIe flash cards and a log-structured approach to offer unmatched performance. On a write, data gets appended to the write buffer of that application and is immediately acknowledged to the application. This allows for the sub-millisecond latency required by databases. On top of that, each virtual disk for example has its own write buffer, so the IO-blender effect is completely eliminated.

Once data leaves the Acceleration Layer, it goes into the Data Management Layer, which offers the same data management functionality as high-end SANs: zero-copy snapshots, quick cloning, Distributed Transaction Logs (protection against an SSD failure) and many more. After the Data Management Layer, data goes to the Distribution Layer. In this layer, the incoming writes which were bundled by the Acceleration Layer into Storage Container Objects (SCOs) are optimized to be always accessible at minimal overhead. Typically each object (a collection of consecutive writes) will be chopped into different fragments and extended with some parity fragments. These fragments are in the end stored across different nodes or even datacenters.
The next layer takes care of the optional encryption and compression of the different fragments before they are dispatched with the appropriate write protocol of the backend.
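The fragment-plus-parity idea can be illustrated with the simplest possible parity scheme, a single XOR parity fragment (a toy k=3, m=1 code; real backends use more general erasure codes):

```python
# Toy fragment + parity demo: with one XOR parity fragment, the loss of any
# single fragment is survivable because it can be rebuilt from the others.

def xor_bytes(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data_fragments = [b"AAAA", b"BBBB", b"CCCC"]  # a chopped-up object
parity = xor_bytes(data_fragments)            # one parity fragment

# Simulate losing fragment 1 and rebuilding it from the survivors + parity:
survivors = [data_fragments[0], data_fragments[2], parity]
rebuilt = xor_bytes(survivors)
print(rebuilt == data_fragments[1])  # True
```

Spreading such fragments across nodes or datacenters is what lets the object-storage side of the stack survive disk and even site failures.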

If you look at this dataflow from a distance, you will see that the Acceleration and Data Management Layers give Open vStorage the positive features of block storage: superb performance, low latency, zero-copy snapshots, quick cloning etc. The Distribution and Compression Layers give Open vStorage the favorable features of object storage: scalability, high distribution, the ability to survive site failures etc.

To conclude, Open vStorage truly is the only storage solution which combines the best of both the block and object storage world in a single solution. Told you so!

Deploy an Open vStorage cluster with Ansible

At Open vStorage we build large Open vStorage clusters for customers. To prevent errors and cut down the deployment time we don't set up these clusters manually but automate the deployment through Ansible, a free software platform for configuring and managing IT environments.

Before we dive into the Ansible code, let's first have a look at the architecture of these large clusters. For large setups we take the HyperScale approach, as we call it, and split up storage and compute in order to scale them independently. From experience we have learned that storage on average grows 3 times as fast as compute.

We also use 3 types of nodes: controllers, compute and storage nodes.

  • Controllers: 3 dedicated, hardware-optimized nodes to run the master services and hold the distributed DBs. There is no vPool configured on these nodes, so no VMs are running on them. These nodes are equipped with a couple of large-capacity SATA drives for scrubbing.
  • Compute: These nodes run the extra services, are configured with vPools and run the VMs. We typically use blades or 1U servers here as they only need to be equipped with SSDs or PCIe flash cards.
  • Storage: The storage servers, 2U or 4U, are equipped with a lot of SATA drives but have less RAM and CPU.

The below steps will teach you how to set up an Open vStorage cluster through Ansible (Ubuntu) on these 3 types of nodes. Automating Open vStorage can of course also be achieved in a similar fashion with other tools like Puppet or Chef.

  • Install Ubuntu 14.04 on all servers of the cluster. Username and password should be the same on all servers.
  • Install Ansible on a pc or server you can use as Control Machine. The Control Machine is used to send instructions to all hosts in the Open vStorage cluster. Note that the Control Machine should not be part of the cluster so it can later also be used for troubleshooting the Open vStorage cluster.

    sudo apt-get install software-properties-common
    sudo apt-add-repository ppa:ansible/ansible
    sudo apt-get update
    sudo apt-get install ansible

  • Create /usr/lib/ansible, download the Open vStorage module to the Control Machine and put the module in /usr/lib/ansible.

    mkdir /opt/openvstorage/
    cd /opt/openvstorage/
    git clone -b release1.0
    mkdir /usr/lib/ansible
    cp -r dev_ops/Ansible/openvstorage_module_project/ /usr/lib/ansible

  • Edit the Ansible config file (/etc/ansible/ansible.cfg): uncomment the library line and change it to /usr/lib/ansible.

    vim /etc/ansible/ansible.cfg

    #inventory = /etc/ansible/hosts
    #library = /usr/share/my_modules/

    inventory = /etc/ansible/hosts
    library = /usr/lib/ansible

  • Edit the Ansible inventory file (/etc/ansible/hosts) and add the controller, compute and storage nodes to describe the cluster according to the below example:

    # This is the default ansible 'hosts' file.

    #cluster overview

    ctl01 ansible_host= hypervisor_name=mas01
    ctl02 ansible_host= hypervisor_name=mas02
    ctl03 ansible_host= hypervisor_name=mas03

    cmp01 ansible_host= hypervisor_name=hyp01

    str01 ansible_host=

    #cluster details


  • Execute the Open vStorage HyperScale playbook. (It is advised to execute the playbook in debug mode -vvvv)

    cd /opt/openvstorage/dev_ops/Ansible/hyperscale_project/
    ansible-playbook openvstorage_hyperscale_setup.yml -k -vvvv

The above playbook will install the necessary packages and run ‘ovs setup’ on the controller, compute and storage nodes. The next steps are assigning roles to the SSDs and PCIe flash cards, creating the backend and creating the first vPool.

QEMU, Shared Memory and Open vStorage

QEMU, Shared Memory and Open vStorage: it sounds like the beginning of a bad joke but actually it is a very cool story. Open vStorage quietly released in its latest version a Shared Memory Client/Server integration with the VolumeDriver (the component that offers the fast, distributed block layer). With this implementation the client (QEMU, Blktap, …) can write to a dedicated memory segment on the compute host which is shared with the Shared Memory Server in the Volume Driver. For the moment the Shared Memory client only understands block semantics, but in the future we will add file semantics so as to integrate an NFS server.

The benefits of the Shared Memory approach are very tangible:

  • As everything is in user-space, data copies from user to kernel space are eliminated so the IO performance is about 30-40% higher.
  • CPU consumption is about half for the same IO performance.
  • Easy way to build additional interfaces (f.e. block devices, iSCSI, … ) on top.

We haven’t integrated our modified QEMU build with Libvirt so at the moment some manual tweaking is still required if you want to give it a go:

Install the volumedriver-dev package:

sudo apt-get install volumedriver-dev

By default the Shared Memory Server is disabled. To enable it, update the vPool json (/opt/OpenvStorage/config/storagedriver/storagedriver/vpool_name.json) and add the entry "fs_enable_shm_interface": true under the filesystem section. After adding the entry, restart the Volume Driver for the vPool (restart ovs-volumedriver_vpool_name).
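The relevant fragment of the vPool json could then look like the sketch below (assuming a vPool named vpool_name; only the added key and its enclosing section are shown, the exact surrounding structure depends on your setup):

```json
{
    "filesystem": {
        "fs_enable_shm_interface": true
    }
}
```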
Next, build QEMU from the source. You can find the source here.

git clone
cd qemu/
./configure
make
sudo make install

There are 2 ways to create a QEMU vDisk:
Use QEMU to create the disk:

qemu-img create openvstorage:volume 10G

Alternatively create the disk in FUSE and start a VM by using the Open vStorage block driver:

truncate -s 10G /mnt//volume
qemu -drive file=openvstorage:volume,if=virtio,cache=none,format=raw ...

The Distributed Transaction Log explained

During my 1-on-1 sessions I quite often get the question how Open vStorage makes sure there is no data loss when a host crashes. As you probably already know, Open vStorage uses SSDs and PCIe flash cards inside the host where the VM is running to store incoming writes. All incoming writes for a volume get appended to a log file (SCO, Storage Container Object) and once enough writes are accumulated the SCO gets stored on the backend. Once the SCO is on the backend, Open vStorage relies on the functionality (erasure coding, 3-way replication, …) of the backend to make sure that data is stored safely.
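A toy sketch of this log-structured behaviour, with a deliberately tiny SCO size (file names and sizes are invented for the demo; real SCOs are tens of MiBs):

```shell
# Append fixed-size writes to the current SCO file; once it reaches
# SCO_SIZE, "ship" it to the backend and start a new one.
SCO_SIZE=$((4 * 1024))          # hypothetical 4 KiB SCO for the demo
DEMO_DIR=$(mktemp -d)
sco=0

write_block() {                 # append one 1 KiB write to the open SCO
  dd if=/dev/zero bs=1024 count=1 2>/dev/null >> "$DEMO_DIR/sco_$sco.log"
  if [ "$(wc -c < "$DEMO_DIR/sco_$sco.log")" -ge "$SCO_SIZE" ]; then
    echo "sco_$sco.log is full -> store on backend"
    sco=$((sco + 1))            # subsequent writes go to a fresh SCO
  fi
}

for i in 1 2 3 4 5; do write_block; done
```

Until a SCO is shipped, its contents exist only on the host's local flash, which is exactly the window the DTL protects.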

This means there is a window where data is vulnerable: while the SCO is being constructed and not yet stored on the backend. To ensure this vulnerable data isn't lost when a host crashes, incoming writes are also stored in the Distributed Transaction Log (DTL) on another host in the Open vStorage cluster. Note that the volume can even be restarted on another host than where the DTL was stored.

For the DTL of a volume you can select one of the following options as modus operandi:

  • No DTL: when this option is selected incoming data doesn't get stored in the DTL on another node. This option can be used when performance is key and some data loss is acceptable when the host or storage router goes down. Test VMs or VMs which are running batch or distributed applications (f.e. transcoding files to another format) can use this option.
  • Asynchronous: when this option is selected the incoming writes are added to a queue on the host and replicated to the DTL on the other host once the queue reaches a certain size or if a certain time is exceeded. To ensure consistency, all outstanding data is synced to the DTL in case a sync is executed within the file system of the VM. Virtual Machines running on KVM can use this option. This mode balances data safety and performance.

  • Synchronous: when this option is selected, every write request gets synchronized to the DTL on the other host. This option should be selected when absolutely no data loss is acceptable (distributed NFS, HA iSCSI disks). Since this option synchronizes on every write, it is the slowest mode of the DTL. Note that in case the DTL can't be reached (f.e. because the host is being rebooted), the incoming I/O isn't blocked and doesn't return an I/O error to the VM; instead an out-of-band event is generated to restart the DTL on another host.
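The asynchronous mode's flush rule can be sketched as follows (both thresholds are hypothetical, chosen only to illustrate the "size or age, whichever comes first" decision):

```shell
# Illustrative sketch only: the async DTL queues writes and flushes them
# once the queue grows past a size threshold or its oldest entry is too old.
MAX_QUEUE_ENTRIES=128   # hypothetical: flush when this many writes are queued
MAX_QUEUE_AGE=2         # hypothetical: ...or when the oldest entry is this old (s)

should_flush() {
  # $1 = current queue length, $2 = age of the oldest entry in seconds
  [ "$1" -ge "$MAX_QUEUE_ENTRIES" ] || [ "$2" -ge "$MAX_QUEUE_AGE" ]
}

should_flush 130 0 && echo "queue full: flush to DTL"
should_flush 10 5  && echo "queue stale: flush to DTL"
should_flush 10 0  || echo "keep queueing"
```

Batching writes this way is what lets the asynchronous mode trade a small window of potential data loss for much lower per-write overhead than the synchronous mode.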