2017, the Open vStorage predictions

2017 promises to be an interesting year for the storage industry. New technology is knocking at the door and the incumbent technology will not surrender without a fight. But it is not only new technology that will influence the market; the storage market itself is morphing:

Further Storage consolidation

Let’s say that December 2015 was an appetizer, with NetApp buying SolidFire. But in 2016 the storage market went through the first wave of consolidation: Docker storage start-up ClusterHQ shut its doors, Violin Memory filed for Chapter 11, Nutanix bought PernixData, NexGen was acquired by Pivot3, Broadcom acquired Brocade and Samsung acquired Joyent. Lastly there was also the mega-merger between storage mogul EMC and Dell. This consolidation trend will continue in 2017 as the environment for hyper-converged, flash and object storage startups is getting tougher: all the traditional vendors now offer their own flavor. As the hardware powering these solutions is commodity, the only differentiator is software.

Some interesting names to keep an eye on for M&A action or closure: Cloudian, Minio, Scality, Scale Computing, Stratoscale, Atlantis Computing, HyperGrid/Gridstore, Pure Storage, Tegile, Kaminario, Tintri, Nimble Storage, SimpliVity, Primary Data, … We are pretty sure some of these names will not make it past 2017.

Open vStorage has already a couple of large projects lined up. 2017 sure looks promising for us.

The Hybrid cloud

Back from the dead like a phoenix: I expect a new life for the hybrid cloud. Enterprises increasingly migrated to the public cloud in 2016 and this will only accelerate, both in speed and in numbers. There are now 5 big clouds: Amazon AWS, Microsoft Azure, IBM, Google and Oracle.
But connecting these public clouds with in-house datacenter assets will be key. The gap between public and private clouds has never been smaller. AWS and VMware, 2 front runners, are already offering products to migrate between both worlds. Network infrastructure (performance, latency) is now finally also capable of turning the hybrid cloud into reality. Numerous enterprises will realise that going to the public cloud isn’t the only option for future infrastructure. I believe migration of storage and workloads will be one of the hottest features of Open vStorage in 2017. Hand in hand with the migration of workloads we will see the birth of various new storage-as-a-service providers offering S3, secondary and even primary storage out of the public cloud.

On a side note, HPE (Helion), Cisco (Intercloud) and telecom giant Verizon closed their public clouds in 2016. It will be good to keep an eye on these players to see what they are up to in 2017.

The end of Hyper-Convergence hype

In the storage market prediction for 2015 I predicted the rise of hyper-convergence. Hyper-converged solutions have lived up to their expectations and have become a mature software solution. I believe 2017 will mark a turning point for the hyper-convergence hype. Let’s sum up some reasons for the end of the hype cycle:

  • The hyper-converged market is mature and the top use cases have been identified: SMB environments, VDI and Remote Office/Branch Office (ROBO).
  • Private and public clouds are becoming more and more centralised and large scale. More enterprises will come to understand that the one-size-fits-all and everything-in-a-single-box approach of hyper-converged systems doesn’t scale to a datacenter level. This is typically an area where hyper-converged solutions reach their limits.
  • The IT world works like a pendulum. Hyper-convergence brought flash as cache into the server because the latency to fetch data over the network was too high. With RDMA and round-trip times of 10 usec and below, the latency of the network is no longer the bottleneck. The pendulum is now swinging back as the web-scalers, the companies on which the hyper-convergence hype is based, want to disaggregate storage by moving flash out of each individual server into more flexible, centralized repositories.
  • Flash, Flash, Flash, everything is becoming flash. As stated earlier, the local flash device was used to accelerate slow SATA drives. With all-flash versions, these hyper-converged solutions go head to head with all-flash arrays.

One of the leaders of the hyper-converged pack has already started to move into the converged infrastructure direction by releasing a storage only appliance. It will be interesting to see who else follows.

With the new Fargo architecture, which is designed for large-scale, multi-petabyte, multi-datacenter environments, we already capture the next trend: meshed, hyper-aggregated architectures. The Fargo release supports RDMA, allows you to build all-flash storage pools and incorporates a distributed cache across all flash in the datacenter. 100% future proof and ready to kickstart 2017.

PS. If you want to run Open vStorage hyper-converged, feel free to do so. We have componentized Open vStorage so you can optimize it for your use case: run everything in a single box or spread the components across different servers or even datacenters!

IoT storage lakes

More and more devices are connected to the internet. This Internet of Things (IoT) is poised to generate a tremendous amount of data. Not convinced? Intel research, for example, estimated that autonomous cars will produce 4 terabytes of data daily per car. These Big Data lakes need a new type of storage: storage which is ultra-scalable. Traditional storage is simply not suited to handle this amount of data. On top of that, in 2017 we will see artificial intelligence increasingly being used to mine the data in these lakes, which means the performance of the storage needs to be able to serve real-time analytics. Since IoT devices can be located anywhere in the world, geo-redundancy and geo-distribution are also required. Basically, IoT use cases are a perfect match for the Open vStorage technology.

Some interesting fields and industries to follow are consumer goods (smart thermostats, IP cameras, toys, …), automotive and healthcare.

The Game of Distributed Systems Programming. Which Level Are You?

(originally published on the incubaid.com blog, 2012/03/28)

Introduction

When programming distributed systems becomes part of your life, you go through a learning curve. This article tries to describe my current level of understanding of the field, and hopefully points out enough mistakes for you to be able to follow the most optimal path to enlightenment: learning from the mistakes of others.
For the record: I entered Level 1 in 1995, and I’m currently Level 3. Where do you see yourself?

Level 0: Clueless

Every programmer starts here. I will not comment too much here as there isn’t a lot to say. Instead, I quote some conversations I had, and offer some words of advice to developers that never battled distributed systems.

NN1: “Replication in distributed systems is easy, you just let all the machines store the item at the same time.”

Another conversation (from the back of my memory):

NN: “For our first person shooter, we’re going to write our own networking engine”
ME: “Why?”
NN: “There are good commercial engines, but license costs are expensive and we don’t want to pay these.”
ME: “Do you have any experience in distributed systems?”
NN: “Yes, I’ve written a socket server before.”
ME: “How long do you think you will take to write it?”
NN: “I think 2 weeks. Just to be really safe we planned 4.”

Sometimes it’s better to remain silent.

Level 1: RPC

RMI is a very powerful technique for building large systems. The fact that the technique can be described, along with a working example, in just a few pages, speaks volumes of Java. RMI is tremendously exciting and it’s simple to use. You can call to any server you can bind to, and you can build networks of distributed objects. RMI opens the door to software systems that were formerly too complex to build.

Peter van der Linden, Just Java (4th edition, Sun Microsystems)

Let me start by saying I’m not dissing this book. I remember distinctly that it was fun to read (especially the anecdotes between the chapters), and I used it for the Java lessons I used to give (in a different universe, a long time ago). In general, I think well of it. His attitude towards RMI, however, is typical of Level 1 distributed application design. People that reside here share the vision of unified objects. In fact, Waldo et al. describe it in detail in their landmark paper “A Note on Distributed Computing” (1994), but I will summarize it here:
The advocated strategy to writing distributed applications is a three phase approach. The first phase is to write the application without worrying about where objects are located and how their communication is implemented. The second phase is to tune performance by “concretizing” object locations and communication methods. The final phase is to test with “real bullets” (partitioned networks, machines going down, …).

The idea is that whether a call is local or remote has no impact on the correctness of a program.

The same paper then dissects this further and shows the problems with it. It has thus been known for almost 20 years that this concept is wrong. Anyway, if Java RMI achieved one thing, it’s this: even if you remove transport protocol, naming and binding, and serialization from the equation, it still doesn’t work. People old enough to remember the hell called CORBA will also remember it didn’t work, but they have an excuse: they were still battling all kinds of lower-level problems. Java RMI took all of these away and made the remaining issues stick out. There are two of them. The first is a mere annoyance:

Network Transparency isn’t

Let’s take a look at a simple Java RMI example (taken from the same ‘Just Java’)

[code language=”java”]
public interface WeatherIntf extends java.rmi.Remote {
    public String getWeather() throws java.rmi.RemoteException;
}

[/code]

A client that wants to use the weather service needs to do something like this:

[code language=”java”]
try {
    Remote robj = Naming.lookup("//localhost/WeatherServer");
    WeatherIntf weatherserver = (WeatherIntf) robj;
    String forecast = weatherserver.getWeather();
    System.out.println("The weather will be " + forecast);
} catch (Exception e) {
    System.out.println(e.getMessage());
}
[/code]

The client code needs to take RemoteExceptions into account.
If you want to see what kinds of remote failure you can encounter, take a look at its more than 20 subclasses. Ok, so your code will be a tad less pretty. We can live with that.

Partial Failure

The real problem with RMI is that the call can fail partially. It can fail before the action on the other tier is invoked, or the invocation might succeed but the return value might not make it afterwards, for whatever reason. These failure modes are in fact the very defining property of distributed systems or otherwise stated:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
(Leslie Lamport)

If the method is just the retrieval of a weather forecast, you can simply retry, but if you were trying to increment a counter, retrying can have results ranging from 0 to 2 updates. The solution is supposed to come from idempotent actions, but building those isn’t always possible. Moreover, since you decided on a semantic change of your method call, you basically admit RMI is different from a local invocation. This is an admission of RMI being a fallacy.
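To make the retry problem concrete, here is a minimal sketch (plain Python with hypothetical names, nothing to do with RMI or any real library) of a naive, non-idempotent increment versus one made idempotent by deduplicating on a client-supplied request id:

[code language="python"]
import uuid

class FlakyCounter:
    """Toy server-side counter; imagine the reply can get lost after the update."""
    def __init__(self):
        self.value = 0
        self.applied = set()   # request ids that were already applied

    def increment_naive(self):
        # Retrying this after a lost reply double-counts: 0, 1 or 2 updates.
        self.value += 1
        return self.value

    def increment_idempotent(self, request_id):
        # Deduplicate on a client-chosen request id: applied at most once.
        if request_id not in self.applied:
            self.applied.add(request_id)
            self.value += 1
        return self.value

counter = FlakyCounter()
req = str(uuid.uuid4())
counter.increment_idempotent(req)   # original call; suppose the reply is lost
counter.increment_idempotent(req)   # blind retry of the same logical request
assert counter.value == 1           # the state still changed exactly once
[/code]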

In any case the paradigm is a failure as both network transparency and architectural abstraction from distribution just never materialise. It also turns out that some software methodologies are more affected than others. Some variations of scrum tend to prototype. Prototypes concentrate on the happy path and the happy path is not the problem. It basically means you will never escape Level 1. (sorry, this was a low blow. I know)

People who do escape Level 1 understand they need to address the problem with the respect it deserves. They abandon the idea of network transparency, and attack the handling of partial failure strategically.

Level 2: Distributed Algorithms + Asynchronous messaging + Language support

<sarcasm>”Just What We Need: Another RPC Package” </sarcasm>
(Steve Vinoski)

Ok, you’ve learned the fallacies of distributed computing. You decided to bite the bullet and model the message passing explicitly to get a grip on failure.
You split your application into 2 layers, the bottom being responsible for networking and message transport, while the upper layer deals with the arrival of messages, and what needs to be done when they do.
The upper layer implements a distributed state machine, and if you ask the designers what it does, they will tell you something like : “It’s a multi-paxos implementation on top of TCP”.
Development-wise, the strategy boils down to this: programmers first develop the application centrally, using threads to simulate the different processes. Each thread runs a part of the distributed state machine and basically is responsible for running a message handling loop. Once the application is locally complete and correct, the threads are taken away to become real processes on remote computers. At this stage, in the absence of network problems, the distributed application is already working correctly. In a second phase fault tolerance can be straightforwardly achieved by configuring each of the distributed entities to react correctly to failures (I liberally quoted from “A Fault Tolerant Abstraction for Transparent Distributed Programming”).

Partial failure is handled by design, because of the distributed state machine. With regards to threads, there are a lot of options, but you prefer coroutines (they are called fibers, lightweight threads, microthreads, protothreads or just threads in various programming languages, causing a Babylonian confusion) as they allow for fine-grained concurrency control.

Combined with the insight that “C ain’t gonna make my network any faster”, you move to programming languages that support this kind of fine grained concurrency.
Popular choices are (in arbitrary order)

(Note how they tend to be functional in nature)

As an example, let’s see what such code looks like in Erlang (taken from Erlang concurrent programming)

[code language=”erlang”]
-module(tut15).

-export([start/0, ping/2, pong/0]).

ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("ping finished~n", []);

ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

pong() ->
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

start() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).
[/code]

This definitely looks like a major improvement over plain old RPC. You can start reasoning over what would happen if a message doesn’t arrive.
Erlang gets bonus points for having timeout messages and a built-in after Timeout construct that lets you model and react to timeouts in an elegant manner.

So, you picked your strategy, your distributed algorithm, your programming language and start the work. You’re confident you will slay this monster once and for all, as you ain’t no Level 1 wuss anymore.

Alas, somewhere down the road, some time after your first releases, you enter troubled waters. People tell you your distributed application has issues. The reports are all variations on a theme. They start with a frequency indicator like “sometimes” or “once”, and then describe a situation where the system is stuck in an undesirable state. If you’re lucky, you had adequate logging in place and start inspecting the logs. A little later, you discover an unfortunate sequence of events that produced the reported situation. Indeed, it was a new case. You never took this into consideration, and it never appeared during the extensive testing and simulation you did. So you change the code to take this case into account too.

Since you try to think ahead, you decide to build a monkey that pseudo randomly lets your distributed system do silly things. The monkey rattles its cage and quickly you discover a multitude of scenarios that all lead to undesirable situations like being stuck (never reaching consensus) or even worse: reaching an inconsistent state that should never occur.

Having a monkey was a great idea, and it certainly reduces the chance of encountering something you’ve never seen before in the field. Since you believe that a bugfix goes hand in hand with a testcase that first reproduces the bug and then proves its demise, you set out to build just that test. Your problem, however, is that reproducing the failure scenario is difficult, if not impossible. You listen to the gods as they hinted: when in doubt, use brute force. So you produce a test that runs a zillion times to compensate for the small probability of the failure. This makes your bug fixing process slow and your test suites bulky. You compensate again by doing divide and conquer on your volume of test sets. Anyway, after a heavy investment of effort and time, you somehow manage to get a rather stable system and ditto process.
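To put a number on that “zillion”: if a single run triggers the failure with probability p, you need roughly ln(1−c)/ln(1−p) runs to see it at least once with confidence c. A hypothetical back-of-the-envelope sketch:

[code language="python"]
import math

def runs_needed(p, confidence=0.99):
    """How many independent runs to hit a failure of probability p at least once."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# A race that bites once every 10,000 runs needs ~46,000 runs for 99% confidence.
print(runs_needed(1e-4))   # 46050
print(runs_needed(1e-6))   # 4605168 -- hence the bulky, slow test suites
[/code]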

You’re maxed out on Level 2. Without new insights, you’ll be stuck here forever.

Level 3: Distributed Algorithms + Asynchronous messaging + Purity

It takes a while to realise that a combination of long running monkeys to discover evil scenarios and brute force to reproduce them ain’t making it. Using brute force just demonstrates ignorance. One of the key insights you need is that if you could only remove indeterminism from the equation, you would have perfect reproducibility of every scenario. A major side effect of Level 2 distributed programming is that your concurrency model tends to go viral on your codebase. You desired fine grained concurrency control… well you got it. It’s everywhere. So concurrency causes indeterminism and indeterminism causes trouble. So concurrency must go. You can’t abandon it: you need it. You just have to ban it from mingling with your distributed state machine. In other words, your distributed state machine has to become a pure function. No IO, No Concurrency, no nothing. Your state machine signature will look something like this

[code language=”fsharp”]
module type SM = sig
  type state
  type action
  type msg
  val step: msg -> state -> action * state
end
[/code]

You pass in a message and a state, and you get an action and a resulting state. An action is basically anything that tries to change the outside world, needs time to do so, and might fail while trying. Typical actions are

  • send a message
  • schedule a timeout
  • store something in persistent storage

The important thing to realise here is that you can only get to a new state via a new message. Nothing else. The benefits of such a strict regime are legion: perfect control, perfect reproducibility and perfect traceability. The costs are there too. You’re forced to reify all your actions, which basically is an extra level of indirection to reduce your complexity. You also have to model every change of the outside world that needs your attention into a message.
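As a minimal illustration of that signature in Python (a hypothetical toy, not the actual Open vStorage or Arakoon code): the step function is pure, so replaying the same message trace reproduces exactly the same states and actions.

[code language="python"]
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class State:
    value: int
    last_req: Optional[str]     # used to deduplicate redelivered messages

@dataclass(frozen=True)
class Msg:
    req_id: str
    delta: int

def step(msg: Msg, state: State) -> Tuple[List[str], State]:
    """Pure step: no IO, no clock, no randomness; side effects are reified as actions."""
    if msg.req_id == state.last_req:                      # duplicate delivery
        return (["send_ack " + msg.req_id], state)        # re-ack, don't re-apply
    new_state = State(state.value + msg.delta, msg.req_id)
    actions = ["persist " + str(new_state.value),         # store in persistent storage
               "send_ack " + msg.req_id]                  # send a message
    return (actions, new_state)

# Replaying the same trace always yields the same result: perfect reproducibility.
state = State(0, None)
for m in [Msg("a", 1), Msg("a", 1), Msg("b", 2)]:
    actions, state = step(m, state)
assert state.value == 3
[/code]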

Another change from Level 2 is the change in control flow. At Level 2, a client will try to force an update and set the machinery in motion. Here, the distributed state machine assumes full control, and will only consider a client’s request when it is ready and able to do something useful with it. So these must be detached.

If you explain this to a Level 2 architect, (s)he will more or less accept this as an alternative. It, however, takes a sufficient amount of pain (let’s call it experience or XP) to realize it’s the only feasible alternative.

Level 4: Solid domination of distributed systems: happiness, peace of mind and a good night’s rest

To be honest, as I’m a mere Level 3 myself, I don’t know what’s up here. I am convinced that both functional programming and asynchronous message passing are parts of the puzzle, but it’s not enough.
Allow me to reiterate what I’m struggling against. First, I want my distributed algorithm implementation to fully cover all possible cases.
This is a big deal to me as I’ve lost lots of sleep being called in on issues in deployed systems (most of these turn out to be PEBKAC, but some were genuine, and those cause frustration). It would be great to know your implementation is robust. Should I try theorem provers, should I do exhaustive testing? I don’t know.
As an aside, for an append-only btree-ish library called Baardskeerder, we know we covered all cases by exhaustively generating insert/delete permutations and asserting their correctness. Here, it’s not that simple, and I’m a bit hesitant to Coqify the codebase.
Second, for reasons of clarity and simplicity, I decided not to touch other, orthogonal requirements like service discovery, authentication, authorization, privacy and performance.
With regard to performance, we might be lucky as the asynchronous message passing at least doesn’t seem to contradict performance considerations.
Security however is a real bitch as it crosscuts almost everything else you do. Some people think security is a sauce that you can pour over your application to make it secure.
Alas, I never succeeded in this, and currently think it also needs to be addressed strategically during the very first stages of design.

Closing words

Developing robust distributed systems is a difficult problem that is practically unsolved, or at least not solved to my satisfaction.
I’m sure its importance will increase significantly as latency between processors and everything else increases too. This results in an ever-growing area of application for this type of application development.

As far as Level 4 goes, maybe I should ask Peter Van Roy. Over the years, I’ve read a lot of his papers, and they offered me a lot of insight into my own mistakes. The downside of insight is that you see others repeating your mistakes, and most of the time I fail to convince people they should do it differently.
Probably this is because I cannot offer the panacea they want. They want RPC and they want it to work. It’s perverse … almost religious.

From A(pp) to B(ackend) – no compromise

While giving presentations I often get the question how Open vStorage is different from other block or scalable storage solutions in the market. My answer to that question is the following:

It is the only no-compromise storage platform as it combines the best of block, file and object storage into one storage platform.

Allow me to explain in more detail why I’m confident that Open vStorage fits that description. For many readers the first part (Block and Object) will be well known but for the sake of clarity I’d like to start with it.

Block and object:

Today there are 2 types of storage solutions that matter in the storage market: block and object storage.

  • Block storage, typically used for Virtual Machines and IO-intensive applications such as databases, is best known for its performance. Block storage solutions provide high-bandwidth, low-latency storage and their value is typically expressed in IOPS/$. They also offer advanced data management features such as zero-copy snapshots, linked clones etc. The drawback of these block storage solutions is that they have limited scalability and are constrained to a single location. SANs, the most common block storage solution these days, are not only vulnerable to site failures but even a 2-disk failure can cause major data loss. Traditional names selling block storage are EMC, NetApp and HP (3PAR), and almost all big-name vendors have a flagship SAN.
  • Object storage, typically used to store files and backups, is designed to be extremely scalable. To make sure data is stored safely against every possible disaster, data gets distributed across multiple locations. This distributed approach comes at the cost of high latency and low bandwidth compared to block storage. Object storage solutions also only offer a simple interface (get/put) without the advanced data management features. SwiftStack (Swift), Amplidata, Cleversafe and Scality are well-known names selling object storage solutions.

If you analyse the pros and cons of both solutions, it is easy to see that these are 2 completely different but complementary solutions.

[table]
, Block Storage, Object Storage
Pros, High performance & low latency & advanced data management, Highly distributed & fault tolerant & highly scalable
Cons, Limited scalability & single site, Slow performance & no snapshots or clones
[/table]

Data Flow:

If you look at how Open vStorage takes care of the data flow from an application to the backend and back, it is easy to see that Open vStorage is no-compromise storage. It basically takes the best of both the block and the object world and combines them into a single solution. Allow me to explain by means of the different layers Open vStorage is built upon:

Open vStorage Data Flow

Open vStorage offers applications a wide set of access protocols: Block (QEMU native, iSCSI), File (NFS/SMB), Object (S3 & Swift), HDFS and many more. Underneath this pass-through interface layer, which offers all these different protocols, all IO requests receive a performance boost from the Acceleration Layer. This layer exposes itself as a block storage layer and uses SSDs, PCIe flash cards and a log-structured approach to offer unmatched performance. On a write, data gets appended to the write buffer of that application and is immediately acknowledged to the application. This allows for the sub-millisecond latency required by databases. On top of that, each virtual disk has its own write buffer, so the IO-blender effect is completely eliminated.
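A rough, purely illustrative sketch of the write-buffer idea (hypothetical Python, not the actual volumedriver code): every vDisk appends to its own log on flash, gets an immediate acknowledgement, and full buffers are bundled into Storage Container Objects (SCOs) for the backend.

[code language="python"]
class WriteBuffer:
    """Per-vDisk append-only write buffer that bundles writes into SCOs."""
    def __init__(self, sco_size=4 * 1024 * 1024):   # assume ~4MB per SCO
        self.entries = []          # pending (lba, data) pairs on the flash device
        self.pending = 0
        self.sco_size = sco_size

    def write(self, lba, data):
        self.entries.append((lba, data))   # append-only: no read-modify-write
        self.pending += len(data)
        sco = None
        if self.pending >= self.sco_size:  # a full buffer becomes an SCO for the backend
            sco, self.entries, self.pending = self.entries, [], 0
        return "ack", sco                  # acknowledge as soon as the append landed

buf = WriteBuffer(sco_size=8)
print(buf.write(0, b"abcd"))   # ('ack', None) -> still buffered on flash
print(buf.write(1, b"efgh"))   # ('ack', [(0, b'abcd'), (1, b'efgh')]) -> SCO emitted
[/code]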

Once data leaves the Acceleration Layer, it goes into the Data Management Layer which offers the same data management functionality as high-end SANs: zero-copy snapshots, quick cloning, Distributed Transaction Logs (protection against an SSD failure) and many more. After the Data Management Layer, data goes to the Distribution Layer. In this layer the incoming writes, which were bundled by the Acceleration Layer into Storage Container Objects (SCOs), are optimized to always be accessible at minimal overhead. Typically each object (a collection of consecutive writes) will be chopped into different fragments and extended with some parity fragments. These fragments are in the end stored across different nodes or even datacenters.
The next layer takes care of the optional encryption and compression of the different fragments before they are dispatched with the appropriate write protocol of the backend.
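Below is a deliberately simplified sketch of what “chopping an SCO into fragments and adding parity” means (hypothetical Python using a single XOR parity fragment; the real backend uses proper erasure coding with multiple parity fragments):

[code language="python"]
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(sco: bytes, k: int = 4):
    """Chop an SCO into k equal data fragments plus one XOR parity fragment."""
    size = -(-len(sco) // k)                                        # ceil division
    frags = [sco[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return frags, reduce(xor, frags)   # fragments get spread over nodes/datacenters

def rebuild(frags, parity, lost):
    """Recover one lost data fragment from the surviving fragments and the parity."""
    survivors = [f for i, f in enumerate(frags) if i != lost]
    return reduce(xor, survivors, parity)

fragments, parity = encode(b"one SCO worth of consecutive writes.", k=4)
assert rebuild(fragments, parity, lost=2) == fragments[2]   # single failure survived
[/code]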

If you look at this data flow from a distance, you will see that the Acceleration and Data Management Layers give Open vStorage the positive features of block storage: superb performance, low latency, zero-copy snapshots, quick cloning etc. The Distribution and Compression Layers give Open vStorage the favorable features of object storage: scalability, a highly distributed design and the ability to survive site failures.

To conclude, Open vStorage truly is the only storage solution which combines the best of both the block and object storage world in a single solution. Told you so!

2016: Cheers to the New Year!

The past year has been a remarkable one for Open vStorage. We did 2 US roadshows, attended a successful OpenStack Summit in Vancouver, moved and open-sourced all of Open vStorage on GitHub and released a lot of new functionality (our own hyperconverged backend, detailed tuning & caching parameters for vDisks, a certified OpenStack Cinder plugin, remote support, CentOS 7, …). The year also ended with a bang as customers were trying to beat each other’s top 4K IOPS results.


While it might look hard to beat the success of 2015, the Open vStorage team is confident that 2016 will be even more fruitful. Feature-wise, some highly anticipated features will see the light in Q1: improved QEMU integration, block devices (blktap support), Docker support (Flocker), iSCSI disks, replication, support for an all-flash backend, … Next to these product features, the team will open its kimono and discuss the Open vStorage internals in detail in our new GitBook documentation. In order to be more in the spirit of the open-source community, blueprints of upcoming features will also be published as much as possible to the GitHub repo. A first example is the move of the Open vStorage config files to etcd. Finally, based upon the projects of partners and customers in the pipeline, 2016 will be unrivaled. So far a 39-node cluster (nearly 125TB of flash alone) is being deployed in production and multiple datacenters (US, Europe, Asia) are being transformed to use Open vStorage as their storage platform.

Cheers to the New Year. May it be a memorable one.

I like to move it, move it

The vibe at the Open vStorage office is these days best explained by a song of the early nineties:

I like to move it, move it ~ Reel 2 Real

While the summer time is in most companies a more quiet time, the Open vStorage office is buzzing like a beehive. Allow me to give you a short overview of what is happening:

  • We are moving into our new, larger and stylish offices. The address remains the same but we are moving into a completely remodeled floor of the Idola business center.
  • Next to physically moving desks at the Open vStorage HQ, we are also moving our code from BitBucket to GitHub. We have centralized all our code under https://github.com/openvstorage. To list a few of the projects: Arakoon (our consistent distributed key-value store), ALBA (the Open vStorage default ALternate BAckend) and of course Open vStorage itself. Go check it out!
  • Finishing up our Open vStorage 2.2 GA release.
  • Adding support for Red Hat and CentOS by merging in the CentOS branch. There is still some work to do around packaging, testing and upgrades, so feel free to give a hand. As this was really a community effort, we owe everyone a big thank you.
  • Working on some very cool features (RDMA anyone?) but let’s keep those for a separate post.
  • Preparation for VMworld (San Francisco) and the OpenStack summit in Tokyo.

As you can see, many things going on at once so prepare for a hot Open vStorage fall!

Meet the team at the OpenStack Summit in Vancouver!

Like last year, the Open vStorage team will attend the OpenStack Summit in Vancouver. Who would want to miss the success stories of OpenStack in the field, the new developments and the many interesting sessions for ops and developers? Not this team! In case you want to meet us, locate booth T21 on the show floor or scan the summit for someone with a blue shirt with the Open vStorage logo on the back. We will be there with a whole bunch of people, so whether you want to discuss business, operations, deep technical issues/wishes or just socialize, we have someone in Vancouver who can address your questions and needs.
For us the Kilo release is also time for a little party: Open vStorage will be an official, certified Cinder plugin. How cool is that!

During the OpenStack Summit we will present 2 new developments:

  • The vRUN appliance: a stackable, hyperconverged private cloud appliance based on OpenStack and Open vStorage. The appliance comes by default with Open vStorage support, but if needed we can add OpenStack support and monitoring to the mix.
  • The free Open vStorage community edition: the new Open vStorage community edition makes it easy to set up a hyperconverged private cloud by using local SATA drives as Tier 2 storage. It allows small users to build and run a complete OpenStack-based cloud without any software cost. It also allows users who are looking at Open vStorage as the storage layer for a large scale-out cloud to test the hyperconverged functionality without any cost.

Open vStorage US Roadshow Q2

After the successful first Open vStorage Roadshow, we decided to do a second US Roadshow. You can meet us during one of the following Meetups in the US:

During these meetups we will discuss what Open vStorage exactly does and the latest developments around the project such as how to set up Open vStorage in a HyperConverged fashion on local disks and the new metadata server architecture. We will of course provide pizzas and drinks during these meetups!

Next to these community events, we are also organizing 2 business events in the Bay Area (Santa Clara 04/16, Menlo Park 04/20). During these business events we will discuss how to set up a profitable IaaS (Infrastructure as a Service) or private cloud business and unveil our hyperconverged solution built on top of OpenStack and our own Open vStorage. This solution will include 24/7 support for Open vStorage as well as OpenStack! You can register for one of these free business events here.

You like open-source, storage and writing code?

You like open-source, storage and writing code, and you are interested in contributing towards the development and adoption of both Open vStorage and Kinetic technology? Yes? Well, in that case you can write code during your spare time and contribute it to the project. That is something we of course highly appreciate and it will lead to a (Belgian) beer when we meet in real life. But we are aware that some people are looking for a more intense relationship, full-time as a freelancer or on the payroll. If you would love to contribute to this exciting project, let us know! I will be sitting by my mailbox (wim@openvstorage.com) waiting for you. We have developers from everywhere in the world working on this project, so we accept candidates from all over the world.

We are looking for people who have the following skills:

  • In depth knowledge of FUSE, NFS, GlusterFS, Ceph
  • In depth knowledge of QEMU, VMware, KVM, Xen, Docker
  • Hardcore C++ or OCaml coder
  • Experience in kernel or device drivers
  • Experience with distributed applications such as Hadoop (optional)

Hamburgers, french fries and hyperconvergence

During the first Open vStorage roadshow in the US, I noticed people have a lot of questions about convergence and hyperconvergence:

Can you help me with the term “hyperconverged”? I believe it is a marketing buzzword, but it is something that my executives have glommed onto.

While I was waiting to fly back home, I was eating a burger and french fries. Let’s be honest, the US has the best places to eat burgers but while eating and staring at the planes, I suddenly had an epiphany on how to explain convergence, hyperconvergence and how Open vStorage is related: burgers and french fries.

Let’s say that hamburgers are the compute (RAM, CPU, the host where the VMs are running), french fries are the storage and barbecue sauce is the storage performance. In that case a converged solution is like ordering a hamburger menu: one SKU will get you a plate with a hamburger and french fries on the side. You even have different menus with smaller or bigger hamburgers and more or less french fries. When you order a ‘converged burger’ the barbecue sauce will be on the french fries (SSDs inside the SAN). It works but it is not ideal. With a ‘hyperconverged burger’, instead of receiving french fries separately, you will receive a single hamburger with french fries and barbecue sauce as topping on the burger. Allow me to explain: with a hyperconverged appliance the compute (hamburger), the Tier 1 (barbecue sauce) and the Tier 2 (french fries) storage are all inside the same appliance. Open vStorage is neither of these. With Open vStorage, the hamburger will be topped with barbecue sauce (compute and Tier 1 inside the same host) but you get the french fries on the side.

As said, Open vStorage is not meant to be used as a hyperconverged solution like Nutanix or SimpliVity. The Open vStorage software can be used that way, but we at CloudFounders don’t believe hyperconverged is the right way to build scalable solutions. We believe a converged solution with Tier 1 inside the compute, let’s call it flexi-converged, is a much better fit for multiple reasons:

  • Storage growth: storage needs typically grow 3 times faster than CPU needs. So adding more compute (CPU & RAM, hypervisors) just because you need more Tier 2 backend storage is just throwing away money. If you go to a hamburger restaurant and you want more french fries, you just order another portion of fries. It doesn’t make sense to order another hamburger (with french fries as topping) if you only want french fries.
  • Storage performance: since a hyperconverged appliance only has a limited number of bays, you have to decide between adding an SSD or a SATA drive to a bay. You need the SSDs for performance, so that limits the available bays for capacity-optimized SATA disks. A hyperconverged appliance makes a trade-off between storage performance (more flash) and storage capacity (more SATA). As a result you end up with appliances costing $180,000 (!) which can run 100 Virtual Machines but can store only a total of 5TB (15TB raw) of data. Due to the 3-way replication, storing all data 3 times for redundancy reasons, the balance is completely off: each Virtual Machine can only have 50GB of data! What you want is to be able to scale storage capacity and storage performance independently. Let’s make it clearer: when you order the ‘hyperconverged burger’ you get a burger with barbecue sauce and french fries on top of the burger. Since every burger has a certain size, there is a limit to the amount of french fries and barbecue sauce you can add as topping to the burger. If you want more french fries, you will have to cut back on the barbecue sauce. It is as simple as that. With Open vStorage, the french fries are on the side so you can order as many additional portions as needed. With the Seagate Kinetic integration you can simply add the additional drives to your pool of Tier 2 backend storage, et voilà, you have more space for Virtual Machine data without having to sacrifice storage performance.
  • Performance of the backend: when implementing a tiered architecture, you don’t want your Tier 2 storage layer (‘the cold storage’) to limit the performance of your Tier 1 layer (‘the caching layer’). The Tier 1 is expensive and optimized for storage performance by using SSDs or PCIe flash, so it is a big issue if the speed of the Tier 2 storage becomes a bottleneck to digest data coming from Tier 1. The performance of the Tier 2 storage is determined by the number of disks and their speed. This is why you see hyperconverged models using two 1TB disks instead of a single 2TB disk: they need the spindles in the backend to make sure the Tier 1 caching layer isn’t impacted by a choking backend. This is a real issue. At CloudFounders we have had situations in the past where we had to add disks to the backend just to make sure it could digest what was coming from the cache. Let’s do the math to explain the issue in more detail (the arithmetic is also written out in the short sketch after this list). Your Tier 1 can easily do 50-70K IOPS of 4K blocks. Let’s assume that this is a mix: 20K write IOPS and 50K read IOPS. The SSD/PCIe flash card will take the first hit for these 20K write IOPS (which is a piece of cake for flash technology) but once data is evicted from that SSD it needs to go to the backend. Storage solutions will typically do some aggregation of those 4K writes into bigger chunks (Nutanix creates 1MB (4K*250) chunks, Open vStorage accumulates 1000 writes into objects of 4MB) to minimize the backend traffic. So Nutanix needs to store, in the optimal scenario, 80 IOPS (20K/250) to the backend. This is the optimal scenario as they don’t work with an append-style log, but we will devote another blog post to this. Nutanix uses 3-way replication, so 80 IOPS become 240 IOPS across multiple disks. These disks contain a file system, so there is additional IO overhead as each hop in the directory structure is another IO. Let’s assume for the sake of simplicity that we only have to go down 1 directory, but it could be more hops. So in total, to store the 80 IOPS coming out of the cache, you need at least 480 backend IOPS to store it on disk. A normal SATA disk does 90 IOPS, so you see that these backend disks become a bottleneck real quickly. In our simple use case we would need at least 6 drives to make sure we can accommodate the data coming from the cache. If among the read IO, which we didn’t take into account, there are also cache misses, those 6 drives will not be enough. It is really painful and costly to add additional SATA disks to a backend which is only 20% full just to make sure you have the spindles to accommodate the data coming from your Tier 1. This is also why Open vStorage likes the Seagate Kinetic drives. The Kinetic drives don’t have a file system, so for Open vStorage they are an IOPS saver. If you take the same number of SATA drives and Seagate Kinetic drives, the Kinetic drives will outperform the SATA drives in our use case. Open vStorage also supports Ceph and Swift, which use a file system on their OSDs, but this is why we prefer the Kinetic drives: they provide better performance for the same number of drives. The Seagate Kinetic drives really are a valuable asset to our portfolio of supported backends.
  • Replacing broken disks: the trend is to replace bigger chunks of hardware when they fail. Google, a company hyperconverged vendors like to refer to, has been doing it for years: they no longer care about a broken disk and replace complete servers. Large storage clusters are designed to leave dead disks behind, add new nodes and only replace a node once X% of it has failed. You don’t want to go to the datacenter every time a disk fails, but with hyperconverged appliances you simply can’t risk leaving a dead disk behind as you need the spindles for the backend performance. Storage maintenance also means you need to move the VMs off that host, which is always a risk.
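Writing out the back-of-the-envelope calculation from the backend performance bullet above (a sketch only; the numbers are the illustrative ones from the text, not a benchmark):

[code language="python"]
# Back-of-the-envelope: spindles needed in Tier 2 just to absorb cache evictions.
write_iops_4k      = 20_000   # 4K write IOPS arriving in the Tier 1 cache
aggregation        = 250      # ~1MB chunks, i.e. roughly 250 x 4K writes bundled
replication_factor = 3        # 3-way replication on the backend
fs_overhead        = 2        # ~1 extra IO per write for the file system hop
sata_iops          = 90       # sustained IOPS of a single SATA spindle

backend_iops = write_iops_4k / aggregation         # 80 object writes per second
backend_iops *= replication_factor                 # 240 with three copies
backend_iops *= fs_overhead                         # 480 including file system overhead
disks_needed = -(-int(backend_iops) // sata_iops)   # ceil: 6 drives at a minimum

print(int(backend_iops), disks_needed)              # 480 6
[/code]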

So let’s look back at what we learned:

  • The lesson learned from converged solutions is that a single SKU makes sense. You have a single point of contact to praise or blame.
  • The lesson learned from hyperconverged solutions is that having your caching layer inside the host is the best solution. Keeping the compute and read and write IO as close as possible makes sense. Having your cold storage inside the same appliance isn’t a good idea for reasons of scalability and performance.
  • Open vStorage keeps these lessons in mind: it keeps the Tier 1 inside the compute host but allows you to scale storage performance and capacity independently. Using the Seagate Kinetic drives as Tier 2 storage makes sense as it is an easy way to increase the backend storage performance.

To summarize, a converged solution with Tier 1 in the host and a scalable backend on top of Kinetic drives is in every aspect a much better solution than a traditional converged or hyperconverged solution if you want to build a cost-effective, scalable platform. The world has been making hamburgers for more than 100 years and we came to the conclusion that having the french fries on the side is the best option. Putting the fries as topping on the burger is asking for a mess, so in that spirit let’s also not do it with our compute and (cold) storage.

Open vStorage by CloudFounders

In a recent conference call an attendee expressed the following:

There is a real company behind Open vStorage? I thought this was a project done by 2 guys in their basement.

There is a big misconception about open-source projects. Some of these projects are indeed started and maintained by 2 guys in their basement. But on the other hand you see more and more projects to which a couple of hundred people contribute. Take OpenStack as an example: companies such as Red Hat, IBM, HP, Rackspace, SwiftStack, Mirantis, Intel and many more are contributing code to this open-source project and are actually paying people to work on it.

Open vStorage is a similar project backed by a real company: CloudFounders. At CloudFounders we love to build technology. People working for CloudFounders have done this for companies such as Oracle/Sun, Symantec, Didigate/Verizon, Amplidata and many more leading technology companies. We have also been active in the open-source community with projects such as Arakoon, our distributed key-value store.

The technology behind Open vStorage is not something we threw together over the last 6 months by gluing some open-source components together and coating them with a nice management layer. The core technology, which basically turns a bucket on your favorite object store into a raw device, was developed from scratch by the CloudFounders R&D and engineering team. We have been working on the core for more than 4 years. We have used the technology in our commercial product, vRun, but decided the best way forward is to open-source the technology. We believe software-defined storage is too important a piece of the virtualization stack for a proprietary solution that is hypervisor-specific, hardware-specific, management-stack-specific or storage-backend-specific. With Open vStorage we want to build an open and non-proprietary storage layer, but foremost something modular enough to allow developers to innovate on top of Open vStorage.

PS. According to Ohloh, Open vStorage has had 1,384 commits made by 14 contributors representing 55,404 lines of code!