I like to move it, move it

The vibe at the Open vStorage office these days is best explained by a song from the early nineties:

I like to move it, move it ~ Reel 2 Real

While summer is a quieter period at most companies, the Open vStorage office is buzzing like a beehive. Allow me to give you a short overview of what is happening:

  • We are moving into our new, larger and stylish offices. The address remains the same but we are moving into a completely remodeled floor of the Idola business center.
  • Besides physically moving desks at the Open vStorage HQ, we are also moving our code from BitBucket to GitHub. We have centralized all our code under https://github.com/openvstorage. To list a few of the projects: Arakoon (our consistent distributed key-value store), ALBA (the Open vStorage default ALternate BAckend) and of course Open vStorage itself. Go check it out!
  • Finishing up our Open vStorage 2.2 GA release.
  • Adding support for Red Hat and CentOS by merging in the CentOS branch. There is still some work to do around packaging, testing and upgrades, so feel free to lend a hand. As this was really a community effort, we owe everyone a big thank you.
  • Working on some very cool features (RDMA anyone?) but let’s keep those for a separate post.
  • Preparing for VMworld (San Francisco) and the OpenStack Summit in Tokyo.

As you can see, there are many things going on at once, so prepare for a hot Open vStorage fall!

Int32 serialization in OCaml

Today, I’m going to talk a bit about the problem of serializing an int32 in OCaml. As I’m only working on Intel machines, I’m not interested in portability, and prefer little-endian serialization. This should be natural and easy.

The interface

val set32: string -> int -> int32 -> unit
val get32: string -> int -> int32
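
To make the little-endian contract concrete, a round trip should behave as sketched below (this assumes the implementations further down; 0x11223344l is just an illustrative value):

let () =
  let s = String.make 4 '\000' in
  set32 s 0 0x11223344l;
  assert (s.[0] = '\x44');            (* least significant byte comes first *)
  assert (s.[3] = '\x11');
  assert (get32 s 0 = 0x11223344l)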

The microbenchmark

We’re going to store an int32 into a string, retrieve it, and check that it’s the same. We’re going to do this 1_000_000_000 times, see how long it takes, and calculate the speed.

let benchmark n =
  let t0 = Unix.gettimeofday() in
  let s = String.create 4 in
  let limit = Int32.of_int n in
  let rec loop i32 =
    if i32 = limit
    then ()
    else
      let () = set32 s 0 i32 in
      let j32 = get32 s 0 in
      assert (i32 = j32);
      loop (Int32.succ i32)
  in
  let () = loop 0l in
  let t1 = Unix.gettimeofday () in
  let d = t1 -. t0 in
  let speed = float n /. d in
  let megaspeed = speed /. 1000000.0 in
  Printf.printf "%i took %f => %fe6/s\n" n d megaspeed
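
The numbers below come from driving this with the iteration count mentioned above; a minimal driver (a sketch, the exact harness may differ) is simply:

let () = benchmark 1_000_000_000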


Attempt 0: Naive implementation

This is rather straightforward: mask, extract the char, store, shift and repeat. Retrieving the int32 from the string is the opposite. No rocket surgery here.
This is simple, readable code.

let set32_ocaml s pos (i:int32) =
  let (>:) = Int32.shift_right_logical in
  let (&:) = Int32.logand in
  let mask = Int32.of_int 0xff in
  let to_char v = Char.chr (Int32.to_int v) in
  let too_far = pos + 4 in
  let rec loop p i =
    if p = too_far
    then ()
    else
      let vp = i &: mask in
      let cp = to_char vp in
      let () = s.[p] <- cp in
      loop (p+1) (i >: 8)
  in
  loop pos i


let get32_ocaml s pos =
  let (<:) = Int32.shift_left in
  let (|:) = Int32.logor in
  let to_i32 c = Int32.of_int (Char.code c) in
  let rec loop acc p =
    if p < pos
    then acc
    else
      let cp = s.[p] in
      let vp = to_i32 cp in
      let acc' = (acc <: 8) |: vp in
      loop acc' (p-1)
  in
  loop 0l (pos + 3)

OCaml is a nice high-level language, but this bit twiddling feels rather clumsy and ugly.
Anyway, let’s benchmark it.

Strategy       Speed
naive OCaml    16.0e6/s

A quick peek at how Thrift does it

let get_byte32 i b = 255 land (Int32.to_int (Int32.shift_right i (8*b)))
class t trans = object(self)
  val ibyte = String.create 8
  ...
  method writeI32 i =
    let gb = get_byte32 i in
    for i=0 to 3 do
      ibyte.[3-i] <- char_of_int (gb i)
    done;
    trans#write ibyte 0 4

OK, this uses the same strategy, but there’s a for loop there. The conversion is done in the ibyte buffer and then copied along. It’s a bit sub-awesome, but the extra copy of 4 bytes shouldn’t be too costly either.

Attempt 1: But in C, it would be way faster

It’s a platitude I hear a lot, but in this case it really should be faster. After all, if you want to retrieve an int32 from a string, all you need to do is cast the char* to an int32_t* and dereference the value.

Let’s try this:

external set32 : string -> int -> int32 -> unit = "zooph_set32"
external get32 : string -> int -> int32         = "zooph_get32"

#include <stdint.h>
#include <stdio.h>
#include <caml/alloc.h>
#include <caml/memory.h>
#include <caml/mlvalues.h>

value zooph_set32(value vs, value vpos, value vi){
  CAMLparam3(vs, vpos, vi);
  char* buf = String_val(vs);
  int pos = Int_val(vpos);
  int32_t i = Int32_val(vi);

  char* buf_off = &buf[pos];
  int32_t* casted = (int32_t*)buf_off;
  casted[0] = i;
  CAMLreturn (Val_unit);
}

value zooph_get32(value vs,value vpos){
    CAMLparam2(vs,vpos);
    CAMLlocal1(result);
    char* buf = String_val(vs);
    int pos = Int_val(vpos);
    char* buf_off = &buf[pos];
    int32_t* casted = (int32_t*)buf_off;
    int32_t i32 = casted[0];
    result = caml_copy_int32(i32);
    CAMLreturn(result);
}

I called my compilation unit zooph.c, an onomatopoeia that pays tribute to how fast I expect this to be. There’s no loop, and the machine has the skills to do the transformation in one step. So it should be roughly 4 times faster.
Let’s benchmark it.

Strategy       Speed
naive OCaml    16.0e6/s
C via FFI      32.3e6/s

Hm… it’s faster all right, but it’s also a bit disappointing. So what went wrong?

A quick look at the assembly code reveals a lot:

zooph_set32:
.LFB34:
	.cfi_startproc
	movl	8(%rdx), %eax
	sarq	%rsi
	movslq	%esi, %rsi
	movl	%eax, (%rdi,%rsi)
	movl	$1, %eax
	ret
	.cfi_endproc
.LFE34:
	.size	zooph_set32, .-zooph_set32
	.p2align 4,,15
	.globl	zooph_get32
	.type	zooph_get32, @function
zooph_get32:
.LFB35:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movq	%rsi, %rdx
	sarq	%rdx
	subq	$160, %rsp
	.cfi_def_cfa_offset 176
	movslq	%edx, %rdx
	movq	caml_local_roots(%rip), %rbx
	leaq	8(%rsp), %rcx
	movq	%rdi, 8(%rsp)
	movl	(%rdi,%rdx), %edi
	movq	%rsi, (%rsp)
	movq	$1, 32(%rsp)
	movq	%rcx, 40(%rsp)
	leaq	(%rsp), %rcx
	movq	%rbx, 16(%rsp)
	movq	$2, 24(%rsp)
	movq	$0, 152(%rsp)
	movq	%rcx, 48(%rsp)
	leaq	16(%rsp), %rcx
	movq	$1, 96(%rsp)
	movq	$1, 88(%rsp)
	movq	%rcx, 80(%rsp)
	leaq	80(%rsp), %rcx
	movq	%rcx, caml_local_roots(%rip)
	leaq	152(%rsp), %rcx
	movq	%rcx, 104(%rsp)
	call	caml_copy_int32
	movq	%rbx, caml_local_roots(%rip)
	addq	$160, %rsp
	.cfi_def_cfa_offset 16
	popq	%rbx
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc

While zooph_set32 seems to be in order, its counterpart is rather messy. On closer inspection, not even the set32 side is optimal. While OCaml’s FFI allows smooth interaction with native code in other languages (at least compared to JNI), it also installs a firm border across which no inlining is possible (not with OCaml, that is).

Let’s take a look at how the benchmark code calls this.

.L177:
	movq	%rbx, 8(%rsp)
	movq	%rax, 0(%rsp)
	movq	$1, %rsi
	movq	16(%rbx), %rdi
	movq	%rax, %rdx
	movq	zooph_set32@GOTPCREL(%rip), %rax
	call	caml_c_call@PLT
.L179:
	movq	caml_young_ptr@GOTPCREL(%rip), %r11
	movq    (%r11), %r15
	movq	$1, %rsi
	movq	8(%rsp), %rax
	movq	16(%rax), %rdi
	movq	zooph_get32@GOTPCREL(%rip), %rax
	call	caml_c_call@PLT
.L180:
	movq	caml_young_ptr@GOTPCREL(%rip), %r11
	movq    (%r11), %r15
	movslq	8(%rax), %rax
	movq	0(%rsp), %rdi
	movslq	8(%rdi), %rbx
	cmpq	%rax, %rbx
	je	.L176

You see stuff being pushed on the stack before the call. For raw speed, you don’t want this to happen. For raw speed, you don’t even want a call.
To get there, you need to translate the benchmark to C too. I’m not going to bother, because I have another trick ready.

Attempt 2: OCaml 4.01 primitives

OCaml 4.01 got released recently, and there’s a little entry in the release notes.

PR#5771: Add primitives for reading 2, 4, 8 bytes in strings and bigarrays
(Pierre Chambart)

However, for some reason they are not really exposed, and I had to dig to find them. Using them, though, is trivial.

external get32_prim : string -> int -> int32         = "%caml_string_get32"
external set32_prim : string -> int -> int32 -> unit = "%caml_string_set32"

That’s all there is to it. Basically, you say that you know that the compiler knows how to do this, and that from now on, you want to do that too.
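
As a quick sanity check, the primitives plug straight into the earlier round trip (a sketch; the value is just illustrative). They read and write in the machine's native byte order, which on the Intel machines targeted here means little-endian:

let () =
  let s = String.create 4 in
  set32_prim s 0 0x12345678l;
  assert (get32_prim s 0 = 0x12345678l)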
Let’s benchmark it.

Strategy                Speed
naive OCaml             16.0e6/s
C via FFI               32.3e6/s
OCaml with primitives   139e6/s

Wow.

Closing words

I’ve put the code for this on GitHub: https://github.com/toolslive/int32_blog. Anyway, we need to (de)serialize int64 values as well. Determining the speedup there is left as an exercise for the reader (tip: it’s even better).
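
For readers who want to try that exercise, the analogous 64-bit declarations would presumably look like this (a sketch, assuming the primitive names follow the same PR#5771 pattern as the 32-bit ones):

external get64_prim : string -> int -> int64         = "%caml_string_get64"
external set64_prim : string -> int -> int64 -> unit = "%caml_string_set64"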

I think some people will feel the urge to apply this to their serialization code as well.

Have fun,

Romain.

Open vStorage 2.2 alpha 4

We released Open vStorage 2.2 Alpha 4, which contains the following bugfixes:

  • Update of the About section under Administration.
  • Open vStorage Backend detail page hangs in some cases.
  • Various bugfixes for the case where a vPool is added with a vPool name that was previously used.
  • Hardening the vPool removal.
  • Fix daily scrubbing not running.
  • No log output from the scrubber.
  • Failing to create a vDisk from a snapshot tries to delete the snapshot.
  • ALBA discovery starts spinning if the network is not available.
  • ASD is no longer used by the proxy even after it has been requalified.
  • Type checking through Descriptor doesn’t work consistently.

Open vStorage 2.2 alpha 3

Today we released Open vStorage 2.2 alpha 3. The only new features are on the Open vStorage Backend (ALBA) front:

  • Metadata is now stored with a higher protection level.
  • The protocol of the ASD is now more flexible in the light of future changes.

Bugfixes:

  • Make it mandatory to configure both the read and write cache during the partitioning step of ovs setup.
  • During add_vpool on devstack, the cinder.conf is updated with notification_driver which is incorrectly set as “nova.openstack.common.notifier.rpc_notifier” for Juno.
  • Added support for more physical disk configuration layouts.
  • ClusterNotReachableException during vPool changes.
  • Cannot extend vPool with volumes running.
  • Update button clickable when an update is ongoing.
  • Already configured storage nodes are now removed from the discovered ones.
  • Fix for ASDs which don’t start.
  • Issue where a slow long-running task could fail because of a timeout.
  • Message delivery from albamgr to nsm_host can get stuck.
  • Fix for ALBA reporting that a namespace doesn’t exist while it does exist.

Open vStorage 2.2 alpha 2

As promised in our latest release update, we are doing more frequent releases. Et voilà: today we release a new alpha version of the upcoming GA release. If possible, we will provide new versions from now on as an update so that you don’t have to reinstall your Open vStorage environment. As this new version is the first one with the update/upgrade functionality, there is no update possible between alpha 1 and alpha 2.

What is new in Open vStorage 2.2 alpha 2:

  • Upgrade functionality: Under the Administration section you can check for updates of the Framework, the Volumedriver and the Open vStorage Backend and apply them. For the moment, an update might require all VMs to be shut down.
  • Support for non-identical hardware layouts: You can now mix hardware which doesn’t have the same number of SSDs or PCIe flash cards. When extending a vPool to a new Storage Router you can select which devices to use as cache.

Small Features:

  • The Backend policy which defines how SCOs are stored on the Backend can now be changed. The wizard is straightforward in case you want to set, for example, 3-way replication.
  • Rebalancing of the Open vStorage Backend, moving data from disks which are almost full to new disks to make sure all disks are evenly used, is now a service which can be enabled and disabled.
  • Audit trails are no longer stored in the model but in a log.

Bug fixes:

  • ASD raises time out or stops under heavy load.
  • Extending a vPool to a new Storage Router doesn’t require the existing vMachines on the vPool to be stopped.
  • FailOverCache does not exit but hangs in accept in some cases.
  • Removing a vPool raises a ClusterNotReachableException.
  • Add a logrotate entry for /var/log/arakoon/*/*.log.
  • Error in vMachine detail (refreshVDisks is not found).
  • Arakoon 1.8.4 rpm doesn’t work on CentOS7.
  • Arakoon catchup is quite slow.
  • The combination of OpenStack with multiple vPools and live migration does not work properly in some cases.

vDisks, vMachines, vPools and Backends: how does it all fit together

With the latest version of Open vStorage, we released the option to use physical SATA disks as a storage backend for Open vStorage. These disks can be inside the hypervisor host (hyper-converged) or in a storage server, an x86 server with SATA disks*. Together with this functionality we introduced some new terminology, so we thought it would be a good idea to give an overview of how it all fits together.

vPool - Backend

Let’s start from the bottom, the physical layer, and work up to the virtual layer. For the sake of simplicity we assume a host has one or more SATA drives. With the hyper-converged version (openvstorage-hc) you can unlock functionality to manage these drives and assign them to a Backend. Basically, a Backend is nothing more than a collection of physical disks grouped together. These disks can even be spread across multiple hosts. The implementation even gives you the freedom to assign some physical disks in a host to one Backend and the other disks to a second Backend. The only limitation is that a disk can only be assigned to a single Backend at a time. The benefit of the Backend concept is that it allows you to separate customers even at the physical disk level by assigning each of them their own set of hard drives.

On top of a Backend you create one or more vPools. You could compare this with creating a LUN on a traditional SAN. The split between a Backend and a vPool gives you additional flexibility:

  • You can assign a vPool per customer or per department. This means you can for example bill per used GB.
  • You can set a different encryption passphrase per vPool.
  • You can enable or disable compression per vPool.
  • You can set the replication factor per vPool. For example, for a “test” vPool you can store only a single copy of the Storage Container Objects (SCOs), while for production servers you can configure 3-way replication.

On top of the vPool, configured as a Datastore in VMware or a mountpoint in KVM, you can create vMachines. Each vMachine can have multiple vDisks (virtual disks). For now, vmdk and raw files are supported.
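
To summarize how the concepts nest, here is a purely illustrative sketch in OCaml-style types (these are not Open vStorage types; the names are made up for clarity):

(* illustrative only: the hierarchy from physical to virtual *)
type disk     = { host : string; device : string }       (* a physical SATA drive *)
type backend  = { disks : disk list }                     (* disks, possibly spread across hosts *)
type vpool    = { backend : backend;                      (* one Backend can serve several vPools *)
                  replication_factor : int;               (* e.g. 1 for a test vPool, 3 for production *)
                  compression : bool;
                  encryption_passphrase : string option }
type vdisk    = { file : string }                         (* a vmdk or raw file *)
type vmachine = { vpool : vpool; vdisks : vdisk list }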

*A quick how-to for converting a server with only SATA drives into a storage server you can use as a Backend:
On the hypervisor host:

Install Open vStorage (apt-get install openvstorage-hc) as explained in the documentation and run the ovs setup command.

On the storage server:

  • The storage server must have an OS disk and at least 3 empty disks for the storage backend.
  • Set up the storage server with Ubuntu 14.04.2 LTS. Make sure the storage server is in the same network as the compute host.
  • Execute the following to set up the program that will manage the disks of the storage server:
    echo "deb http://apt-ovs.cloudfounders.com beta/" > /etc/apt/sources.list.d/ovsaptrepo.list
    apt-get update
    apt-get install openvstorage-sdm
  • Retrieve the automatically generated password from the config:
    cat /opt/alba-asdmanager/config/config.json
    ...
    "password": "u9Q4pQ76e0VJVgxm6hasdfasdfdd",
    ...

On the hypervisor host:

  • Log in and go to Backends.
  • Click the add Backend button, specify a name and select Open vStorage Backend as type. Click Finish and wait until the status becomes green.
  • In the Backend details page of the freshly created Backend, click the Discover button. Wait until the storage server pops up and click on the + icon to add it. When asked for credentials use root as login and the password retrieved above.
  • Next, follow the standard procedure to claim the disks of the storage server and add them to the Backend.