Int32 serialization in OCaml

Today, I’m going to talk a bit about the problem of serializing an int32 in OCaml. As I’m only working on Intel machines, I’m not interested in portability, and prefer little-endian serialization. This should be natural and easy.

The interface

val set32: string -> int -> int32 -> unit
val get32: string -> int -> int32

The microbenchmark

We’re going to store an int32 into a string,retrieve it, and check if it’s the same. We’re going to do this 1_000_000_000 times, see how long it took, and calculate the speed.

let benchmark n =
  let t0 = Unix.gettimeofday() in
  let s = String.create 4 in
  let limit = Int32.of_int n in
  let rec loop i32 =
    if i32 = limit
    then ()
      let () = set32 s 0 i32 in
      let j32 = get32 s 0 in
      assert (i32 = j32);
      loop (Int32.succ i32)
  let () = loop 0l in
  let t1 = Unix.gettimeofday () in
  let d = t1 -. t0 in
  let speed = float n /. d in
  let megaspeed = speed /. 1000000.0 in
  Printf.printf "%i took %f => %fe6/s\n" n d megaspeed

Attempt 0: Naive implementation

This is rather straight forward: mask, extract the char, store, shift and repeat. Retrieving the int32 from the string is the opposite. No rocket surgery here.
This is simple, readable code.

let set32_ocaml s pos (i:int32) =
  let (>:) = Int32.shift_right_logical in
  let (&:) = Int32.logand in
  let mask = Int32.of_int 0xff in
  let to_char v = Char.chr (Int32.to_int v) in
  let too_far = pos + 4 in
  let rec loop p i =
    if p = too_far
    then ()
      let vp = i &: mask in
      let cp = to_char vp in
      let () = s.[p] <- cp in
      loop (p+1) (i >: 8)
  loop pos i

let get32_ocaml s pos =
  let (<:) = Int32.shift_left in
  let (|:) = Int32.logor in
  let to_i32 c = Int32.of_int (Char.code c) in
  let rec loop acc p =
    if p < pos
    then acc
      let cp = s.[p] in
      let vp = to_i32 cp in
      let acc' = (acc <: 8) |: vp in
      loop acc' (p-1)
  loop 0l (pos + 3)

OCaml is a nice high level language, but this bit twiddling feels rather clumsy and ugly.
Anyway, let’s benchmark it.

Strategy Speed
naive OCaml 16.0e6/s

A quick peek at how Thrift does it

let get_byte32 i b = 255 land (Int32.to_int (Int32.shift_right i (8*b)))
class trans = object(self)
  val ibyte = String.create 8
  method writeI32 i =
    let gb = get_byte32 i in
    for i=0 to 3 do
      ibyte.[3-i] <- char_of_int (gb i)
    trans#write ibyte 0 4

Ok, this uses the same strategy; but there’s a for loop there. The conversion is done in the ibyte buffer and then copied along. It’s a bit sub-awesome, but the extra copy of 4 bytes shouldn’t be too costly neither.

Attempt 1: But in C, it would be way faster

It’s a platitude I hear a lot, but in this case, it really should be faster. After all, if you want to retrieve an int32 from a string, all you need to do is to cast the char* to an int32_t* and de-reference the value.

Let’s try this:

external set32 : string -> int -> int32 -> unit = "zooph_set32"
external get32 : string -> int -> int32         = "zooph_get32"
#include <stdint.h>
#include <stdio.h>
#include <caml/alloc.h>
#include <caml/memory.h>
#include <caml/mlvalues.h>

value zooph_set32(value vs, value vpos, value vi){
  CAMLparam3(vs, vpos, vi);
  char* buf = String_val(vs);
  int pos = Int_val(vpos);
  int32_t i = Int32_val(vi);

  char* buf_off = &buf[pos];
  int32_t* casted = (int32_t*)buf_off;
  casted[0] = i;
  CAMLreturn (Val_unit);

value zooph_get32(value vs,value vpos){
    char* buf = String_val(vs);
    int pos = Int_val(vpos);
    char* buf_off = &buf[pos];
    int32_t* casted = (int32_t*)buf_off;
    int32_t i32 = casted[0];
    result = caml_copy_int32(i32);

I called my compilation unit zooph.c an onomatopoeia that pays tribute to how fast I expect this to be. There’s no loop, and the machine has the skills to do the transformation in one step. So it should roughly be about 4 times faster.
Let’s benchmark it.

Strategy Speed
naive OCaml 16.0e6
C via FFI 32.3e6

Hm… it’s faster allright, but it’s also a bit disappointing. So what went wrong?

A quick look at the assembly code reveals a lot:

	movl	8(%rdx), %eax
	sarq	%rsi
	movslq	%esi, %rsi
	movl	%eax, (%rdi,%rsi)
	movl	$1, %eax
	.size	zooph_set32, .-zooph_set32
	.p2align 4,,15
	.globl	zooph_get32
	.type	zooph_get32, @function
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movq	%rsi, %rdx
	sarq	%rdx
	subq	$160, %rsp
	.cfi_def_cfa_offset 176
	movslq	%edx, %rdx
	movq	caml_local_roots(%rip), %rbx
	leaq	8(%rsp), %rcx
	movq	%rdi, 8(%rsp)
	movl	(%rdi,%rdx), %edi
	movq	%rsi, (%rsp)
	movq	$1, 32(%rsp)
	movq	%rcx, 40(%rsp)
	leaq	(%rsp), %rcx
	movq	%rbx, 16(%rsp)
	movq	$2, 24(%rsp)
	movq	$0, 152(%rsp)
	movq	%rcx, 48(%rsp)
	leaq	16(%rsp), %rcx
	movq	$1, 96(%rsp)
	movq	$1, 88(%rsp)
	movq	%rcx, 80(%rsp)
	leaq	80(%rsp), %rcx
	movq	%rcx, caml_local_roots(%rip)
	leaq	152(%rsp), %rcx
	movq	%rcx, 104(%rsp)
	call	caml_copy_int32
	movq	%rbx, caml_local_roots(%rip)
	addq	$160, %rsp
	.cfi_def_cfa_offset 16
	popq	%rbx
	.cfi_def_cfa_offset 8

While zooph_set32 seems to be in order, its counter part is rather messy. On closer inspection, not even the set32 side is optimal. OCaml’s FFI allows smooth (at least compared to jni) interaction with native code in other languages, it also installs a firm border across which no inlining is possible (not with OCaml that is).

Let’s take a look at how the benchmark code calls this.

	movq	%rbx, 8(%rsp)
	movq	%rax, 0(%rsp)
	movq	$1, %rsi
	movq	16(%rbx), %rdi
	movq	%rax, %rdx
	movq	zooph_set32@GOTPCREL(%rip), %rax
	call	caml_c_call@PLT
	movq	caml_young_ptr@GOTPCREL(%rip), %r11
	movq    (%r11), %r15
	movq	$1, %rsi
	movq	8(%rsp), %rax
	movq	16(%rax), %rdi
	movq	zooph_get32@GOTPCREL(%rip), %rax
	call	caml_c_call@PLT
	movq	caml_young_ptr@GOTPCREL(%rip), %r11
	movq    (%r11), %r15
	movslq	8(%rax), %rax
	movq	0(%rsp), %rdi
	movslq	8(%rdi), %rbx
	cmpq	%rax, %rbx
	je	.L176

You see stuff being pushed on the stack before the call. For raw speed, you don’t want this to happen. For raw speed, you don’t even want a call.
To get there, you need to translate the benchmark to C too. I’m not going to bother, because I have another trick ready.

Attempt 2: OCaml 4.01 primitives

OCaml 4.01 got released recently, and there’s a little entry in the release notes.

PR#5771: Add primitives for reading 2, 4, 8 bytes in strings and bigarrays
(Pierre Chambart)

However, for some reason, they are not really exposed, and I had to dig to find them. Using them however is trivial.

external get32_prim : string -> int -> int32         = "%caml_string_get32"
external set32_prim : string -> int -> int32 -> unit = "%caml_string_set32"

That’s all there is to it. Basically, you say that you know that the compiler knows how to do this, and that from now on, you want to do that too.
Let’s benchmark it.

Strategy Speed
naive OCaml 16.0e6
C via FFI 32.3e6
OCaml with primitives 139e6


Closing words

I’ve put the code for this on github: Anyway, we need to (de)serialize int64 values as well. Determining the speedup there is left as an exercise for the reader (tip: it’s even better).

I think some people will feel the urge to apply this to their serialization code as well.

Have fun,