SCOs , chunks & fragments

For frequent readers it is stating the obvious to say that ALBA is a complex piece of software. One of the most dark caves of the ALBA OCaml code is the one where SCOs, the objects coming from the Volume Driver, are split into objects. These objects are subsequently stored on the ASDs in the ALBA backend. It is time to clear up the mist around policies, SCOs, chunks and fragments as uncareful setting of these values might result in performance loss or an explosion of the backend metadata.

The fragment basics

Open vStorage uses an append-only strategy for data written to a volume. Once enough data is accumulated, the Volume Driver hands the log-file, a SCO (Storage Container Object), over to the ALBA proxy. This ALBA proxy is responsible for encrypting, compressing and erasure coding or replicating the SCOs based upon the selected preset. One important part of the preset is the policy (k, m, c, x). These 4 numbers can have a great influence on the performance of your Open vStorage cluster. But for starters, let’s first recap the meaning of these 4 numbers:

  • k: the amount of data fragments
  • m: the amount of parity fragments
  • c: the minimum number of fragments been written before the write is acknowledged
  • x: the maximum number of fragments per storage node

When c is lower than k+m, one or more slow responding ASDs won’t have impact on the write performance to the backend. The fragments which should have been stored on the slow ASD(s) will simply be rewritten at a later point in time by the maintenance process.

This was the easy part of how these numbers can influence the performance. Now comes the hard part. When you have a SCO of let’s say 64MB it is according to the policy split into k data objects and m parity objects. Assume k is set to 8 and hence we should end up with 8 objects of 8MB. There is however another (hidden) value which plays a role: the maximum fragment size. The fragment size does have an impact on the write performance as larger fragments tend to provide higher write bandwidth to the underlying hard disk. It is not a secret that traditional SATA disks love large pieces of consecutive data to write. But on the other hand, the bigger the fragments are, the less relevant they are to cache in the fragment cache and the longer it takes to read them from the backend in case of cache misses. To summarize, the size of the fragments should be big but not too big.

So to make sure fragments are not too big you can set a maximum fragment size. The default maximum fragment size is 4MB. As the fragment size in the example above was 8MB and the maximum fragment size for the backend is only 4MB something will need to happen: chunking. Chunking splits large SCOs into smaller chunks so the fragments of these chunks are smaller than the maximum fragment size. So in our example above the SCO will be split in smaller chunks. To calculate the amount of chunks needed, a simple formula can be used:

Amount of chunks = ROUNDUP(SCO size/min(k*maximum fragment size,SCO size))

In the our example we end up with 2 chunks – roundup(64/min(8*4,64). These 2 chunks are next erasure coded using the (k, m, c, x) policy. Basically you end up with 2 chunks of 8 4MB fragments and per chunk an additional m parity fragments.

Global Backends

So far we only covered the fragment basics so let’s make it a bit more complex by introducing stacked backends. Open vStorage allows multiple local backends to be combined into a global backend. This means there are 2 sets of fragments: the fragments at the global level and at the local level. Let’s continue with our previous example where we had 64MB SCOs and a 4MB fragment size. This means that the fragments which serve as input for the local backends are only 4MB. Assume that we also configure erasure coding with policy (k’, m’, c’, x’) at the local backend level. In that case each 4MB fragment will be split into another k’ fragments and m’ parity fragments. If k’ is for example set to 8, you will end up with 512KB fragments. There are 2 issues with this relatively small size of the fragments. The first issue was already outlined above. Traditional SATA drives are optimized for large chunks of consecutive data and 512KB is probably too small to reach the hard disks’ write bandwidth limit. This means we have suboptimal write performance. The second issue is related to the metadata size. Each object in the ALBA backend is referenced by metadata and in order to optimize the performance all metadata should be kept in RAM. Hence it is essential to keep the data/metadata ratio as high as possible in order to keep the required RAM to address the whole backend under control. In the above example with an (8, 2, c, x) policy for both the global and local backend we would end up with around 10KB of metadata for every 64MB SCO. With an optimal selection of the global policy (4,1, c, x) and a maximum fragment size of 16MB on the global backend, the metadata for the same SCO is only 5KB. This means that with the same amount of RAM reserved for the metadata, twice the amount of backend storage can be addressed. Next to storing the metadata in RAM, the metadata is also persistently store d on disk (NVMe, SSD) in an Arakoon cluster. By default Arakoon uses a 3-way replication scheme so with the optimized settings the metadata will occupy 6 time less disk space. The optimal global policy of (4,1, c, x) will, next to a lower memory footprint for the metadata, also provide better performance as 4MB fragments are written to the SATA drives instead of the smaller 512KB fragments.


Whatever you decide as ABLA backend policy, SCO size and maximum fragment size, choose wisely as these values have an impact on various aspects of the Open vstorage cluster ranging from performance to Total Cost of Ownership (TCO).