SCOs, chunks & fragments

For frequent readers it is stating the obvious to say that ALBA is a complex piece of software. One of the darkest caves of the ALBA OCaml code is the one where SCOs, the objects coming from the Volume Driver, are split into smaller objects. These objects are subsequently stored on the ASDs in the ALBA backend. It is time to clear up the mist around policies, SCOs, chunks and fragments, as careless configuration of these values can result in performance loss or an explosion of the backend metadata.

The fragment basics

Open vStorage uses an append-only strategy for data written to a volume. Once enough data has accumulated, the Volume Driver hands the log file, a SCO (Storage Container Object), over to the ALBA proxy. The ALBA proxy is responsible for encrypting, compressing and erasure coding or replicating the SCOs based upon the selected preset. One important part of the preset is the policy (k, m, c, x). These 4 numbers can have a great influence on the performance of your Open vStorage cluster. But first, let's recap the meaning of these 4 numbers:

  • k: the amount of data fragments
  • m: the amount of parity fragments
  • c: the minimum number of fragments that must be written before the write is acknowledged
  • x: the maximum number of fragments per storage node

When c is lower than k+m, one or more slow responding ASDs won't have an impact on the write performance to the backend. The fragments which should have been stored on the slow ASD(s) will simply be rewritten at a later point in time by the maintenance process.
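To make the slack between acknowledgement and full redundancy concrete, here is a minimal sketch. It is not taken from the ALBA code base; it simply computes, for a given policy, how many ASDs are allowed to lag behind at write time (the c value in the example is made up for illustration):

```python
def slow_asds_tolerated(k: int, m: int, c: int) -> int:
    """Number of fragment writes that may still be outstanding when a SCO
    write is acknowledged. Illustrative only, not the ALBA implementation;
    it assumes the policy satisfies k <= c <= k + m."""
    assert k <= c <= k + m, "c must lie between k and k + m"
    return (k + m) - c

# With the (8, 2, c, x) policy used later in this post and c = 9 (an example
# value), one slow ASD can lag behind; the maintenance process rewrites the
# missing fragment later.
print(slow_asds_tolerated(8, 2, 9))  # 1
```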

This was the easy part of how these numbers can influence performance. Now comes the hard part. Take a SCO of, let's say, 64MB: according to the policy it is split into k data fragments and m parity fragments. Assume k is set to 8; we should hence end up with 8 fragments of 8MB. There is however another (hidden) value which plays a role: the maximum fragment size. The fragment size has an impact on the write performance, as larger fragments tend to provide higher write bandwidth to the underlying hard disk. It is no secret that traditional SATA disks love large pieces of consecutive data to write. On the other hand, the bigger the fragments, the less relevant they are to cache in the fragment cache and the longer it takes to read them from the backend in case of a cache miss. To summarize, fragments should be big, but not too big.

So to make sure fragments don't grow too big, you can set a maximum fragment size. The default maximum fragment size is 4MB. Since the fragment size in the example above was 8MB and the maximum fragment size for the backend is only 4MB, something needs to happen: chunking. Chunking splits large SCOs into smaller chunks so that the fragments of these chunks stay below the maximum fragment size. In our example the SCO will thus be split into smaller chunks. To calculate the number of chunks needed, a simple formula can be used:

number of chunks = ROUNDUP(SCO size / MIN(k × maximum fragment size, SCO size))

In our example we end up with 2 chunks: ROUNDUP(64 / MIN(8 × 4, 64)) = 2. These 2 chunks are then erasure coded using the (k, m, c, x) policy. You basically end up with 2 chunks of 8 data fragments of 4MB each, plus m parity fragments per chunk.
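The arithmetic is easy to play with in a few lines of code. The sketch below is not from the ALBA code base and the function name is made up; it merely reproduces the formula and the example above:

```python
import math

def chunk_layout(sco_size_mb, k, m, max_fragment_mb):
    """Reproduce the chunking formula: how many chunks a SCO is split into,
    how big the resulting data fragments are and how many fragments are
    written to the backend in total."""
    chunk_size = min(k * max_fragment_mb, sco_size_mb)
    chunks = math.ceil(sco_size_mb / chunk_size)
    fragment_size = chunk_size / k
    total_fragments = chunks * (k + m)
    return chunks, fragment_size, total_fragments

# 64MB SCO, policy (8, 2, c, x), 4MB maximum fragment size:
# 2 chunks, 4MB data fragments, 20 fragments (16 data + 4 parity) in total.
print(chunk_layout(64, 8, 2, 4))  # (2, 4.0, 20)
```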

Global Backends

So far we only covered the fragment basics, so let's make it a bit more complex by introducing stacked backends. Open vStorage allows multiple local backends to be combined into a global backend. This means there are 2 sets of fragments: the fragments at the global level and those at the local level. Let's continue with our previous example where we had 64MB SCOs and a 4MB maximum fragment size. This means that the fragments which serve as input for the local backends are only 4MB. Assume that we also configure erasure coding with policy (k', m', c', x') at the local backend level. In that case each 4MB fragment will be split into another k' data fragments and m' parity fragments. If k' is for example set to 8, you will end up with 512KB fragments.

There are 2 issues with this relatively small fragment size. The first issue was already outlined above: traditional SATA drives are optimized for large chunks of consecutive data, and 512KB is probably too small to reach the hard disk's write bandwidth limit, so write performance is suboptimal. The second issue is related to the metadata size. Each object in the ALBA backend is referenced by metadata, and in order to optimize performance all metadata should be kept in RAM. Hence it is essential to keep the data/metadata ratio as high as possible in order to keep the RAM required to address the whole backend under control.

In the above example, with an (8, 2, c, x) policy for both the global and the local backend, we would end up with around 10KB of metadata for every 64MB SCO. With a better choice of global policy, (4, 1, c, x), and a maximum fragment size of 16MB on the global backend, the metadata for the same SCO is only 5KB. This means that with the same amount of RAM reserved for the metadata, twice the amount of backend storage can be addressed. Next to being kept in RAM, the metadata is also persistently stored on disk (NVMe, SSD) in an Arakoon cluster. By default Arakoon uses a 3-way replication scheme, so with the optimized settings the metadata will occupy 6 times less disk space. The (4, 1, c, x) global policy will, next to a lower memory footprint for the metadata, also provide better performance, as 4MB fragments are written to the SATA drives instead of the smaller 512KB fragments.
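To see where the 512KB and 4MB figures come from, here is a small sketch of the two-level fragment arithmetic. It is illustrative only: compression and per-fragment overhead are ignored, and the local k' used in the second call is an assumption made to reproduce the 4MB figure mentioned above.

```python
def local_fragment_size_mb(sco_size_mb, k_global, max_frag_global_mb, k_local):
    """Size of the fragments that eventually land on the local backend's
    disks in a global-over-local (stacked) ALBA backend."""
    global_chunk = min(k_global * max_frag_global_mb, sco_size_mb)
    global_fragment = global_chunk / k_global
    return global_fragment / k_local

# Global (8, 2, c, x) with 4MB max fragments, local k' = 8: 0.5MB on the SATA drives.
print(local_fragment_size_mb(64, 8, 4, 8))   # 0.5
# Global (4, 1, c, x) with 16MB max fragments, local k' = 4 (assumed): 4MB fragments.
print(local_fragment_size_mb(64, 4, 16, 4))  # 4.0
```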

Conclusion

Whatever you decide as ALBA backend policy, SCO size and maximum fragment size, choose wisely, as these values have an impact on various aspects of the Open vStorage cluster, ranging from performance to Total Cost of Ownership (TCO).

The Storage and Networking DNA

Today I want to discuss a less technical but more visionary assertion: the idea that storage and networking share the same DNA. This is especially the case for large scale deployments. This insight came to mind again as the rumor of Cisco buying NetApp resurfaced. Allow me to explain why I believe exabyte storage clusters and large scale networks have a lot in common.

The parallels between storage and networking

The first feature networks and exabyte storage clusters share is that they are highly scalable. Both typically start small and grow over time. Capacity can be added seamlessly by adding more hardware to the cluster, allowing higher bandwidth, more capacity and more users to be served.

Downtime is typically unacceptable for both, and SLAs that guarantee multi-nine availability are common. To achieve this level of availability, both rely on hyper-meshed, shared-nothing architectures. These highly redundant architectures ensure that if one component fails, another component takes over. To illustrate: switches are typically deployed redundantly, with a single server connected to 2 independent switches. If one switch fails, the other takes over. The same holds for storage: data is stored redundantly, through replication or erasure coding across multiple disks and servers. If a disk or server fails, the data can still be retrieved from the other disks and servers in the storage cluster.

These days you can check your Facebook timeline or Twitter account from almost anywhere in the world. Large scale networks make this possible: they span the globe and interlink many smaller networks. The same holds for storage, as we are moving to a world where data is stored in geographically dispersed places and even in different clouds.

With new technologies like Software-Defined Networking (SDN), network management has moved towards a single point of governance. The physical network can be configured at a high level while the detailed network topology is pushed down to the physical and virtual devices that make up the network. The same trend is happening in the storage industry with Software-Defined Storage (SDS). These software applications allow administrators to configure and manage the physical hardware in the storage cluster, across multiple devices, sites and even different clouds, through a single high-level management view.

A last point I'd like to touch on is that for networking, hardware brands and models hardly matter, as network standards let them all work together. The same goes for storage hardware: different brands of disks, controllers and servers can all be used to build an exabyte storage cluster. Users of a network are not aware of its exact topology (brands, links, routing, ...). The same holds for storage: the user shouldn't need to know on which disk his data is stored exactly; the only thing he or she cares about is getting the right data on time when needed and knowing it is safely stored.

Open vStorage, taking the network analogy to the next step

Let's have a look at the components of a typical network. First there is the consumer of the network, in this case a server. This server is physically connected to the network through a Network Interface Controller (NIC). A NIC driver provides the necessary interfaces for the applications on the server to use the network. Data sent over the network traverses the TCP/IP stack down to the NIC, where it is converted into individual packets. Within the network, various components play a specific role: a VPN provides encrypted tunnels, WAN accelerators provide caching and compression, DNS services store the hierarchy of the network, and switches/routers route and forward the packets to the right destination. The core routers form the backbone of the network and connect multiple data centers and clouds.

Each of the above network components maps to an equivalent in the Open vStorage architecture. The Open vStorage Edge offers a block interface to the applications (hypervisors, Docker, ...) on the server. Just like the TCP/IP stack converts data into network packets, the Volume Driver converts the data received through the block interface into objects (Storage Container Objects). Next we have the proxy, which takes up many roles: it encrypts the data for security, provides compression, and routes the SCOs, after chopping them into fragments, to the right backend. For reads the proxy also plays an important caching role by fetching the data from the correct cache backend. Lastly we have Arakoon, our own distributed key-value store, which stores the metadata of all data in the backend of the storage cluster. A backend consists of SSDs and HDDs in JBODs or traditional x86 servers. There can of course be multiple backends, and they can even be spread across multiple data centers.

Conclusion

When reading the first paragraph of this blog post it might have crossed your mind that I was crazy. I do hope that after reading through the whole post you realize that networking and storage have a lot in common. As a Product Manager I keep the path that networking has already travelled in mind when thinking about the future of storage. How do you see the future of storage? Let me know!

Open vStorage High Availability (HA)

Last week I received an interesting question from a customer:

What about High-Availability (HA)? How does Open vStorage protect against failures?

This customer was right to ask that question. If you run a large scale, multi-petabyte storage cluster, HA should be one of your key concerns. Downtime in such a cluster doesn't just mean production loss; it can be a real PR disaster or even put you out of business. When end customers start leaving your service, it can become a slippery slope, and before you realize it there is no customer left on your cluster. Hence, asking the HA question up front is a best practice for every storage engineer tasked with the due diligence of a new storage technology. Over the past few years we have already devoted a lot of words to Open vStorage HA, so I thought it was time for a summary.

In this blog post I will discuss the different HA scenarios, starting at the top (the Edge) and working down to the bottom (the ASD).

The Edge

To start an Edge block device, you need to pass the IP and port of a Storage Router serving the vPool of the vDisk. On the initial connection the Storage Router returns to the Edge a list of fail-over Storage Routers. The Edge caches this information and automatically switches to another Storage Router when it hasn't been able to communicate with its current Storage Router for 15 seconds.
Periodically the Edge also asks the Storage Router to which Storage Router it should connect. This way the Storage Router can instruct the Edge to connect to another Storage Router, for example because the original Storage Router will be shut down.
For more details, check the following blog post about Edge HA.
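The fail-over behaviour described above boils down to a very small state machine. The sketch below is not the real Edge implementation; the class and method names are made up for illustration and only show the caching of the fail-over list and the 15-second switch-over:

```python
import time

class EdgeClientSketch:
    """Illustrative model of the Edge fail-over behaviour described above."""

    FAILOVER_TIMEOUT = 15  # seconds of silence before switching Storage Router

    def __init__(self, initial_router, vpool, vdisk):
        self.routers = [initial_router]      # fail-over list, cached on connect
        self.active = initial_router
        self.last_response = time.monotonic()
        self.vpool, self.vdisk = vpool, vdisk

    def on_connect(self, failover_list):
        # The Storage Router hands back a list of peers; the Edge caches it.
        self.routers = [self.active] + [r for r in failover_list if r != self.active]

    def on_response(self):
        # Any successful communication resets the fail-over timer.
        self.last_response = time.monotonic()

    def maybe_failover(self):
        # After 15 seconds without a response, move to the next cached router.
        if time.monotonic() - self.last_response > self.FAILOVER_TIMEOUT:
            idx = self.routers.index(self.active)
            self.active = self.routers[(idx + 1) % len(self.routers)]
            self.last_response = time.monotonic()
```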

The Storage Router

The Storage Router also has multiple HA features for the data path. A vDisk can only be active on, and owned by, a single Volume Driver (the block-to-object conversion process of the Storage Router), so a mechanism is in place to make sure the ownership of a vDisk can be handed over (the happy path) or stolen (the unhappy path) by another Storage Router. Once the ownership is transferred, the volume is started on the new Storage Router and IO requests can be processed. In case the old Storage Router would still try to write to the backend, fencing kicks in and prevents that data from being stored on the backend.
The ALBA proxy is responsible for encrypting, compressing and erasure coding the Storage Container Objects (SCOs) coming from the Volume Driver and for sending the fragments to the ASD processes on the SSD/SATA disks. Each Storage Router has multiple proxies and can switch between them in case of issues or timeouts.

The ALBA Backend

An ALBA backend typically consists of multiple physical disks across multiple servers. The proxies generate redundant parity fragments via erasure coding, which are stored across all devices of the backend. As a result, a device or even a complete server failure doesn't lead to data loss. On top of that, backends can be recursively composed. Take, for example, the case where you have 3 data centers. One could create a (local) backend containing the disks of each data center and create a (global) backend on top of these (local) backends. Data could for example be replicated 3 times, one copy in each data center, and erasure coded within each data center for storage efficiency. With this approach a data center outage wouldn't cause any data loss.
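The storage efficiency of such a composition is easy to reason about. The sketch below is illustrative arithmetic only; the (8, 2) local policy is an assumption, not a prescribed setting:

```python
def raw_per_usable_byte(global_copies, k_local, m_local):
    """Raw disk space consumed per byte of user data when the global
    backend replicates and each local backend erasure codes its copy."""
    return global_copies * (k_local + m_local) / k_local

# 3 copies (one per data center), (8, 2, c, x) erasure coding inside each
# data center: 3.75 bytes of raw storage per byte of user data.
print(raw_per_usable_byte(3, 8, 2))  # 3.75
```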

The management path HA

The previous sections of this blog post discussed the HA features of the data path. The management path is also highly available. The GUI and API can be reached from all master nodes in the cluster. The metadata is also stored redundantly and spread across multiple nodes or even data centers. Open vStorage has 2 types of metadata: the volume metadata and the backend metadata. The volume metadata is stored in a networked RocksDB using a master-slave concept. More information about that can be found here and in a video here.
The backend metadata is stored in our own, in-house developed, always consistent key-value store named Arakoon. More info on Arakoon can be found here.

That’s in a nutshell how Open vStorage makes sure a disk, server or data center disaster doesn’t lead to storage downtime.