The Storage and Networking DNA.

Today I want to discuss a less technical but more visionary assertion: the idea that storage and networking share the same DNA. This is especially true for large-scale deployments. The insight surfaced as the rumor of Cisco buying NetApp flared up again. Allow me to explain why I believe exabyte storage clusters and large-scale networks have a lot in common.

The parallels between storage and networking:

The first feature networks and exabyte storage clusters share is that they are highly scalable. Both topologies typically start small and grow over time. Capacity can be added seamlessly by adding more hardware to the cluster, allowing for higher bandwidth, more capacity and more users to be served.

Downtime is typically unacceptable for both, and SLAs guaranteeing multi-nine availability are common. To achieve this level of availability, both rely on hyper-meshed, shared-nothing architectures. These highly redundant architectures ensure that if one component fails, another takes over. To illustrate, switches are typically deployed in a redundant fashion: a single server is connected to two independent switches, and if one switch fails the other takes over. The same holds for storage. Data is stored redundantly, either through replication or through erasure coding across multiple disks and servers. If a disk or server fails, the data can still be retrieved from the other disks and servers in the storage cluster.
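To make the storage side of this redundancy concrete, here is a minimal toy sketch (illustrative only, not Open vStorage code) of replication with automatic failover on reads:

```python
# Toy replication: every object is written to several servers; a read succeeds
# as long as at least one replica survives, much like traffic fails over to the
# second switch when the first one dies.
servers = [{}, {}, {}]               # stand-ins for independent disks/servers

def write(key, data, replicas=2):
    for server in servers[:replicas]:
        server[key] = data

def read(key):
    for server in servers:           # try each replica until one answers
        if key in server:
            return server[key]
    raise IOError(f"all replicas of {key!r} lost")

write("photo.jpg", b"\xff\xd8...")
servers[0].clear()                            # one server fails ...
assert read("photo.jpg") == b"\xff\xd8..."    # ... the data is still retrievable
```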

These days you can check your Facebook timeline or Twitter account from almost anywhere in the world. Large-scale networks give users access wherever they are: a global network that spans the globe and interlinks many smaller networks. The same holds for storage, as we are moving towards a world where data is stored in geographically dispersed locations and even in different clouds.

With new technologies like Software-Defined Networking (SDN), network management has moved towards a single point of governance. The network is configured at a high level, while the detailed configuration is pushed down to the physical and virtual devices that make up the network. The same trend is happening in the storage industry with Software-Defined Storage (SDS): software that lets you configure and manage the physical hardware in the storage cluster, across multiple devices, sites and even different clouds, through a single high-level management view.
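As a hypothetical illustration of what such a single point of governance looks like in practice (the policy fields below are invented for the example, not an actual SDS or Open vStorage schema), an administrator declares intent once and the software pushes it down to every device:

```python
# Hypothetical high-level storage policy; field names are invented for illustration.
policy = {
    "preset": "gold",
    "redundancy": {"scheme": "erasure-coding", "data": 8, "parity": 4},
    "encryption": True,
    "compression": "snappy",
    "sites": ["dc-brussels", "dc-paris", "public-cloud-eu"],
}

def push(policy, devices):
    """Single point of governance: fan the declared intent out to every device,
    much like an SDN controller programs the individual switches."""
    for device in devices:
        device.update(policy)       # in this sketch every device simply receives the policy

push(policy, devices=[{}, {}, {}])  # dicts standing in for physical/virtual devices
```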

A last point I’d like to touch on is that for both networking and storage, hardware brands and models hardly matter. Networking gear from different vendors works together thanks to network standards, and the same goes for storage hardware: different brands of disks, controllers and servers can all be used to build an exabyte storage cluster. Users of the network are not aware of its exact topology (brands, links, routing, …). The same holds for storage: users shouldn’t have to know on which disk their data is stored; all they care about is that they get the right data on time and that it is stored safely.

Open vStorage: taking the network analogy a step further

Let’s have a look at the components of a typical network. At one end we have the consumer of the network, in this case a server. This server is physically connected to the network through a Network Interface Controller (NIC). A NIC driver provides the necessary interfaces for the applications on the server to use the network. Data sent onto the network traverses the TCP/IP stack down to the NIC, where it is converted into individual packets. Within the network various components play a specific role. A VPN provides encrypted tunnels, WAN accelerators provide caching and compression features, DNS services keep track of the network’s naming hierarchy, and switches/routers route and forward the packets to the right destination. The core routers form the backbone of the network and connect multiple data centers and clouds.

Each of the above network components maps to an equivalent in the Open vStorage architecture. The Open vStorage Edge offers a block interface to the applications (hypervisors, Docker, …) on the server. Just like the TCP/IP stack converts data into network packets, the Volume Driver converts the data received through the block interface into objects (Storage Container Objects, SCOs). Next we have the proxy, which takes up many roles: it encrypts the data for security, provides compression and, after chopping the SCOs into fragments, routes them to the right backend. For reads the proxy also plays an important caching role by fetching the data from the correct cache backend. Lastly we have Arakoon, our own distributed key-value store, which stores the metadata of all data in the backends of the storage cluster. A backend consists of SSDs and HDDs in JBODs or traditional x86 servers. There can of course be multiple backends, and they can even be spread across multiple data centers.
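As a rough, purely illustrative sketch of that data path (the function names, fragment size and routing below are assumptions, not the actual Open vStorage implementation), the write path boils down to:

```python
import zlib

FRAGMENT_SIZE = 1 * 2 ** 20          # illustrative fragment size, not the real default

def build_sco(consecutive_writes):
    """Volume Driver: group consecutive block writes into a Storage Container Object."""
    return b"".join(consecutive_writes)

def proxy_store(sco, backends, arakoon, sco_name):
    """Proxy: compress the SCO (in reality it also encrypts), chop it into fragments
    and route each fragment to a backend; the fragment locations go into Arakoon."""
    blob = zlib.compress(sco)
    fragments = [blob[i:i + FRAGMENT_SIZE] for i in range(0, len(blob), FRAGMENT_SIZE)]
    locations = []
    for n, fragment in enumerate(fragments):
        backend_id = n % len(backends)               # naive round-robin routing
        backends[backend_id][f"{sco_name}_{n}"] = fragment
        locations.append((backend_id, f"{sco_name}_{n}"))
    arakoon[sco_name] = locations                    # metadata: where every fragment lives

# toy usage: two dict-backed "backends" and a dict standing in for Arakoon
backends, arakoon = [{}, {}], {}
proxy_store(build_sco([b"write-1", b"write-2"]), backends, arakoon, "sco_0001")
```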

Conclusion

When reading the first paragraph of this blog post, it might have crossed your mind that I was crazy. I hope that after reading through the whole post you realize that networking and storage have a lot in common. As a Product Manager, I keep the path that networking has already travelled in mind when thinking about the future of storage. How do you see the future of storage? Let me know!

Fargo: the updated Open vStorage Architecture

With the Fargo release of Open vStorage we are focusing even more on the Open vStorage sweet spot: multi-petabyte, multi-datacenter storage clusters which offer super-fast block storage.
In order to achieve this we had to significantly change the architecture for the Fargo release. Eugene, the version before Fargo, already had the Shared Memory Server (SHM) in its code base, but it wasn’t activated by default. The Fargo release now primarily uses the SHM approach. To make even more use of it, we created the Open vStorage Edge. The Edge is a lightweight block storage driver which can be installed on Linux servers (hosts running the hypervisor or inside the VM) and talks across the network to the Shared Memory of a remote Volume Driver. Both TCP/IP and the low-latency RDMA protocol can be used to connect the Edge with the Volume Driver. Northbound, the Edge offers iSCSI, Blktap and QEMU interfaces. Additional interfaces such as iSER and FCoE are planned. Next to the new Edge interface, the slower Virtual Machine interface, which exposes a Virtual File System (NFS, FUSE), is still supported.
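To picture where the Edge sits in the stack, here is a deliberately simplified sketch; the class, wire format and port handling below are hypothetical and only illustrate the idea of a thin block client forwarding I/O to a remote Volume Driver:

```python
import socket

class EdgeVolume:
    """Hypothetical sketch of an Edge block device backed by a remote Volume Driver."""
    def __init__(self, volume, host, port, transport="tcp"):
        # transport would be "tcp" or "rdma"; this sketch only implements plain TCP
        assert transport in ("tcp", "rdma")
        self.volume, self.transport = volume, transport
        self.sock = socket.create_connection((host, port))

    def write(self, lba, data):
        # forward the block write to the Volume Driver (imaginary wire format)
        self.sock.sendall(b"W" + self.volume.encode() + lba.to_bytes(8, "big") + data)

    def read(self, lba, length):
        # ask the Volume Driver for `length` bytes at logical block address `lba`
        self.sock.sendall(b"R" + self.volume.encode() + lba.to_bytes(8, "big") +
                          length.to_bytes(4, "big"))
        return self.sock.recv(length)
```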

Architecture

The Volume Driver has also been optimized for performance. The locks in the write path have been revised to minimize their impact. More radical is the decision to remove the deduplication functionality from the Volume Driver in order to keep the metadata of the volumes to a strict minimum. By dropping the bytes reserved for the dedup hash, we can keep all the metadata in RAM and push performance past 1 million IOPS per host on decent hardware. For those who absolutely need deduplication, a version of the Volume Driver with deduplication support is still available.
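A rough back-of-envelope calculation shows why dropping the hash matters for keeping metadata in RAM. The numbers below (4 KiB clusters, an assumed 16-byte hash and 8-byte location per entry) are illustrative assumptions, not the actual Volume Driver layout:

```python
# Illustrative metadata sizing; cluster size and entry layouts are assumptions.
TIB = 2 ** 40
CLUSTER = 4 * 2 ** 10                    # 4 KiB clusters
entries_per_tib = TIB // CLUSTER         # 268,435,456 entries per TiB of volume space

location_bytes = 8                       # where the cluster lives (assumed)
hash_bytes = 16                          # per-cluster dedup hash (assumed)

with_dedup = entries_per_tib * (location_bytes + hash_bytes) / 2 ** 30
without_dedup = entries_per_tib * location_bytes / 2 ** 30

print(f"metadata per TiB with dedup hashes:    ~{with_dedup:.1f} GiB")   # ~6.0 GiB
print(f"metadata per TiB without dedup hashes: ~{without_dedup:.1f} GiB") # ~2.0 GiB
```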

With the breakthrough of RDMA, the network bottleneck is removed and network latency is brought down to a couple of microseconds. Open vStorage uses the possibilities RDMA offers to implement a shared cache layer: it is now possible to create an ALBA backend out of NVMe or SSD devices. This layer acts as a cache, local to a single datacenter, in front of a SATA ALBA backend, the capacity tier, which is spread across multiple datacenters.
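As a purely illustrative sketch of how such a two-tier read behaves (the backend interface below is hypothetical, not the actual ALBA API):

```python
class DictBackend:
    """Stand-in for an ALBA backend (hypothetical interface)."""
    def __init__(self):
        self._objects = {}
    def get(self, key):
        return self._objects.get(key)
    def put(self, key, value):
        self._objects[key] = value

class TieredReader:
    """Fast local cache backend in front of the geo-spread capacity tier."""
    def __init__(self, cache, capacity):
        self.cache = cache          # NVMe/SSD ALBA backend, local to the datacenter
        self.capacity = capacity    # SATA ALBA backend, spread across datacenters
    def read(self, fragment_id):
        data = self.cache.get(fragment_id)
        if data is None:                               # cache miss
            data = self.capacity.get(fragment_id)      # fetch from the capacity tier
            self.cache.put(fragment_id, data)          # warm the shared cache
        return data

cache, capacity = DictBackend(), DictBackend()
capacity.put("sco_0001_frag_3", b"...")
reader = TieredReader(cache, capacity)
reader.read("sco_0001_frag_3")   # first read hits the capacity tier
reader.read("sco_0001_frag_3")   # second read is served from the local cache
```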
All the SSDs in a single datacenter thus form a shared cache for the data of that datacenter. This minimizes the impact of an SSD failure and removes the cold-cache effect when moving a volume between hosts.

To minimize the impact of a single disk failure we introduced the NC-ECC (Network and Clustered Error Correction Codes) algorithm. This algorithm can be compared with solving a Sudoku puzzle. Each SCO, a collection of consecutive writes, is chopped into chunks, and these chunks are distributed across all the nodes and datacenters in the cluster. The total number of chunks is configurable and allows you, for example, to recover from a multi-node failure or the loss of a complete datacenter. A failure, whether of a disk, a node or a datacenter, crosses out some numbers from the Sudoku puzzle, but as long as enough numbers are left you can still solve it. The same goes for data stored with Open vStorage: as long as enough chunks (disks, nodes or datacenters) are left, you can always recover the data. The NC-ECC algorithm is based on forward error correction codes and is further optimized for use in a multi-datacenter setup. When there is a disk or node failure, replacement chunks are created using only data from within the same datacenter. This ensures the bandwidth between datacenters isn’t stressed by a simple disk failure.
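To illustrate the Sudoku idea, here is a toy sketch that uses a single XOR parity chunk and therefore only survives one lost chunk; the real NC-ECC relies on more general forward error correction codes that tolerate multiple chunk, node or datacenter losses:

```python
# Minimal illustration of the "Sudoku" idea behind forward error correction.
# A SCO is chopped into k data chunks plus parity; one XOR parity chunk lets us
# rebuild any ONE lost chunk. The real scheme survives multiple losses.
from functools import reduce

def chop(sco, k):
    """Split a SCO into k equally sized data chunks (zero-padded)."""
    size = -(-len(sco) // k)                    # ceiling division
    padded = sco.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(sco, k):
    chunks = chop(sco, k)
    parity = reduce(xor, chunks)                # one parity chunk
    return chunks + [parity]                    # k + 1 chunks, spread over nodes

def recover(chunks):
    """Rebuild a single missing chunk from the surviving ones."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    assert len(missing) <= 1, "this toy single-parity code survives only one loss"
    if missing:
        chunks[missing[0]] = reduce(xor, (c for c in chunks if c is not None))
    return chunks

# A disk, node or datacenter failure "crosses out" one chunk ...
stored = encode(b"consecutive writes grouped into a SCO", k=4)
stored[2] = None                                # ... but the puzzle is still solvable
assert b"".join(recover(stored)[:4]).rstrip(b"\0") == b"consecutive writes grouped into a SCO"
```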

By splitting up the Edge, the Volume Driver, the cache layer and the capacity tier, you have the ultimate flexibility to build the storage cluster that fits your needs. You can run everything on the same server, hyperconverged, or you can install each component on a dedicated server to maximize scalability and performance.

The first alpha version of Fargo is now available on the repo.