Distributed Config Management

When you are managing large clusters, keeping the configuration of every system up to date can be quite a challenge: new nodes join the cluster, old nodes need to be replaced, vPools are created and removed, and so on. In Eugene and earlier versions we relied on simple config files which were located on each node. It should not come as a surprise that in large clusters it proved to be a challenge to keep these config files in sync. Sometimes a cluster-wide config parameter was updated while one of the nodes was being rebooted. As a consequence the update didn’t make it to that node, and after the reboot it kept running with an old config.
For Fargo we decided to tackle this problem. The answer: Distributed Config Management.

Distributed Config Management

All config files are now stored in a distributed config management system. When a component starts, it retrieves the latest configuration settings from this management system. Let’s have a look at how this works in practice. Say a node is down and we remove the vPool from that node. As the vPool was shrunk, the config for that VolumeDriver is removed from the config management system. When the node restarts, it will try to get the latest configuration settings for the vPool from the config management system. As there is no longer a config for the removed vPool, the VolumeDriver will no longer serve that vPool.

In a first phase we have added support for Arakoon, our beloved and in-house developed distributed key/value store, as the distributed config management system. ETCD has also been incorporated as an alternative to Arakoon, but do know that in our own deployments we always use Arakoon (hint).
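As a quick way to see what ends up in the config management system, you can list the stored keys with the CLI commands described in the next section. A small sketch (the vPool-related prefix is an assumption about the key layout; the framework prefix is the one used for the scheduling key further down):

ovs config list /ovs/framework

ovs config list /ovs/vpools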

How to change a config parameter:

Changing parameters in the config management system is very easy through the Open vStorage CLI:

  • ovs config list some: List all keys with the given prefix.
  • ovs config edit some-key: Edit that key in your configured editor. If the key doesn’t exist, it will get created.
  • ovs config get some-key: Print the content of the given key.

The distributed config management system also contains a key for all scheduled tasks and jobs. To update the default schedule, edit the key /ovs/framework/scheduling/celery and plan the tasks by adding a crontab-style schedule.
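As an example, inspecting and changing the schedule could look as follows. The value shown below is only a sketch: the task name and the exact schema of the crontab-style entries are assumptions, so check the current content of the key with ovs config get before editing it.

ovs config get /ovs/framework/scheduling/celery

ovs config edit /ovs/framework/scheduling/celery

A hypothetical entry, running a task every night at 03:00, could then look like:

{
    "ovs.generic.execute_scrub": {"minute": "0", "hour": "3"}
}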

Dedupe: The good, the bad and the ugly

Over the years a lot has been written about deduplication (dedupe) and storage. There are people who are dedupe aficionados and there are dedupe haters. At Open vStorage we take a pragmatic approach: we use deduplication when it makes sense. When the team behind Open vStorage designed a backup storage solution 15 years ago, we developed the first CAS (Content Addressed Storage) based backup technology. Using this deduplication technology, customers required 10 times less storage for typical backup processes. As said, we use deduplication when it makes sense, and that is why we have decided to disable the deduplication feature in our latest Fargo release.

What is deduplication?

Deduplication is a technique for eliminating duplicate copies of data. This is done by identifying and fingerprinting unique chunks of data. When a duplicate chunk is found, it is replaced by a reference or pointer to the first encountered copy of that chunk. As the pointer is typically much smaller than the actual chunk of data, the amount of storage space needed to store the complete set of data is reduced.
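As a minimal illustration of the principle (plain shell, not Open vStorage code; disk.img is an assumed small example file), you can chunk a file into fixed-size pieces, fingerprint every chunk and list the fingerprints that occur more than once:

split -b 4096 -a 6 disk.img chunk_

sha1sum chunk_* | awk '{print $1}' | sort | uniq -cd | sort -rn | head

Every duplicate fingerprint in that output is a chunk a deduplicating store would keep only once, replacing the other copies with a pointer to it.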

The Good, the Bad, the Ugly

The Good
Deduplication can be a real lifesaver when you need to store a lot of data on a small device. The deduplication ratio, the amount of storage reduction, can be quite substantial when there are many identical chunks of data (think many copies of the same OS) and when the chunks are a couple of orders of magnitude larger than the pointer/fingerprint.

The Bad
Deduplication can be CPU intensive. It requires fingerprinting each chunk of data, and fingerprinting (calculating a hash) is an expensive CPU operation. This performance penalty introduces additional latency in the IO write path.

The Ugly
The bigger the chunks, the less likely they are to be duplicates, as even a single changed bit makes two chunks no longer identical. But the smaller the chunks, the smaller the ratio between the chunk size and the fingerprint. As a consequence, the memory footprint for storing the fingerprints can be large when a lot of data needs to be stored and the chunk size is small. Especially in large scale environments this is an issue, as the hash table in which the fingerprints are stored can become too big to fit in memory.
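To put a rough number on that footprint (an illustrative calculation with an assumed chunk size, not a measurement): with 4 KiB chunks and 128-bit (16 byte) fingerprints, 1 TiB of unique data already produces 2^28, roughly 268 million, fingerprints, or 4 GiB of RAM for the fingerprints alone, before any hash table overhead. At petabyte scale and the same chunk size that grows to about 4 TiB of RAM, which clearly no longer fits.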

Another issue is that the hash table might get corrupted, which basically means your whole storage system is corrupt: the data is still on disk, but you have lost the map of where every chunk is stored.

Block storage reality

It is obvious that deduplication only makes sense when the data to be stored contains many duplicate chunks. Many of today’s applications already have deduplication built in at the application level or generate blocks which can’t be deduped. Hence enabling deduplication introduces a performance penalty (additional IO latency, heavier CPU usage, …) without any significant space savings.

Deduplication also made sense when SSDs were small and expensive compared with traditional SATA drives. By using deduplication it was possible to store more data on the SSD while the penalty of the deduplication overhead was still small. With the latest generation of NVMe drives both arguments have disappeared. The capacity of NVMe drives is almost on par with SATA drives and their cost has decreased significantly. The latency of these devices is also extremely low, bringing it into the same range as the overhead introduced by deduplication. The penalty of deduplication is simply too big when using NVMe.

At Open vStorage we try to build the fastest possible distributed block storage solution. To keep the performance consistently fast, it is essential that the metadata fits completely in RAM. Every time we need to go to an SSD for metadata, performance drops significantly. With deduplication enabled, the metadata per LBA entry was 8 bits for the SCO and offset plus 128 bits for the hash, so the hash dominates the entry size. Hence, by eliminating deduplication we can keep 16 times more metadata in RAM. Or, in our case, we can address a storage pool which is 16 times bigger with the same performance as with deduplication enabled.

One final remark: Open vStorage still uses deduplication when a clone is made from a volume. The clone and its parent share the data up to the point at which the volume was cloned, and only the changes to the cloned volume are stored on the backend. This can be achieved easily and inexpensively with the 8-bit SCO and offset entries, as the clone and its parent reference the same SCOs and offsets.

A healthier cluster begins with OPS: the Open vStorage Health Check

With more and more large Open vStorage clusters being deployed, the Open vStorage Operations (OPS) team is tasked with monitoring more and more servers. In the rare case there is an issue with a cluster, the OPS team wants to get a quick idea of how serious the problem is. That is why the Open vStorage OPS team added another project to the GitHub repo: openvstorage-health-check.

The Open vStorage health check is a quick diagnostic tool to verify whether all components on an Open vStorage node are working fine. It will, for example, check whether all services and Arakoon databases are up and running, whether Memcache, RabbitMQ and Celery are behaving, and whether presets and backends are still operational.

Note that the health check is only a diagnostic tool. Hence it will not take any action to repair the cluster.

Get Started:

To install the Open vStorage health check on a node, execute:

apt-get install openvstorage-health-check

Next, run the health check by executing

ovs healthcheck

As always, this is a work in progress, so feel free to file a bug or a feature request for missing functionality. Pull requests are welcome and will be accepted after careful review by the Open vStorage OPS team.

An example output of the Open vStorage health check:

root@perf-roub-04:~# ovs healthcheck
[INFO] Starting Open vStorage Health Check!
[INFO] ====================================
[INFO] Fetching LOCAL information of node:
[SUCCESS] Cluster ID: 3vvwuO9dd1S2sNIi
[SUCCESS] Hostname: perf-roub-04
[SUCCESS] Storagerouter ID: 6Y6uerfmfZaoZOCu
[SUCCESS] Storagerouter TYPE: EXTRA
[SUCCESS] Environment RELEASE: Fargo
[SUCCESS] Environment BRANCH: Unstable
[INFO] Checking LOCAL OVS services:
[SUCCESS] Service 'ovs-albaproxy_geo-accel-alba' is running!
[SUCCESS] Service 'ovs-workers' is running!
[SUCCESS] Service 'ovs-watcher-framework' is running!
[SUCCESS] Service 'ovs-dtl_local-flash-roub' is running!
[SUCCESS] Service 'ovs-dtl_local-hdd-roub' is running!

[INFO] Checking ALBA proxy 'albaproxy_local-flash-roub':
[SUCCESS] Namespace successfully created or already existed on proxy 'albaproxy_local-flash-roub' with preset 'default'!
[SUCCESS] Creation of a object in namespace 'ovs-healthcheck-ns-default' on proxy 'albaproxy_local-flash-roub' with preset 'default' succeeded!
[SUCCESS] Namespace successfully created or already existed on proxy 'albaproxy_local-flash-roub' with preset 'high'!
[SUCCESS] Creation of a object in namespace 'ovs-healthcheck-ns-high' on proxy 'albaproxy_local-flash-roub' with preset 'high' succeeded!
[SUCCESS] Namespace successfully created or already existed on proxy 'albaproxy_local-flash-roub' with preset 'low'!
[SUCCESS] Creation of a object in namespace 'ovs-healthcheck-ns-low' on proxy 'albaproxy_local-flash-roub' with preset 'low' succeeded!
[INFO] Checking the ALBA ASDs ...
[SKIPPED] Skipping ASD check because this is a EXTRA node ...
[INFO] Recap of Health Check!
[INFO] ======================
[SUCCESS] SUCCESS=154 FAILED=0 SKIPPED=20 WARNING=0 EXCEPTION=0