Keeping an eye on an Open vStorage cluster

Open vStorage offers, as part of the commercial package, two options to monitor an Open vStorage cluster: either the OPS team acts as a second set of eyes, or the OPS team holds the keys, sits in the driving seat and has full control. In both cases these large scale (+5PB) Open vStorage clusters send their logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let’s have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.

Elasticsearch & Kibana

To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearch, Logstash, Kibana) stack to gather logging information and centralise all of it into a single viewing pane.

The ELK-stack consists of 3 open source components:

  • Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
  • Logstash: a log pipeline tool which accepts various inputs and targets. In our case, it reads log messages from a Redis queue and stores them in Elasticsearch.
  • Kibana: a visualisation tool on top of Elasticsearch.

Next to the ELK stack, Journalbeat is used to fetch the logs from all nodes of the cluster and push them onto Redis. Logstash consumes the Redis queue and stores the log messages in Elasticsearch. Aggregating all logs of a cluster into a single, unified view makes it easier to detect anomalies and to find correlations between issues.
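
Conceptually, the Logstash step boils down to popping JSON log events from the Redis list and indexing them into Elasticsearch, where Kibana can query them. The minimal Python sketch below illustrates that hand-off; the host names, the Redis list name ovs-logs and the index name are assumptions for the example, and in the real setup this is handled by a Logstash pipeline configuration rather than a script.

    import json

    import redis
    from elasticsearch import Elasticsearch

    # Assumed endpoints and key names; the real values live in the Logstash pipeline config.
    queue = redis.Redis(host="monitor-redis", port=6379)
    es = Elasticsearch("http://monitor-es:9200")

    while True:
        # Journalbeat pushes JSON log events onto a Redis list; block until one arrives.
        _, raw = queue.blpop("ovs-logs")
        event = json.loads(raw)
        # Index the event so Kibana can search and visualise it.
        es.index(index="ovs-logs", document=event)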

InfluxDB & Grafana

The many statistics that are being tracked are stored in InfluxDB, an open source database specifically designed to handle time series data. On top of InfluxDB, Grafana is used to visualize these statistics. The dashboards give a detailed view of the performance metrics of the cluster as a whole, but also of the individual components. The statistics are provided in an aggregated view, but an OPS member can also drill down to the smallest detail, such as the level of an individual vDisk. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, and safety of the objects in the backend, to the number of maintenance tasks that are running across the cluster.
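
To give an idea of what such a time series entry looks like, the Python sketch below writes a single latency sample for one vDisk into InfluxDB using the 1.x client library. The measurement, tag and field names are made up for the example; the actual metric names are defined by the Open vStorage statistics collection.

    from influxdb import InfluxDBClient  # InfluxDB 1.x Python client

    client = InfluxDBClient(host="monitor-influx", port=8086, database="openvstorage")

    # One sample: latency of a single vDisk, tagged so Grafana can aggregate
    # per cluster or vPool, or drill down to the individual vDisk.
    point = {
        "measurement": "vdisk_latency",            # assumed measurement name
        "tags": {
            "cluster": "customer-a",
            "vpool": "vpool01",
            "vdisk": "vm-042_disk-0",
        },
        "fields": {
            "read_latency_us": 215.0,
            "write_latency_us": 743.0,
        },
    }
    client.write_points([point])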

CheckMK

To detect and escalate issues the Open vStorage team uses CheckMK, an extension to the open source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large scale (storage) clusters. These monitoring rules include general checks such as the CPU and RAM of a host, the services, network performance and disk health, but specific checks for Open vStorage components such as the Volume Driver or Arakoon have of course also been added. The output of the Open vStorage healthcheck also gets parsed by the CheckMK engine. In case of issues a fine-tuned escalation process is set in motion in order to resolve them quickly.
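
As an illustration of how healthcheck results can surface in CheckMK, the sketch below is a local check in the line format the CheckMK agent understands (status code, service name, performance data, summary). The ovs healthcheck invocation with a --to-json flag and the JSON layout are assumptions for this example; the production setup uses full CheckMK rules rather than a simple local check.

    #!/usr/bin/env python3
    """CheckMK local check sketch: report Open vStorage healthcheck results.

    Dropped into the CheckMK agent's local check directory on a storage node.
    The healthcheck invocation and its JSON layout are assumptions.
    """
    import json
    import subprocess

    STATUS = {"SUCCESS": 0, "WARNING": 1, "FAILURE": 2}

    result = subprocess.run(
        ["ovs", "healthcheck", "--unattended", "--to-json"],  # assumed CLI flags
        capture_output=True, text=True, check=False,
    )
    report = json.loads(result.stdout or "{}")

    # Emit one local-check line per test: "<state> <service> <perfdata> <summary>"
    for test, outcome in report.get("checks", {}).items():
        state = STATUS.get(outcome.get("state", "FAILURE"), 3)
        print(f"{state} OVS_{test} - {outcome.get('message', 'no details')}")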

Open vStorage 2.1

It is with great pleasure that I introduce Open vStorage 2.1. Yes, we went straight from version 1.6 to 2.1. We had so many interesting features to add that we just couldn’t call it 2.0.

It is important to know that Open vStorage now comes in two flavors: a free, unrestricted version and a free, restricted community version which includes our own new Open vStorage Backend and allows you to run Open vStorage as a hyperconverged solution. At the moment both versions only feature community support. The unrestricted version is open source and allows you to add almost any S3 compatible backend (Ceph, Swift, Cloudian, …). The community version is the restricted version of our future paying product, which will include support. A paying Open vStorage version will be released in June. In case you want to run Open vStorage hyperconverged out of the box, you will need the Open vStorage Backend, which is highly optimized for use with Open vStorage.

So what is new in 2.1 compared to 1.6:

  • Run Open vStorage as a hyperconverged solution: you can now use local SATA disks inside the host as a (cold) storage backend for data coming out of the write cache. Open vStorage is now hyperconverged and supports hot-swap disks. With our free community edition you can go up to 4 hosts, 16 disks and 49 vDisks. Currently only a limited set of RAID controllers is supported (LSI). In case you want to use Open vStorage in combination with the Seagate Kinetic drives, the Open vStorage Backend will also be required (future version).
  • Flexible cache layout: the Open vStorage setup is now more flexible and allows you to designate multiple SSDs as read cache devices. During the setup you can also indicate which device you want to use as write cache. When you create a vPool this will be taken into account when presenting default values.
  • Improved supportability: you now have the option to send heartbeats to our datacenter and, if necessary, open a VPN connection so we can offer remote help. There is also an option to download all logs straight from the GUI with a single mouse click.
  • New metadata server: when a volume was moved from one host to another, you typically had a few seconds up to a minute of downtime as the metadata had to be rebuilt on the new host. We now have a metadata server topology which supports a master/slave concept. In case the volume is moved and the master server is no longer accessible, you can contact the slave metadata server, so the downtime is only a few milliseconds (see the illustrative sketch after this list; some more info: https://www.youtube.com/watch?v=Yy2EhJkFr04).
  • Performance improvements: we now allow more outstanding data in the write cache before the data ingest coming from the VM gets throttled.
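
The master/slave behaviour of the new metadata server can be pictured roughly as in the sketch below. The MetadataServerUnavailable exception and the get_location call are purely illustrative and not the actual Volume Driver API; the point is only the fallback order that keeps the downtime to milliseconds.

    # Illustrative only: rough shape of the master/slave metadata lookup,
    # not the actual Volume Driver implementation.
    class MetadataServerUnavailable(Exception):
        pass

    def lookup_location(volume_id, object_id, master, slaves):
        """Resolve where a piece of vDisk metadata lives, falling back to a
        slave metadata server when the master cannot be reached (for example
        right after the volume was moved to another host)."""
        for server in [master] + list(slaves):
            try:
                return server.get_location(volume_id, object_id)  # hypothetical call
            except MetadataServerUnavailable:
                # Master unreachable: try the next (slave) metadata server instead
                # of rebuilding the metadata from scratch on the new host.
                continue
        raise MetadataServerUnavailable(f"no metadata server reachable for volume {volume_id}")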

In case you have questions, feel free to create a post in the Support Forum.