Open vStorage offers, as part of its commercial package, two options to monitor an Open vStorage cluster: either the OPS team acts as a second set of eyes, or the OPS team holds the keys, sits in the driving seat and has full control. In both cases these large-scale (5PB+) Open vStorage clusters send their logs to a centralized monitoring cluster managed by the OPS team. This custom monitoring cluster is based upon scalable tools such as Elasticsearch, InfluxDB, Kibana, Grafana and CheckMK. Let’s have a look at the different components the OPS team uses. Note that these tools are only part of the Open vStorage commercial package.
Elasticsearch & Kibana
To expose the internals of an Open vStorage cluster, the team opted to run an ELK (Elasticsearch, Logstash, Kibana) stack to gather logging information and centralise all of it into a single viewing pane.
The ELK-stack consists of 3 open source components:
- Elasticsearch: a NoSQL database, based on Apache’s Lucene engine, which stores all log files.
- Logstash: a log pipeline tool which supports various inputs and outputs. In our case, it reads logging from a Redis queue and stores it in Elasticsearch.
- Kibana: a visualisation tool on top of Elasticsearch.
Next to the ELK stack, Journalbeat is used to fetch the logging from all nodes of the cluster and put it onto Redis. Logstash consumes the Redis queue and stores the log messages in Elasticsearch. By aggregating all logs from a cluster into a single, unified view, detecting anomalies or finding correlations between issues becomes easier.
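The Journalbeat → Redis → Logstash → Elasticsearch pipeline described above can be sketched as a Logstash configuration. Note this is an illustrative sketch: the Redis key, hosts and index name are assumptions, not Open vStorage's actual settings.

```conf
input {
  # Consume log events that Journalbeat pushed onto a Redis list.
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "ovs-logs"          # assumed queue name
  }
}
output {
  # Store the events in Elasticsearch, one index per day.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ovs-logs-%{+YYYY.MM.dd}"
  }
}
```

Daily indices keep individual Elasticsearch indices small, which makes retention policies (dropping old logs) cheap to enforce.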
InfluxDB & Grafana
The many statistics that are tracked are stored in InfluxDB, an open source database specifically designed to handle time series data. On top of InfluxDB, Grafana is used to visualize these statistics. The dashboards give a detailed view of the performance metrics of the cluster as a whole, but also of the individual components. The statistics are provided in an aggregated view, but an OPS member can also drill down to the smallest detail, such as the individual vDisk level. The metrics that are tracked range from IO latency at different levels, throughput and operations per second, and safety of the objects in the backend, to the number of maintenance tasks running across the cluster.
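To give an idea of how such per-vDisk metrics end up in InfluxDB, here is a minimal sketch that renders one sample in InfluxDB's line protocol (the wire format its write API accepts). The measurement, tag and field names are invented for the example; they are not Open vStorage's actual schema.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one metric sample as an InfluxDB line-protocol string, e.g.
    vdisk_iops,node=node-1,vdisk=disk-001 read=1200i,write=800i 1465839830100400200
    """
    # Tags are sorted for deterministic output; integers get the 'i' suffix
    # that line protocol uses to mark integer fields.
    tag_str = ",".join("{}={}".format(k, v) for k, v in sorted(tags.items()))
    field_str = ",".join(
        "{}={}i".format(k, v) if isinstance(v, int) else "{}={}".format(k, v)
        for k, v in sorted(fields.items())
    )
    return "{},{} {} {}".format(measurement, tag_str, field_str, timestamp_ns)

# Hypothetical IOPS sample for one vDisk on one node.
line = to_line_protocol(
    "vdisk_iops",
    {"vdisk": "disk-001", "node": "node-1"},
    {"read": 1200, "write": 800},
    1465839830100400200,
)
```

Grafana then only needs a query per dashboard panel; the tag keys (here `vdisk` and `node`) are what makes drilling down from cluster-wide aggregates to a single vDisk cheap.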
CheckMK
To detect and escalate issues the Open vStorage team uses CheckMK, an extension of the open source Nagios monitoring system. The CheckMK cluster is loaded with many monitoring rules based upon years of experience in monitoring large-scale (storage) clusters. These monitoring rules include general checks such as the CPU and RAM of a host, the services, network performance and disk health, but specific checks for Open vStorage components such as the Volume Driver or Arakoon have of course also been added. The output of the healthcheck also gets parsed by the CheckMK engine. In case of issues a fine-tuned escalation process is set into motion in order to resolve them quickly.
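Custom checks like the ones mentioned above are often wired into CheckMK as "local checks": small scripts that print one status line per service in the format `<state> <service> <perfdata> <text>`. The sketch below shows what such a check for a hypothetical Volume Driver latency metric could look like; the service name and thresholds are made up for illustration.

```python
def local_check(name, value, warn, crit):
    """Format one CheckMK local-check line: '<state> <service> <perfdata> <text>'.

    State codes follow the Nagios convention: 0=OK, 1=WARN, 2=CRIT.
    Perfdata is 'metric=value;warn;crit' so CheckMK can graph it.
    """
    if value >= crit:
        state, text = 2, "CRIT"
    elif value >= warn:
        state, text = 1, "WARN"
    else:
        state, text = 0, "OK"
    return "{} {} latency_ms={};{};{} {} - latency is {} ms".format(
        state, name, value, warn, crit, text, value)

# Hypothetical check: warn above 50 ms, go critical above 100 ms.
print(local_check("OVS_VolumeDriver_Latency", 12, 50, 100))
```

Because the escalation logic (notifications, on-call rotation) lives in CheckMK itself, the checks only have to report an honest state code.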
During the summer the Open vStorage Team has worked very hard. With this new release we can proudly present:
- Feature complete OpenStack Cinder Plugin: our Cinder Plugin has been improved and meets the minimum features required to be certified by the OpenStack community.
- Flexible cache layout: in 1.5.0 you can easily configure multiple SSD devices for the different caching purposes. During setup you choose which SSDs to partition, and later, when creating a vPool, you select which caching device should be used for the read cache, the write cache and write cache protection. These can now be spread over different SSD devices or consolidated into the same one, depending on the available hardware and needs.
- User management: an admin can now create additional users who have access to the GUI.
- Framework performance: a lot of work has been put into improving performance when a lot of vDisks and vMachines are created. Improvements of up to 50% have been achieved in some cases.
- Improved API security by implementing OAuth2 authentication. A rate limit has also been imposed on API calls to prevent brute-force attacks.
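Rate limiting of the kind mentioned above is commonly done with a sliding window per client: allow at most N calls in the last W seconds. The following is a minimal sketch of that idea, not the actual Open vStorage implementation.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window rate limiter: allow at most `limit` calls per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.calls = deque()  # timestamps of recently allowed calls

    def allow(self, now=None):
        """Return True if a call may proceed now, recording it if so."""
        now = time.time() if now is None else now
        # Drop timestamps that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False


# Hypothetical policy: at most 3 authentication attempts per minute per client.
limiter = RateLimiter(limit=3, window=60)
```

In a real API one limiter instance would be kept per client identifier (token or IP), and a rejected call would map to an HTTP 429 response.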
Fixed bugs and small items:
- GUI now prevents creation of vPools with a capital letter.
- Implemented the recommended mitigation for a security exploit in Elasticsearch 1.1.1.
- Fix for validation of vPools being stuck on validating.
- Protection against reusing vPool names towards the same backend.
- Fix for the Storage Router online/offline detection, which failed when OpenStack was also installed.
Next, we also took the first step towards supporting operating systems other than Ubuntu (RedHat/CentOS). We have created rpm versions of our volumedriver and arakoon packages. These are tested on “Linux Centos7 3.10.0-123.el7.x86_64” and can be downloaded from our packages server. This completes a first important step towards making Open vStorage RedHat/CentOS compatible.
The Open vStorage team is on fire. We released a new version of the Open vStorage software (select Test as QualityLevel). The new big features for this release are:
- Logging and log collecting: we have added logging to Open vStorage. The logs are also centrally gathered and stored in a distributed search engine (Elasticsearch). The logs can be viewed, browsed, searched and analyzed through Kibana, a nicely designed GUI.
- HA Support: with this release Open vStorage supports not only vMotion but also the HA functionality of VMware. This means that in case an ESXi host dies and vCenter starts the VMs on another host, the volumes will automatically be migrated along.
Some small features that made it into this release:
- Distributed filesystem can now be selected as Storage Backend in the GUI. Earlier you could select the file system but you could not extend it to more Grid Storage Routers (GSRs). Now you can extend it across more GSRs and do vMotion on top of e.g. GlusterFS.
- The status of VMs on KVM is now updated quicker in the GUI.
- Manager.py now has an option to specify the version you want to install (run manager.py -v 'version number').
- Under administration there is now an About Open vStorage page displaying the version of all installed components.
In addition, the team also fixed 35 bugs.