Deploy an Open vStorage cluster with Ansible

At Open vStorage we build large Open vStorage clusters for customers. To prevent errors and cut down the deployment time we don’t set up these clusters manually but automate the deployment through Ansible, a free software platform for configuring and managing IT environments.

Before we dive into the Ansible code, let’s first have a look at the architecture of these large clusters. For large setups we take the HyperScale approach, as we call it, and split up storage and compute in order to scale them independently. From experience we have learned that storage on average grows 3 times as fast as compute.

We also use 3 types of nodes: controllers, compute and storage nodes.

  • Controllers: 3 dedicated, hardware-optimized nodes that run the master services and hold the distributed DBs. There is no vPool configured on these nodes, so no VMs run on them. These nodes are equipped with a couple of large-capacity SATA drives for scrubbing.
  • Compute: These nodes run the extra services, are configured with vPools and run the VMs. We typically use blades or 1U servers here as they are only equipped with SSDs or PCIe flash cards.
  • Storage: The storage servers, 2U or 4U, are equipped with a lot of SATA drives but have less RAM and CPU.

The steps below will teach you how to set up an Open vStorage cluster through Ansible (Ubuntu) on these 3 types of nodes. Automating Open vStorage can of course also be achieved in a similar fashion with other tools such as Puppet or Chef.

  • Install Ubuntu 14.04 on all servers of the cluster. Username and password should be the same on all servers.
  • Install Ansible on a PC or server you can use as the Control Machine. The Control Machine is used to send instructions to all hosts in the Open vStorage cluster. Note that the Control Machine should not be part of the cluster, so it can later also be used for troubleshooting the Open vStorage cluster.

    sudo apt-get install software-properties-common
    sudo apt-add-repository ppa:ansible/ansible
    sudo apt-get update
    sudo apt-get install ansible

  • Create /usr/lib/ansible, download the Open vStorage module to the Control Machine and put the module in /usr/lib/ansible.

    mkdir /opt/openvstorage/
    cd /opt/openvstorage/
    git clone -b release1.0 https://github.com/openvstorage/dev_ops.git
    mkdir /usr/lib/ansible
    cp dev_ops/Ansible/openvstorage_module_project/openvstorage.py /usr/lib/ansible

  • Edit the Ansible config file (/etc/ansible/ansible.cfg): uncomment the inventory and library lines and change the library to /usr/lib/ansible.

    vim /etc/ansible/ansible.cfg

    ##change
    #inventory = /etc/ansible/hosts
    #library = /usr/share/my_modules/

    ##to
    inventory = /etc/ansible/hosts
    library = /usr/lib/ansible

  • Edit the Ansible inventory file (/etc/ansible/hosts) and add the controller, compute and storage nodes to describe the cluster, as in the example below:

    #
    # This is the default ansible 'hosts' file.
    #

    #cluster overview

    [controllers]
    ctl01 ansible_host=10.100.69.171 hypervisor_name=mas01
    ctl02 ansible_host=10.100.69.172 hypervisor_name=mas02
    ctl03 ansible_host=10.100.69.173 hypervisor_name=mas03

    [computenodes]
    cmp01 ansible_host=10.100.69.181 hypervisor_name=hyp01

    [storagenodes]
    str01 ansible_host=10.100.69.191

    #cluster details
    [cluster:children]
    controllers
    computenodes
    storagenodes

    [cluster:vars]
    cluster_name=cluster100
    cluster_user=root
    cluster_password=rooter
    cluster_type=KVM
    install_master_ip=10.100.69.171

  • Execute the Open vStorage HyperScale playbook. (It is advised to execute the playbook in debug mode -vvvv)

    cd /opt/openvstorage/dev_ops/Ansible/hyperscale_project/
    ansible-playbook openvstorage_hyperscale_setup.yml -k -vvvv

The above playbook will install the necessary packages and run ‘ovs setup’ on the controller, compute and storage nodes. The next steps are assigning roles to the SSDs and PCIe flash cards, creating the backend and creating the first vPool.
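If the playbook fails to reach one of the nodes, a quick ad-hoc ping from the Control Machine is an easy way to verify the inventory and SSH credentials before digging into the debug output (the username below is a placeholder for the account you created on all servers):

    # Ping every host in the [cluster] group; -k prompts for the SSH password
    ansible cluster -m ping -u <username> -k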

QEMU, Shared Memory and Open vStorage

QEMU, Shared Memory and Open vStorage: it sounds like the beginning of a bad joke, but it is actually a very cool story. In its latest version, Open vStorage secretly released a Shared Memory Client/Server integration with the VolumeDriver (the component that offers the fast, distributed block layer). With this implementation the client (QEMU, Blktap, …) can write to a dedicated memory segment on the compute host which is shared with the Shared Memory Server in the Volume Driver. For the moment the Shared Memory Client only understands block semantics, but in the future we will add file semantics so that an NFS server can be integrated.

The benefits of the Shared Memory approach are very tangible:

  • As everything is in user-space, data copies from user to kernel space are eliminated so the IO performance is about 30-40% higher.
  • CPU consumption is about half for the same IO performance.
  • An easy way to build additional interfaces (e.g. block devices, iSCSI, …) on top.

We haven’t integrated our modified QEMU build with Libvirt yet, so at the moment some manual tweaking is still required if you want to give it a go:

Install the volumedriver-dev package:

sudo apt-get install volumedriver-dev

By default the Shared Memory Server is disabled. To enable it, update the vPool json (/opt/OpenvStorage/config/storagedriver/storagedriver/vpool_name.json) and add the entry “fs_enable_shm_interface”: true under the filesystem section. After adding the entry, restart the Volume Driver for the vPool (restart ovs-volumedriver_vpool_name).
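To illustrate, the filesystem section of the vPool json would then contain something along these lines (the surrounding keys are whatever your setup already has; only the fs_enable_shm_interface line is the one you add):

"filesystem": {
    ...,
    "fs_enable_shm_interface": true
}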
Next, build and install QEMU from source:

git clone https://github.com/openvstorage/qemu.git
cd qemu/
./configure
make
sudo make install

There are 2 ways to create a QEMU vDisk:
Use QEMU to create the disk:

qemu-img create openvstorage:volume 10G

Alternatively, create the disk through the FUSE interface and start a VM using the Open vStorage block driver:

truncate -s 10G /mnt//volume
qemu -drive file=openvstorage:volume,if=virtio,cache=none,format=raw ...
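For reference, a fuller boot command could look like the sketch below. The binary name (qemu-system-x86_64), memory size and ISO are illustrative assumptions; only the -drive line is the Open vStorage specific part:

# Boot a VM from an installer ISO with its disk served by Open vStorage (illustrative values)
qemu-system-x86_64 -enable-kvm -m 2048 \
  -drive file=openvstorage:volume,if=virtio,cache=none,format=raw \
  -cdrom ubuntu-14.04-server-amd64.iso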

The Distributed Transaction Log explained

During my 1-on-1 sessions I quite often get the question how Open vStorage makes sure there is no data loss when a host crashes. As you probably already know, Open vStorage uses SSDs and PCIe flash cards inside the host where the VM is running to store incoming writes. All incoming writes for a volume get appended to a log file (SCO, Storage Container Object) and once enough writes are accumulated the SCO gets stored on the backend. Once the SCO is on the backend, Open vStorage relies on the functionality of the backend (erasure coding, 3-way replication, …) to make sure the data is stored safely.

This means there is a window where data is vulnerable: while the SCO is being constructed and not yet stored on the backend. To ensure this vulnerable data isn’t lost when a host crashes, incoming writes are also stored in the Distributed Transaction Log (DTL) on another host in the Open vStorage cluster. Note that the volume can even be restarted on a different host than where the DTL was stored.

For the DTL of a volume you can select one of the following options as modus operandi:

  • No DTL: when this option is selected, incoming data doesn’t get stored in the DTL on another node. This option can be used when performance is key and some data loss is acceptable when the host or Storage Router goes down. Test VMs or VMs running batch or distributed applications (e.g. transcoding files to another format) can use this option.
  • Asynchronous: when this option is selected, incoming writes are added to a queue on the host and replicated to the DTL on the other host once the queue reaches a certain size or a certain amount of time has passed. To ensure consistency, all outstanding data is synced to the DTL whenever a sync is executed within the file system of the VM. Virtual Machines running on KVM can use this option; it balances data safety and performance.

  • Synchronous: when this option is selected, every write request gets synchronized to the DTL on the other host. This option should be selected when absolutely no data loss is acceptable (distributed NFS, HA iSCSI disks). Since this option synchronizes on every write, it is the slowest DTL mode. Note that in case the DTL can’t be reached (e.g. because the host is being rebooted), the incoming I/O isn’t blocked and no I/O error is returned to the VM; instead an out-of-band event is generated to restart the DTL on another host.

Eugene Release

To start the new year with a bang, the Open vStorage Team is proud to release Eugene.

The highlights of this release are:

Policy Update
Open vStorage enables you to actively add, remove and update policies for specific ALBA backend presets. Updating active policies might result in Open vStorage automatically rewriting data fragments.

ALBA Backend Encryption
When configuring a backend preset, an AES-256 encryption algorithm can be selected.

Failure Domain
A Failure Domain is a logical grouping of Storage Routers. The Distributed Transaction Log (DTL) and MetaDataServer (MDS) for Storage Router groups can be defined in the same Failure Domain or in a Backup Failure Domain. When the DTL and MDS are defined in a Backup Failure Domain, data loss in case of a non-functioning Failure Domain is prevented. Defining the DTL and MDS in a Backup Failure Domain requires low-latency network connections.

Distributed Scrubber
Snapshots which are out of the retention period are marked as garbage and removed by the Scrubber. With the Distributed Scrubber functionality you can now decide to run the actual scrubbing process away from the host that holds the volume. This way, hosts that are running Virtual Machines do not experience a performance hit when the snapshots of those Virtual Machines are scrubbed.

Scrubbing Parent vDisks
Open vStorage allows you to create clones of vDisks. The maximum depth of the clone tree is limited to 255. When a clone is created, scrubbing is still applied to the actual parent of the clone.

New API calls
The following API calls have been added:

  • vDisk templates (set and create from template)
  • Create a vDisk (name, size)
  • Clone a vMachine from a vMachine Snapshot
  • Delete a vDisk
  • Delete a vMachine snapshot

These API calls are not exposed in the GUI.

Removal of the community restrictions
The ALBA backend is no longer restricted and you are no longer required to apply for a community license to use ALBA. The cluster needs to be registered within 30 days; otherwise the GUI will stop working until the cluster is registered.

Remove Node
Open vStorage allows for nodes to be removed from the Open vStorage Cluster. With this functionality you can remove any node and scale your storage cluster along with your changing storage requirements. Both active and broken nodes can be consistently removed from the cluster.

Some smaller Feature Requests were also added:

  • Removal of the GCC dependency.
  • Option to label a manual snapshot as ‘sticky’ so it doesn’t get removed by the automated snapshot cleanup.
  • Allow stealing of a volume when no Hypervisor Management Center is configured and the node owning the volume is down.
  • Set_config_params for vDisk no longer requires the old config.
  • Automatically reconfigure the DTL when DTL is degraded.
  • Automatic triggering of a repair job when an ASD is down for 15 minutes.
  • ALBA is independent of broadcasting.
  • Encryption of the ALBA socket communication.
  • New Arakoon client (pyarakoon).

The following are the most important bug fixes in the Eugene release:

  • Fix for various issues when first node is down.
  • “An error occurred while configuring the partition” while trying to assign DB role to a partition.
  • Cached list not updated correctly.
  • Celery workers are unable to start.
  • NVMe drives are not correctly detected.
  • Volume restart fails due to failure while clearing the DTL.
  • Arakoon configs not correct on 4th node.
  • Bad MDS Slave placement.
  • DB role is required on every node running a vPool but isn’t mandatory.
  • Exception in tick crashes the ovs-scheduled-task service.
  • OVS-extensions is very chatty.
  • Voldrv python client hangs if node1 is down.
  • Bad logic to decide when to create extra alba namespace hosts.
  • Timeout of CHAINED ensure single decorator is not high enough.
  • Possibly wrong master selected during “ovs setup demote”.
  • Possible race condition when adding nodes to cluster.

2016: Cheers to the New Year!

The past year has been a remarkable one for Open vStorage. We did 2 US roadshows, attended a successful OpenStack summit in Vancouver, moved and open-sourced all of Open vStorage on GitHub and released a lot of new functionality (our own hyperconverged backend, detailed tuning & caching parameters for vDisks, a certified OpenStack Cinder plugin, remote support, CentOS7, …). The year also ended with a bang as customers were trying to beat each other’s top 4k IOPS results.


While it might look hard to beat the success of 2015, the Open vStorage Team is confident that 2016 will be even more fruitful. Feature-wise, some highly anticipated features will see the light in Q1: improved QEMU integration, block devices (blktap support), Docker support (Flocker), iSCSI disks, replication, support for an all-flash backend, … Next to these product features, the team will open its kimono and discuss the Open vStorage internals in detail in our new GitBook documentation. To be more in the spirit of the open source community, blueprints of upcoming features will also be published as much as possible to the GitHub repo. A first example is the move of the Open vStorage config files to etcd. Finally, based upon the projects of partners and customers in the pipeline, 2016 will be unrivaled. So far a 39 node cluster (nearly 125TB of flash alone) is being deployed in production and multiple datacenters (US, Europe, Asia) are being transformed to use Open vStorage as their storage platform.

Cheers to the New Year. May it be a memorable one.