My wife thinks I’m weird, but she loves me anyway. She is probably right, because today I got excited about metadata. The metadata of volumes, to be more exact. Let me explain why. Open vStorage keeps track of which LBA of a volume contains which data; that is the metadata of a volume. Instead of storing the 4k data block next to the LBA in some kind of database, we create a mapping between the LBA, the place where the data is actually stored (the Storage Container ID and the offset) and a hash of the 4k block. I’ve discussed this already in a previous blog post. Now, why am I so excited about this, you ask?
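To make that a bit more tangible, here is a minimal Python sketch of such a metadata entry. It is my own illustration, not the actual Open vStorage code; the names (VolumeMetadata, BlockLocation) and the choice of SHA-1 as the hash are just placeholders.

```python
from dataclasses import dataclass
import hashlib

BLOCK_SIZE = 4096  # the 4k blocks mentioned above

@dataclass
class BlockLocation:
    sco_id: str    # Storage Container (SCO) that holds the block on the backend
    offset: int    # byte offset of the block inside that SCO
    digest: bytes  # hash of the 4k block's contents

class VolumeMetadata:
    """The per-volume 'metadata database': LBA -> where the data actually lives."""

    def __init__(self) -> None:
        self._map: dict[int, BlockLocation] = {}

    def record_write(self, lba: int, sco_id: str, offset: int, data: bytes) -> None:
        self._map[lba] = BlockLocation(sco_id, offset, hashlib.sha1(data).digest())

    def lookup(self, lba: int) -> BlockLocation | None:
        return self._map.get(lba)

md = VolumeMetadata()
md.record_write(lba=42, sco_id="sco_000001", offset=3 * BLOCK_SIZE, data=b"\x00" * BLOCK_SIZE)
print(md.lookup(42))
```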
In version 1.6 of Open vStorage this mapping, the metadata database, is only stored locally on the host where the volume is served (typically the host where the Virtual Machine is running). Under normal circumstances it isn’t a problem that this data is only local, as only that host uses the metadata database. When you move a Virtual Machine between hosts, however, this locality becomes a drawback. Before the new host can start accepting data for the moved volume, the metadata database for that volume needs to be reconstructed with data from the backend. In practice this means getting all TLOGs for the volume from the backend and replaying them from the first to the last. Remember, each TLOG entry contains the metadata for a specific write: the LBA, the location (a combination of the SCO name and the offset within the SCO) and a hash. Once all TLOGs are replayed, the metadata is reconstructed and the volume can start accepting new IO requests. In case the volume has received a lot of write IO, many TLOGs need to be fetched from the backend, so it will take a few seconds, in worst-case scenarios even up to a minute, before the volume is available. It works, but it isn’t ideal.
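To give you a feel for what that reconstruction looks like, here is a hedged sketch of the replay loop. The entry layout and the rebuild_metadata helper are my own simplification, not the real volumedriver code.

```python
from typing import Iterable, NamedTuple

class TLogEntry(NamedTuple):
    lba: int       # which logical block was written
    sco_id: str    # SCO that received the data
    offset: int    # offset of the block within that SCO
    digest: bytes  # hash of the 4k block

def rebuild_metadata(tlogs: Iterable[list[TLogEntry]]) -> dict[int, tuple[str, int, bytes]]:
    """Replay every TLOG from first to last; a later write to an LBA overwrites an earlier one."""
    metadata: dict[int, tuple[str, int, bytes]] = {}
    for tlog in tlogs:                        # one pass over all TLOGs fetched from the backend
        for entry in tlog:
            metadata[entry.lba] = (entry.sco_id, entry.offset, entry.digest)
    return metadata

# Stand-in data instead of a real backend fetch:
tlogs = [
    [TLogEntry(0, "sco_1", 0, b"a"), TLogEntry(1, "sco_1", 4096, b"b")],
    [TLogEntry(0, "sco_2", 0, b"c")],         # LBA 0 was overwritten later
]
print(rebuild_metadata(tlogs)[0])             # ('sco_2', 0, b'c')
```

The cost is exactly what I described: the more write IO the volume has seen, the more TLOGs have to be fetched and walked before the volume can serve requests again.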
In the new version of Open vStorage we have fundamentally changed the metadata architecture by adding role-based functionality to the metadata server. A metadata server can be the master (database) or a slave (database) for a certain volume. Typically the master role runs on the same host as the volume itself, so lookup performance is top-notch. Next to the master role, an additional slave role is by default created on the metadata server of another host. The slave stays almost up to date with the master, so when it is promoted to master, only a couple of TLOG entries have to be replayed.
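In pseudo-Python the role assignment could look roughly like this; the class and function names are invented for illustration and are not the real metadata server API.

```python
from enum import Enum

class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"

class MetadataServer:
    """One metadata server per host; it can hold a different role per volume."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.roles: dict[str, Role] = {}   # volume id -> role on this server

def assign_roles(volume_id: str, local: MetadataServer, remote: MetadataServer) -> None:
    """Master next to the volume for fast lookups, a slave on another host by default."""
    local.roles[volume_id] = Role.MASTER
    remote.roles[volume_id] = Role.SLAVE

host_a, host_b = MetadataServer("host-a"), MetadataServer("host-b")
assign_roles("vm3-disk1", local=host_a, remote=host_b)
print(host_a.roles, host_b.roles)
```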
The real value of the master/slave architecture comes into play when you move a Virtual Machine between hosts. When the original metadata server with the master role is still available, e.g. during a live migration, the volume will immediately be able to receive IO requests because under the hood the original master metadata server is still consulted. As you are going over the network for each metadata lookup, there will be a small performance drop. As soon as the new host notices it has to go over the network to another host for the metadata, it will in the background create a slave copy locally. To do this it starts fetching the TLOGs from the backend and replays them so the metadata of the moved volume becomes available locally. Once the local metadata server is up to date, the local slave is promoted to master and the metadata is looked up locally from then on.
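Putting the pieces together, the migration path could be sketched like this. Again, Backend, MetadataCopy and the promote/demote flags are assumptions made for the example, not the actual interfaces.

```python
class MetadataCopy:
    """One copy of a volume's metadata, either the master or a slave."""

    def __init__(self) -> None:
        self.map: dict[int, tuple[str, int]] = {}   # lba -> (sco_id, offset)
        self.is_master = False

    def replay(self, tlog) -> None:
        for lba, sco_id, offset in tlog:
            self.map[lba] = (sco_id, offset)

    def lookup(self, lba):
        return self.map.get(lba)

class Backend:
    """Stand-in for the storage backend that keeps the TLOGs per volume."""

    def __init__(self, tlogs_per_volume) -> None:
        self._tlogs = tlogs_per_volume

    def fetch_tlogs(self, volume_id):
        return self._tlogs.get(volume_id, [])

class MigratedVolume:
    """A volume that just live-migrated: its master metadata is still on the old host."""

    def __init__(self, volume_id: str, remote_master: MetadataCopy, backend: Backend) -> None:
        self.volume_id = volume_id
        self.remote_master = remote_master    # reachable over the network
        self.backend = backend
        self.local = None                     # no local metadata yet

    def lookup(self, lba):
        # Local and fast once the slave is promoted, over the network until then.
        if self.local is not None and self.local.is_master:
            return self.local.lookup(lba)
        return self.remote_master.lookup(lba)

    def background_catch_up(self) -> None:
        slave = MetadataCopy()
        for tlog in self.backend.fetch_tlogs(self.volume_id):
            slave.replay(tlog)                # rebuild the metadata locally
        slave.is_master = True                # promote the local copy ...
        self.remote_master.is_master = False  # ... and demote the old master
        self.local = slave

backend = Backend({"vm3-disk1": [[(0, "sco_1", 0)], [(0, "sco_2", 0)]]})
old_master = MetadataCopy()
old_master.is_master = True
old_master.replay([(0, "sco_1", 0)])
old_master.replay([(0, "sco_2", 0)])
vol = MigratedVolume("vm3-disk1", old_master, backend)
print(vol.lookup(0))            # served over the network by the old master
vol.background_catch_up()
print(vol.lookup(0))            # now served from the freshly promoted local copy
```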
Let’s illustrate this with an example. In the scenario below, VM3 is moved to a second host. VM3 will immediately be able to receive IO as the metadata server on the original host is still accessible.
Another use case where the master/slave architecture shows its value is when a host goes down and the Virtual Machines and their volumes are restarted on another host. As the slave is almost up to date with the master, only a few TLOG entries need to be replayed and the volume is almost instantly accessible. It doesn’t really matter whether the Virtual Machine is restarted on the same host as the promoted slave or on a different one; in the latter case the new master will first be consulted over the network.
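Why the restart is near-instant can be shown in a few lines; the bookkeeping below (an index of the last TLOG entry the slave already applied) is my own simplification of the idea.

```python
def promote_slave(slave_map: dict, applied_upto: int, tlog_entries: list) -> dict:
    """Replay only the tail of the TLOG stream the slave had not seen yet, then serve as master."""
    for lba, location in tlog_entries[applied_upto:]:
        slave_map[lba] = location           # only a handful of entries in practice
    return slave_map

tlog_entries = [(0, ("sco_1", 0)), (1, ("sco_1", 4096)), (0, ("sco_2", 0))]
slave = {0: ("sco_1", 0), 1: ("sco_1", 4096)}   # the slave already applied the first two entries
print(promote_slave(slave, applied_upto=2, tlog_entries=tlog_entries))
# {0: ('sco_2', 0), 1: ('sco_1', 4096)}
```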
In the background a local slave will be created and the necessary TLOGs will be fetched from the backend to recreate the metadata locally. Once the slave on the host where the Virtual Machine is running is up to date, it will of course be promoted to master and metadata lookups will happen locally again.
Together with this master/slave concept we also added functionality that detects when a metadata server is overloaded. Once this is detected, another metadata server is created on the same host to make sure performance doesn’t suffer.
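To round things off, here is a hand-wavy sketch of that overload handling; the threshold and the placement helper are invented for the example, the real detection logic is of course more involved.

```python
from dataclasses import dataclass, field

MAX_VOLUMES = 100   # illustrative threshold, not a real default

@dataclass
class MdsInstance:
    port: int
    volumes: list = field(default_factory=list)

def place_volume(volume_id: str, instances: list, next_free_port: int) -> MdsInstance:
    """Put the volume on a metadata server with spare capacity, spawning one if all are busy."""
    for mds in instances:
        if len(mds.volumes) < MAX_VOLUMES:
            mds.volumes.append(volume_id)
            return mds
    fresh = MdsInstance(port=next_free_port)   # every server on this host is overloaded
    fresh.volumes.append(volume_id)
    instances.append(fresh)
    return fresh
```

As you can see, I have all reasons to be excited about metadata!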