Distributed storage options for containers and VMs
This would be a good time to circle back to the Stateless and Swiss engineer post, where we talked about scale-up and scale-out scenarios.
We have already covered networking for containers in some depth, along with failover and load balancing, and we have touched on storage with the article on using LXC containers with Gluster. That article is particularly relevant in the context of this post, and those looking for distributed storage solutions for containers should read it first.
We will cover orchestration with Salt in a series of upcoming posts, but today let's take a quick look at storage and why you may need the tools above once you scale beyond a single host.
The need for shared storage
When you scale beyond a single host you need to think about replication or shared storage so state is maintained across servers.
Replication means copying data across hosts. For instance, if you are running a mail server or any other app and need to scale to 2 instances, you will need to either replicate data across the 2 mail servers or have the 2 servers share storage, so users see the same inbox. This is maintaining state. You will also need a way to distribute load across the 2 servers, which is what a load balancer does.
Now take this to n servers and you have a bit of a problem on your hands: replicating across n servers in real time is a challenge. A lot of databases operate in clustered mode, for instance MySQL has clustering built in and MySQL instances can be interconnected across hosts, but when you scale there are other factors beyond the database; you have user data, session data and app data that need to be kept in sync. Some apps are designed from the ground up to be distributed and address these problems internally, but most aren't. This is a persistent problem in scale-out scenarios and there are a number of solutions.
SAN and NAS
Most people are used to storage operating at around 100 MB/s for local disks and 500 MB/s for SSDs. The idea of putting it on the network may seem like going backwards, but with enterprise SANs connected over 10 Gbit Ethernet or 40 Gbit+ InfiniBand networks you can get those sorts of speeds over the network, so you don't need local storage. These are also typically deployed on networks dedicated to storage use. With network storage your app instances can store their data in a single place that can be accessed by all nodes, and scaling out becomes that much easier.
SAN/NAS can be engineered to be highly available and redundant, with backups. From a management perspective it is much easier to manage storage centrally rather than at the individual host level. Even with 5 hosts, managing local storage in a predictable way can be a handful from a distribution, availability, redundancy and backup perspective.
With centralized storage over the network the individual hosts don't matter that much and can be added and removed at will, providing a layer of flexibility. All the data lives in the SAN or NAS, including the VMs that are deployed on the hosts and launched over network storage. This is where storage companies like EMC and Netapp make their zillions, which newer entrants like Scale.io, Simplivity and Nutanix eye greedily. VMware also plays here with its VSAN solution.
In a lot of cases even this is not enough. For web-scale companies like Google and Facebook, operating tens of data centres with millions of users, SAN and NAS become bottlenecks, and they need more scaled-out, distributed solutions. This is also the approach that solutions like Gluster take: distributed file systems.
It's important to remember that at this point most workloads are deployed in VMs, so most hosts run tons of VMs managed by a hypervisor like ESX or a local OS, and with the fantastic flexibility and performance of containers we will see more containers deployed for the same use cases. The VMs themselves are deployed on shared, network-accessible storage; this underpins capabilities like VM mobility and live migration across hosts. With shared storage for containers you can of course launch containers from multiple hosts, but uninterrupted live migration needs a few additional bits like copying state and maintaining network connections, which LXD, the new project by the LXC team, should shortly bring.
This is why the underlying networking and storage layers and their design become important in scale-out scenarios. They give you a lot of flexibility and redundancy.
More down-to-earth options
There are vendors like Synology, Qnap and Thecus who provide reasonably priced NAS units. However, 10 Gbit performance is inaccessible to all but those with the deepest pockets; InfiniBand and 10 Gbit Ethernet remain pricey, so you are limited to 1 Gbit networks, which is basically 100 MB/s, much slower than your average SSD. You can mitigate this somewhat by bonding, say, four 1 Gbit NICs to take that 100 MB/s to around 300 MB/s with the right equipment, but remember this bandwidth is shared by all your hosts even if you have a dedicated storage network.
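To give a flavour of what bonding looks like on a Debian/Ubuntu host with the ifenslave package installed, here is a rough sketch; the interface names, bond mode and address below are hypothetical and need to match your own network and switch:

# /etc/network/interfaces - aggregate four 1 Gbit NICs into one logical interface
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-mode balance-rr
    bond-miimon 100
    bond-slaves eth0 eth1 eth2 eth3

# each slave NIC is brought up unconfigured and enslaved to the bond
auto eth0
iface eth0 inet manual
    bond-master bond0

Repeat the slave stanza for eth1 to eth3. Modes like 802.3ad (LACP) can balance traffic more evenly but need matching configuration on the switch.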
You can also build your own NAS unit with open source NAS-oriented distros like FreeNAS, Nas4free, OMV, Napp-it or Rockstor, or explore options like DRBD with Heartbeat for redundancy and failover.
You don't even need a NAS-based distro; simply using Btrfs's built-in RAID and other advanced capabilities together with standard tools like DRBD, Keepalived or Heartbeat can give you a high level of availability, failover and redundancy.
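As a quick illustration, a mirrored Btrfs pool across two disks is a one-liner; the device names and mount point here are hypothetical:

# create a Btrfs filesystem with data and metadata mirrored (RAID1) across two disks
mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
mkdir -p /srv/data
mount /dev/sdb /srv/data
# check the layout and space usage across the devices
btrfs filesystem show /srv/data
btrfs filesystem df /srv/data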
Distributed or shared storage in the cloud?
Options for replication include longstanding Linux tools like DRBD and Lsyncd, and distributed file systems like Gluster and Ceph. Gluster is great for replicating storage across a large number of nodes, with performance improving for every extra node added to the cluster.
Conceptually, Gluster takes local storage, replicates it across hosts and makes it available over the network. It's a fantastic solution; the caveat is that performance for small files is not great, but it's improving rapidly.
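To give a flavour of the workflow, creating a 2-node replicated volume boils down to a handful of commands; the hostnames and brick paths below are hypothetical:

# run on server1, after installing glusterfs-server on both nodes
gluster peer probe server2
gluster volume create gv0 replica 2 server1:/data/brick1/gv0 server2:/data/brick1/gv0
gluster volume start gv0
# clients (or the servers themselves) mount the volume over the network
mount -t glusterfs server1:/gv0 /mnt/gv0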
Gluster
If you have a number of VMs you can use Gluster to make data available across your network and perhaps even deploy your containers on a distributed layer; however, performance is not going to be great until Gluster makes the libgfapi driver accessible beyond Qemu, for instance to NFS.
Gluster does seem to support the userland NFS implementation NFS-Ganesha with libgfapi, so this may be worth exploring and benchmarking. You can also use some iSCSI workarounds to use the libgfapi driver. The performance benchmarks for libgfapi are excellent compared to FUSE.
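To illustrate how Qemu uses libgfapi, a VM image can be created and accessed directly on a Gluster volume without going through the FUSE mount; the volume and image names below are hypothetical:

# create a qcow2 image on the gv0 volume via libgfapi
qemu-img create -f qcow2 gluster://server1/gv0/vm01.qcow2 20G
# boot a VM straight off the Gluster volume, again bypassing FUSE
qemu-system-x86_64 -enable-kvm -m 1024 -drive file=gluster://server1/gv0/vm01.qcow2,if=virtio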
We are not going to go into too much detail on using Gluster in this post as we already have an in-depth guide on using Gluster with LXC containers, which applies to your VMs or hosts too.
Let's also briefly look at using Lsyncd and DRBD.
Lsyncd is a great little tool that uses inotify to detect file system changes in real time and triggers rsync to do a sync. It's near real time, the overhead is low and performance is quite good, so it may be useful for some use cases.
You can even use DRBD over a loopback device, though it's generally not recommended. DRBD is normally used over a network to connect 2 disks at the block level to provide redundancy. We will cover it briefly to show how simple it is to set up DRBD for redundancy.
Lsyncd
Let's quickly test Lsyncd. We are using an Ubuntu Trusty host for this exercise and this should work on Debian Wheezy too; for other distros the specifics may vary. We are presuming the master IP is 10.0.0.2 and the slave is 10.0.0.3.
First let's install Lsyncd on the host you want to be the master
apt-get install lsyncd
Let's create some config files for lsyncd
mkdir /etc/lsyncd
touch /etc/lsyncd/lsyncd.conf.lua
mkdir /var/log/lsyncd
touch /var/log/lsyncd/lsyncd.log
touch /var/log/lsyncd/lsyncd-status.log
Now let's configure the master to sync the folder /var/www to the slave. Copy the config below into /etc/lsyncd/lsyncd.conf.lua, changing the target IP to that of your slave where the data should be synced.
nano /etc/lsyncd/lsyncd.conf.lua
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd-status.log",
    statusInterval = 20
}

sync {
    default.rsync,
    source = "/var/www/",
    target = "10.0.0.3:/var/www/",
    rsync = {
        compress = true,
        acls = true,
        verbose = true,
        rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
    }
}
For multiple slaves you can repeat the 'sync' section of the config, changing the target IP for each slave. In our case we are using a single slave as the target.
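For example, with our slave on 10.0.0.3 and a hypothetical second slave on 10.0.0.4, the sync sections would look like this (the settings block stays the same):

sync {
    default.rsync,
    source = "/var/www/",
    target = "10.0.0.3:/var/www/",
    rsync = { compress = true, rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no" }
}

sync {
    default.rsync,
    source = "/var/www/",
    target = "10.0.0.4:/var/www/",
    rsync = { compress = true, rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no" }
}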
Now the last step. Before this works we need to generate SSH keys so the master can log in to the slaves automatically. On the master
ssh-keygen -t rsa
The ssh-keygen utility will ask some questions; go with the defaults and do not set a passphrase. The key pair will be created in the /root/.ssh folder.
We are using the root account here. To run Lsyncd as a regular user, make sure the directories you are syncing are owned by the user running the lsyncd process, or the process will not be able to write to them.
Now create a .ssh directory in the root folder of the slaves if it doesn't exist and set its permissions.
On the slaves
mkdir /root/.ssh
chmod 700 /root/.ssh
nano /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
Copy the contents of the /root/.ssh/id_rsa.pub file from the master into the /root/.ssh/authorized_keys file on each of the slaves you intend to sync with.
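If ssh-copy-id is available on the master it does the same thing in one step, provided password logins are still allowed on the slave:

ssh-copy-id root@10.0.0.3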
Now log in to each of the slaves with ssh from the master to verify that key-based login works. In my case the slave is on 10.0.0.3.
You are all set. Start the lsyncd process on the master
service lsyncd start
Now head to the /var/www folder on the master and test the replication. Let's start by creating some files on the master and see if they replicate to the slave.
touch file{1..100}
They should appear on the slave within a couple of seconds. Now let's try a larger file.
dd if=/dev/zero of=testfile bs=1024k count=256
This too should replicate pretty quickly. As you can see you get near real time sync, you can extend it to multiple slaves, and it's not difficult to set up. The shortcoming with this setup is that it's one-way, master to slave; writes on the slave will not be replicated back to the master.
One workaround is to run Lsyncd on both hosts in a master-master configuration. The caveat is that this will probably not scale well beyond 2-3 hosts, and there is plenty of potential for a split-brain situation where the masters are not in sync.
You would need to install lsyncd on both hosts and set up SSH keys on both sides so they can log in to each other; this is how the config would look, with each host's target pointing at the other host.
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd-status.log",
    statusInterval = 20
}

sync {
    default.rsync,
    source = "/var/www/",
    target = "10.0.0.2:/var/www/",
    delete = "running",
    rsync = {
        compress = true,
        acls = true,
        verbose = true,
        rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
    }
}
Notice the additional 'delete = "running"' setting. This should work reasonably well across 2 hosts, and you can probably extend it to 3 with the 2nd master syncing to the 3rd in a daisy chain, but beyond that there are a number of scenarios that would leave the shared folders in an inconsistent state or cause a race condition, so one-to-many remains the recommended configuration for a robust setup.
The Lsyncd config below is also reported to work for bi-directional sync.
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd-status.log",
    statusInterval = 20
}

sync {
    default.rsync,
    source = "/var/www/",
    target = "10.0.0.2:/var/www/",
    delay = 10,
    rsync = {
        _extra = { "-ausS", "--temp-dir=/tmp", "-e", "ssh -p 22" }
    },
}
Lsyncd has a number of options to explore to fine tune performance.
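For instance, a couple of knobs worth experimenting with in a recent lsyncd 2.x (the values below are arbitrary):

settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd-status.log",
    maxProcesses = 4         -- run several rsync processes in parallel
}

sync {
    default.rsync,
    source = "/var/www/",
    target = "10.0.0.3:/var/www/",
    delay = 5,               -- batch filesystem events for 5 seconds before syncing
    maxDelays = 500          -- but flush earlier once this many events are queued
}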
DRBD
DRBD is tried and tested but limited to 2 nodes; the upcoming 9.0 release will allow DRBD to be configured across multiple nodes. DRBD is fast and has a number of config options to tweak its performance. Since it works at the block level, the use case is limited to situations where you are running your own network and have access to disks at the block level.
With the primary/secondary config only the primary block device of the pair is available for use, and it can be mounted and exported over NFS for use by various hosts. The secondary is replicated and can be mounted after a manual failover, or by using Heartbeat for automatic failover and HA.
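For example, exporting the mounted DRBD device over NFS from whichever node is currently primary is just a matter of an exports entry; the mount point matches the exercise below and the subnet is hypothetical:

apt-get install nfs-kernel-server
# add this line to /etc/exports on the primary node
/srv/test 10.0.0.0/24(rw,sync,no_root_squash)
# reload the export table
exportfs -ra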
We are going to use a loopback device to show how easy it is to set up DRBD, though using DRBD over a loopback device is not recommended in production. We will do this exercise on 2 Ubuntu Trusty VMs in primary/secondary mode with a pretty simple config. Let's call VM1 'drbd01' and VM2 'drbd02', and presume a storage network with the hosts on 10.0.0.2 and 10.0.0.3 respectively.
Let's prepare both hosts
apt-get install drbd8-utils
mkdir /srv/test /srv/fs
cd /srv/fs
dd if=/dev/zero of=drbd.img bs=1024K count=1000
losetup /dev/loop0 /srv/fs/drbd.img
Now edit /etc/hosts on both hosts and map the IPs to the drbd01 and drbd02 hostnames.
nano /etc/hosts
It should look like this on both DRBD hosts
10.0.0.2 drbd01
10.0.0.3 drbd02
Also edit /etc/hostname on both hosts and set each host's name to its respective drbd name.
Now on both drbd01 and drbd02 create a config file for drbd
nano /etc/drbd.d/r0.res
It should look like this
resource r0 {
    net {
        shared-secret "secret";
    }
    on drbd01 {
        device /dev/drbd0;
        disk /dev/loop0;
        address 10.0.0.2:7788;
        meta-disk internal;
    }
    on drbd02 {
        device /dev/drbd0;
        disk /dev/loop0;
        address 10.0.0.3:7788;
        meta-disk internal;
    }
}
The DRBD config operates on the concept of resources. Our resource as you can see in the config file is called 'r0' and the config file is also named 'r0.res'.
Now let's initialize the storage. On both the drbd01 and drbd02 nodes, create the metadata for the resource and start the drbd service
drbdadm create-md r0
/etc/init.d/drbd start
You should get no errors. Now, on the host you would like to make the primary node, run the command below
drbdadm -- --overwrite-data-of-peer primary all
This should work without error. In case you get an error, run this
drbdadm adjust r0
Now you are all set: the 2 block devices are connected to each other and replicating over the network.
Let's format and mount the drbd device on the primary node drbd01
mkfs.ext4 /dev/drbd0
mount /dev/drbd0 /srv/test
Now let's go to the /srv/test folder we just mounted and run some tests
dd if=/dev/zero of=testfile bs=1M count=256
We just created a 256MB file. Now let's see if it's replicated. To do this we will temporarily switch over to the secondary node manually and check. First, on drbd01, unmount the /srv/test folder and use drbdadm to demote it to secondary; then on drbd02, promote it to primary and mount the drbd device.
On drbd01
umount /srv/test
drbdadm secondary r0
On drbd02
drbdadm primary r0
mount /dev/drbd0 /srv/test
Now go to the /srv/test folder and you should see your testfile replicated. Let's create a new set of files here and manually switch back over to the original primary.
touch file{1..100}
umount /srv/test
drbdadm secondary r0
Now back on the drbd01 node
drbdadm primary r0
mount /dev/drbd0 /srv/test
You should see the files you created on drbd02 now replicated in the folder.
Use drbd-overview to get the status
drbd-overview
We used a very basic config to show you how easy it is. To make this config persistent across reboots you will need an init script to set up the loop devices at boot. Here is a pretty good guide on using DRBD with Heartbeat for HA and failover.
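As a rough sketch of the idea (and keeping in mind loop devices aren't recommended for DRBD anyway), recreating the loop device at boot and bringing the resource up from /etc/rc.local would do the job:

# /etc/rc.local, before the final 'exit 0'
losetup /dev/loop0 /srv/fs/drbd.img
drbdadm up r0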
DRBD has a number of options you can use, from protocols that configure async or sync replication modes, to disabling barriers for more performance, and more.
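For instance, with the DRBD 8.4 shipped on Trusty the replication protocol can be set in the net section of the resource; this fragment of our r0.res is just to show where the knob lives:

net {
    protocol A;      # asynchronous: a write completes once it is on the local disk and handed to the network
    # protocol B;    # semi-synchronous: completes once the data has reached the peer
    # protocol C;    # the default: fully synchronous, waits for the peer's disk write too
    shared-secret "secret";
}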
Myriad options
Lsyncd can be handy for smaller, one-off replication use cases both locally and in the cloud. Of course, when using replication you have to remember to manage DB and session persistence too, with MySQL clustering for instance and Memcached for a typical PHP based app, sitting behind an Nginx or HAProxy load balancer configured with IP hashing.
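An IP-hash load balancer in Nginx is only a few lines, along these lines (the upstream name is made up and the backends reuse our two example hosts):

upstream app_servers {
    ip_hash;                      # pin each client IP to the same backend so sessions stick
    server 10.0.0.2;
    server 10.0.0.3;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
    }
}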
But Gluster is really coming into its own. For both local and cloud configurations Gluster can deliver decent levels of performance, tons of flexibility and availability. With libgfapi the performance is now significantly better, letting you run a VM store on Gluster. You can use Gluster to build your own VSAN or just a distributed, resilient storage layer. The best thing is that it's relatively easy to set up and scale.