For a while now, I've been using Docker to deploy containers to a number of CoreOS clusters, and while it's very convenient (kind of a boot-the-machine-and-you're-ready-to-deploy type of situation), there are some kinks in the system, particularly with how Docker and systemd play (or fight) with each other.
For the unfamiliar, "CoreOS is an open source lightweight operating system based on the Linux kernel and designed for providing infrastructure to clustered deployments, while focusing on automation, ease of applications deployment, security, reliability and scalability." One of the important things that comes packaged with it is systemd.
systemd is a suite of basic building blocks for a Linux system. It provides a system and service manager that runs as PID 1 and starts the rest of the system.
systemd provides aggressive parallelization capabilities, uses socket and D-Bus activation for starting services, offers on-demand starting of daemons, keeps track of processes using Linux control groups, supports snapshotting and restoring of the system state, maintains mount and automount points and implements an elaborate transactional dependency-based service control logic.
Basically, you get a Linux kernel, an init system (systemd), the tools the CoreOS folks provide, and Docker (along with some other basic utilities like vim), with the assumption that anything else you need will be installed and deployed via containers.
This is all pretty awesome and convenient, until you start trying to deploy your Docker containers with something like fleet. At that point, systemd and Docker don't exactly play nice with each other.
systemd vs. the Docker daemon
Fleet is basically an interface for communicating with systemd on all of the nodes in your cluster. When you schedule a unit, that unit file is dropped onto a machine of fleet's choosing and then executed and managed through systemd. Systemd, being an init system, already knows how to manage processes and restart or stop them when necessary. Docker containers, however, rely on the Docker daemon, which is itself a kind of pseudo init system, since it manages all of the processes run through it.
This means when you go to start a unit, you have to also write a bunch of scripts to make sure Docker manages its processes properly and cleans up after itself (Docker is very messy and likes leaving things all over the place).
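To give a feel for that boilerplate, here is a minimal sketch of the kind of wrapper a unit ends up calling when it runs a container through the Docker daemon. The function name and its arguments are hypothetical; the docker subcommands are the standard kill, rm, pull, and run.

```shell
# Hypothetical wrapper: start a container with a fixed name, clearing out
# whatever a previous run left behind first.
run_container_clean() {
    name="$1"
    image="$2"
    shift 2

    # A stale container with the same name makes `docker run --name` fail,
    # so stop and remove it; errors are ignored when there is nothing to remove.
    docker kill "$name" >/dev/null 2>&1 || true
    docker rm "$name" >/dev/null 2>&1 || true

    # Fetch the image, but carry on with a cached copy if the pull fails.
    docker pull "$image" || true

    # Run in the foreground so the caller (for example systemd) can supervise it.
    docker run --name "$name" "$image" "$@"
}
```

A unit would then call something like run_container_clean test quay.io/seanmcgary/nspawn-test /run.sh from its ExecStart line, and systemd ends up supervising the docker client rather than the process inside the container.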
So how do we fix this?
One init system to rule them all
Systemd has a lot of goodies that are baked in from the beginning. One of those is a utility called systemd-nspawn. Well, what the hell is it?
systemd-nspawn may be used to run a command or OS in a light-weight namespace container. It is more powerful than chroot since it fully virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems and the host and domain name.
Cool, sounds like exactly what we want. If you look at a lot of Docker containers, I would say a good majority of them build off some kind of base system, be it Ubuntu, Debian, Fedora, etc. In the most basic sense, this is just a filesystem that you build up using the Dockerfile and docker build process. We're going to walk through how to build a container, extract the filesystem, and run it using systemd-nspawn.
Building the container
We're going to build a really simple container based off Fedora 21. The script we include is just a bash script that will print the date every 5 seconds.
FROM fedora:21
ADD run.sh /
RUN chmod +x /run.sh
#!/bin/bash
# Print the current date every 5 seconds, forever.
while true; do
    $(which date)
    $(which sleep) 5
done
Notice how in the Dockerfile we didn't include a CMD instruction at the bottom. This is because we're just using Docker to build the filesystem we will extract; systemd-nspawn doesn't know about all of the bells and whistles built into Docker. It just knows how to run what you tell it.
I'm currently using Quay.io for all my hosting, and you can actually pull and use the container I'm building in this post. If you're not using Quay, or are using the Docker registry, just substitute the URL with the one that points to your container.
Now that we have our Dockerfile and run script, we can build the container:
docker build -t quay.io/seanmcgary/nspawn-test .
At this point, we could run our container using the Docker daemon if we wanted to, like so:
docker run -i -t --name=test quay.io/seanmcgary/nspawn-test /run.sh
Extracting the filesystem
Now that we have a container, we can export/extract the filesystem from it. There are a few steps that are bundled into one here:
- docker create <container> <command> will initialize the container for the first time and thus create the filesystem. The command on the end can literally be anything, and as far as I can tell it doesn't even have to be valid
- docker export takes the ID returned from docker create and spits out a tar archive of the container's filesystem
- We then pipe this archive to tar, which we tell to unpack it into a directory called nspawntest
mkdir nspawntest
docker export "$(docker create --name nspawntest quay.io/seanmcgary/nspawn-test true)" | tar -x -C nspawntest
docker rm nspawntest
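If you find yourself doing this for more than one image, the same three steps can be wrapped up in a small function. This is only a sketch: the name extract_rootfs is made up for illustration, it skips --name and uses the ID printed by docker create instead, and it passes -f - so tar reads the archive from stdin explicitly.

```shell
# Hypothetical helper: unpack the filesystem of a Docker image into a directory.
extract_rootfs() {
    image="$1"
    dest="$2"
    mkdir -p "$dest" || return 1

    # `docker create` prints the ID of the new container; `true` is only a
    # placeholder command, since the container is never started.
    id="$(docker create "$image" true)" || return 1

    # Stream the container's filesystem as a tar archive and unpack it.
    docker export "$id" | tar -x -f - -C "$dest"
    status=$?   # exit status of tar, the last command in the pipeline

    # Remove the temporary container whether or not the unpack worked.
    docker rm "$id" >/dev/null
    return "$status"
}
```

Calling extract_rootfs quay.io/seanmcgary/nspawn-test nspawntest should leave you with the same directory as the commands above.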
We now have ourselves a filesystem:
tree -L 2
.
`-- nspawntest
    |-- bin -> usr/bin
    |-- boot
    |-- dev
    |-- etc
    |-- home
    |-- lib -> usr/lib
    |-- lib64 -> usr/lib64
    |-- lost+found
    |-- media
    |-- mnt
    |-- nspawntest_new
    |-- opt
    |-- proc
    |-- root
    |-- run
    |-- run.sh
    |-- sbin -> usr/sbin
    |-- srv
    |-- sys
    |-- tmp
    |-- usr
    `-- var
Running the machine
Now that we have a Fedora filesystem just sitting here, we can point systemd-nspawn at it and tell it to run our /run.sh script:
sudo systemd-nspawn --machine nspawntest --directory nspawntest /run.sh
core@coreoshost ~ $ sudo systemd-nspawn --machine nspawntest --directory nspawntest /run.sh
Spawning container nspawntest on /home/core/nspawntest.
Press ^] three times within 1s to kill container.
Thu Feb 26 18:19:58 UTC 2015
Thu Feb 26 18:20:03 UTC 2015
Thu Feb 26 18:20:08 UTC 2015
Whenever you create a machine with systemd-nspawn, it will show up when you run machinectl:
core@coreoshost ~ $ machinectl
MACHINE CONTAINER SERVICE
nspawntest container nspawn

1 machines listed.
Now, if we want to stop our script from running, we can do so by using the machinectl terminate command:
sudo machinectl terminate nspawntest
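If you script this, it can help to check that the machine is actually registered before terminating it. A minimal sketch, assuming machinectl status exits non-zero for a machine it doesn't know about; the function name stop_machine is hypothetical.

```shell
# Hypothetical helper: terminate a machine only if systemd-machined knows about it.
stop_machine() {
    name="$1"
    # Assumption: `machinectl status` fails when no machine with that name is registered.
    if machinectl status "$name" >/dev/null 2>&1; then
        sudo machinectl terminate "$name"
    else
        echo "machine $name is not running" >&2
    fi
}
```

With that in place, stop_machine nspawntest does the same thing as the command above, and is a no-op with a message if the container has already exited.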
Making it deployable
Now that we know how to run this on its own, we can easily write out a unit file that can then be started via systemd directly or passed to fleet to be scheduled on your cluster:
[Unit]
Description=nspawntest
After=docker.service
Requires=docker.service

[Service]
User=core
ExecStartPre=/bin/bash -c 'docker pull quay.io/seanmcgary/nspawn-test:latest || true'
ExecStartPre=/bin/bash -c 'mkdir /home/core/containers/nspawntest_new || true'
ExecStartPre=/bin/bash -c 'docker export "$(docker create --name nspawntest quay.io/seanmcgary/nspawn-test true)" | tar -x -C /home/core/containers/nspawntest_new'
ExecStartPre=/bin/bash -c 'docker rm nspawntest || true'
ExecStartPre=/bin/bash -c 'mv /home/core/containers/nspawntest_new /home/core/containers/nspawntest_running'
ExecStart=/bin/bash -c 'sudo systemd-nspawn --machine nspawntest --directory /home/core/containers/nspawntest_running /run.sh'
ExecStop=/bin/bash -c 'sudo machinectl terminate nspawntest'
TimeoutStartSec=0
Restart=always
RestartSec=10s
In this unit file, we are basically doing everything that we did above by hand:
- Pull in the latest version of the container
- Create a directory to extract the container to
- Create and export the container via docker, piping the contents through tar to unpack them
- Do a little bit of Docker cleanup, removing the now un-needed container
- Move the freshly extracted filesystem into place as nspawntest_running
- Run the container using systemd-nspawn
- If systemd is told to stop the container, make a call to machinectl to terminate the container by the name that we gave it.
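One detail worth guarding against in the move step: if nspawntest_running already exists from an earlier start, mv places nspawntest_new inside it rather than replacing it, so a restart would keep running the old filesystem. A minimal sketch of a swap that avoids this, with a hypothetical function name:

```shell
# Hypothetical helper: replace the running root filesystem with a fresh one.
swap_rootfs() {
    new="$1"
    running="$2"
    # Without this, `mv` would move $new *into* an existing $running directory.
    rm -rf "$running" || return 1
    mv "$new" "$running"
}
```

In the unit file, this would amount to an extra ExecStartPre line that removes nspawntest_running before the mv. Depending on the ownership and modes of the extracted files, the removal may need to run as root.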
If all goes to plan, you should see the following output when you tail the journal for your unit:
Feb 26 19:24:12 coreoshost systemd: Starting nspawntest...
Feb 26 19:24:12 coreoshost bash: Pulling repository quay.io/seanmcgary/nspawn-test
Feb 26 19:24:13 coreoshost bash: a22582cd26be: Pulling image (latest) from quay.io/seanmcgary/nspawn-test
Feb 26 19:24:13 coreoshost bash: a22582cd26be: Pulling image (latest) from quay.io/seanmcgary/nspawn-test, endpoint: https://quay.io/v1/
Feb 26 19:24:13 coreoshost bash: a22582cd26be: Pulling dependent layers
Feb 26 19:24:13 coreoshost bash: 511136ea3c5a: Download complete
Feb 26 19:24:13 coreoshost bash: 00a0c78eeb6d: Download complete
Feb 26 19:24:13 coreoshost bash: 834629358fe2: Download complete
Feb 26 19:24:13 coreoshost bash: 478c125478c6: Download complete
Feb 26 19:24:13 coreoshost bash: a22582cd26be: Download complete
Feb 26 19:24:13 coreoshost bash: a22582cd26be: Download complete
Feb 26 19:24:13 coreoshost bash: Status: Image is up to date for quay.io/seanmcgary/nspawn-test:latest
Feb 26 19:24:46 coreoshost bash: nspawntest
Feb 26 19:24:46 coreoshost systemd: Started nspawntest.
Feb 26 19:24:46 coreoshost sudo: core : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/systemd-nspawn --machine nspawntest --directory /home/core/containers/nspawntest_running /run.sh
Feb 26 19:24:46 coreoshost echo: Running systemd-nspawn
Feb 26 19:24:46 coreoshost sudo: Spawning container nspawntest on /home/core/containers/nspawntest_running.
Feb 26 19:24:46 coreoshost sudo: Press ^] three times within 1s to kill container.
Feb 26 19:24:46 coreoshost sudo: Thu Feb 26 19:24:46 UTC 2015
Feb 26 19:24:51 coreoshost sudo: Thu Feb 26 19:24:51 UTC 2015
Feb 26 19:24:56 coreoshost sudo: Thu Feb 26 19:24:56 UTC 2015
Feb 26 19:25:01 coreoshost sudo: Thu Feb 26 19:25:01 UTC 2015
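For reference, one way to get to that output on a single machine is to install the unit, start it, and follow its journal. This is a sketch rather than the only way to do it: the function name is hypothetical, and it assumes the unit file is saved as nspawntest.service and that /etc/systemd/system is where local units live on your host.

```shell
# Hypothetical helper: install a unit file on the local machine, start it,
# and follow its log output.
start_and_follow() {
    unit_file="$1"
    unit="$(basename "$unit_file")"

    sudo cp "$unit_file" "/etc/systemd/system/$unit" || return 1
    # Make systemd re-read its unit files so it sees the new one.
    sudo systemctl daemon-reload || return 1
    sudo systemctl start "$unit" || return 1

    # -u filters by unit, -f keeps following new entries as they arrive.
    journalctl -f -u "$unit"
}
```

With fleet, the rough equivalents should be fleetctl start nspawntest.service followed by fleetctl journal -f nspawntest.service.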
This is just the very tip of the iceberg when it comes to running things with systemd-nspawn. There are a lot of other options you can configure when running your container, like permissions, network configurations, journal configurations, etc. I highly suggest taking a look at the docs to see what there is.
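As a small taste, here is a sketch of the earlier invocation with two of those options added. The wrapper name is hypothetical; --read-only and --private-network are documented systemd-nspawn options, but check the man page for the version on your host before relying on them.

```shell
# Hypothetical wrapper around the invocation used earlier, with two extra options.
run_locked_down() {
    dir="$1"
    shift
    # --read-only mounts the container's root filesystem read-only;
    # --private-network cuts the container off from the host's network,
    # leaving it with only a loopback interface.
    sudo systemd-nspawn --machine nspawntest --directory "$dir" \
        --read-only --private-network "$@"
}
```

For the date-printing script, run_locked_down /home/core/containers/nspawntest_running /run.sh should behave just as before, since the script neither writes to disk nor touches the network.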
Now that we know how to run a container via systemd-nspawn, next time we'll look at running systemd within a container using systemd-nspawn so that we can manage multiple processes.