For a while now, I've been using Docker to deploy containers to a number of CoreOS clusters. It's very convenient (kind of a boot-the-machine-and-you're-ready-to-deploy situation), but there are some kinks in the system, particularly in how Docker and systemd play (or fight) with each other.

For the unfamiliar, "CoreOS is an open source lightweight operating system based on the Linux kernel and designed for providing infrastructure to clustered deployments, while focusing on automation, ease of applications deployment, security, reliability and scalability." One of the important things that comes packaged with it is systemd.

systemd is a suite of basic building blocks for a Linux system. It provides a system and service manager that runs as PID 1 and starts the rest of the system. systemd provides aggressive parallelization capabilities, uses socket and D-Bus activation for starting services, offers on-demand starting of daemons, keeps track of processes using Linux control groups, supports snapshotting and restoring of the system state, maintains mount and automount points and implements an elaborate transactional dependency-based service control logic.

Basically you get a Linux kernel, an init system (systemd), the tools the CoreOS folks provide, and Docker (among some other basic utilities like vim), with the assumption that anything else you need will be installed and deployed via containers.

This is all pretty awesome and convenient, until you start trying to deploy your Docker containers with something like fleet. At that point systemd and Docker don't exactly play nice with each other.

systemd vs. the Docker daemon

Fleet is basically an interface for communicating with systemd on all of the nodes in your cluster. When you schedule a unit, that unit file is dropped onto a machine of fleet's choosing and then executed and managed through systemd. systemd, being an init system, already knows how to manage processes and restart or stop them when necessary. Docker containers, however, rely on the Docker daemon, which is itself a kind of pseudo init system because it manages all of the processes run through it.

This means when you go to start a unit, you have to also write a bunch of scripts to make sure Docker manages its processes properly and cleans up after itself (Docker is very messy and likes leaving things all over the place).
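
As a rough illustration of the kind of housekeeping this involves, here is a minimal sketch of a cleanup helper a unit might call before starting a container. The function name cleanup_container and the container name are hypothetical, not taken from any real unit file. It assumes only the standard docker kill and docker rm subcommands.

```shell
#!/bin/bash
# Hypothetical helper: make sure no stale container with this name is left
# behind before a unit starts a fresh one.
cleanup_container() {
    local name="$1"
    # Stop the container if it is still running; ignore "no such container".
    docker kill "$name" >/dev/null 2>&1 || true
    # Remove the stopped container so the name can be reused.
    docker rm "$name" >/dev/null 2>&1 || true
}
```

Calling cleanup_container with the container's name before each docker run keeps a leftover container from blocking the new one. Both failures are deliberately ignored, since "nothing to clean up" is the normal case.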

So how do we fix this?

One init system to rule them all

systemd has a lot of goodies baked in from the beginning. One of those is a utility called systemd-nspawn. Well, what the hell is it?

systemd-nspawn may be used to run a command or OS in a light-weight namespace container. It is more powerful than chroot since it fully virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems and the host and domain name.

Cool, that sounds like exactly what we want. If you look at a lot of Docker containers, a good majority of them build off some kind of base system, be it Ubuntu, Debian, Fedora, etc. In the most basic sense, this is just a filesystem that you build up using the Dockerfile and the docker build process. We're going to walk through how to build a container, extract the filesystem, and run it using systemd-nspawn.

Building the container

We're going to build a really simple container based off Fedora 21. The script we include is just a bash script that will print the date every 5 seconds.


FROM fedora:21


RUN chmod +x /

#! /bin/bash

while true; do
    $(which date)
    $(which sleep) 5
done

Notice how we didn't include a CMD instruction at the bottom of the Dockerfile. This is because we're just using Docker to build the filesystem we will extract; systemd-nspawn doesn't know about all of the bells and whistles built into Docker. It just knows how to run what you tell it.

I'm currently using Quay for all my hosting, and you can actually pull and use the container I'm building in this post. If you're not using Quay, or are using the Docker registry, just substitute the URL with the one that points to your container.

Now that we have our Dockerfile and run script, we can build the container:

docker build -t .

At this point, we could run our container using the Docker daemon if we wanted to, like so:

docker run -i -t --name=test /

Extracting the filesystem

Now that we have a container, we can export/extract the filesystem from it. There are a few steps that are bundled into one here:

  • Running docker create <image> <command> will initialize the container for the first time and thus create the filesystem. The command on the end can literally be anything, and as far as I can tell it doesn't even have to be valid.
  • docker export takes the ID returned by docker create and writes the container's filesystem to standard output as a tar archive.
  • We then pipe this archive to tar, which we tell to extract it into a directory called nspawntest.
mkdir nspawntest
docker export "$(docker create --name nspawntest true)" | tar -x -C nspawntest
docker rm nspawntest
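
The same three steps can be wrapped in a small function. This is only a sketch: the name extract_image_fs and its two arguments (an image reference and a destination directory) are placeholders made up for illustration. It assumes nothing beyond the docker create, docker export and docker rm subcommands already used above.

```shell
#!/bin/bash
# Hypothetical helper: extract_image_fs IMAGE DEST
# Unpacks the filesystem of IMAGE into the directory DEST.
extract_image_fs() {
    local image="$1" dest="$2" cid status
    mkdir -p "$dest" || return 1
    # docker create prints the ID of the new (never started) container;
    # "true" is just a throwaway command.
    cid="$(docker create "$image" true)" || return 1
    # docker export writes a tar stream to stdout; tar unpacks it into DEST.
    docker export "$cid" | tar -x -C "$dest"
    status=$?   # exit status of tar, the last command in the pipeline
    # Remove the temporary container whether or not extraction worked.
    docker rm "$cid" >/dev/null
    return "$status"
}
```

Using the container ID rather than a fixed --name means a leftover container from an earlier failed run cannot cause a name clash.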

We now have ourselves a filesystem:

tree -L 2
`-- nspawntest
    |-- bin -> usr/bin
    |-- boot
    |-- dev
    |-- etc
    |-- home
    |-- lib -> usr/lib
    |-- lib64 -> usr/lib64
    |-- lost+found
    |-- media
    |-- mnt
    |-- nspawntest_new
    |-- opt
    |-- proc
    |-- root
    |-- run
    |-- sbin -> usr/sbin
    |-- srv
    |-- sys
    |-- tmp
    |-- usr
    `-- var

Running the machine

Now that we have a Fedora filesystem just sitting here, we can point systemd-nspawn at it and tell it to run our script.

sudo systemd-nspawn --machine nspawntest --directory nspawntest /
core@coreoshost ~ $ sudo systemd-nspawn --machine nspawntest --directory nspawntest /
Spawning container nspawntest on /home/core/nspawntest.
Press ^] three times within 1s to kill container.
Thu Feb 26 18:19:58 UTC 2015
Thu Feb 26 18:20:03 UTC 2015
Thu Feb 26 18:20:08 UTC 2015

Whenever you create a machine with systemd-nspawn, it will show up when you run machinectl:

core@coreoshost ~ $ machinectl
MACHINE                          CONTAINER SERVICE         
nspawntest                       container nspawn          

1 machines listed.

Now, if we want to stop our script from running, we can do so by using the machinectl terminate command:

sudo machinectl terminate nspawntest
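
If you want to script around this, a small check for whether a machine is currently registered can be handy. The sketch below is an assumption-laden example: machine_is_running is a made-up name, and it relies on machinectl list printing the machine name in the first column, as in the listing above, plus the --no-legend flag to drop the header and footer.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a machine with the given name is listed.
machine_is_running() {
    local name="$1"
    # Compare the first column of each row against the requested name.
    machinectl list --no-legend |
        awk -v n="$name" '$1 == n { found = 1 } END { exit !found }'
}
```

For example, `machine_is_running nspawntest && sudo machinectl terminate nspawntest` only attempts the terminate when the machine is actually listed.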

Making it deployable

Now that we know how to run this on its own, we can easily write out a unit file that can then be started via systemd directly or passed to fleet to be scheduled on your cluster:


ExecStartPre=/bin/bash -c 'docker pull || true'
ExecStartPre=/bin/bash -c 'mkdir /home/core/containers/nspawntest_new || true'
ExecStartPre=/bin/bash -c 'docker export "$(docker create --name nspawntest true)" | tar -x -C /home/core/containers/nspawntest_new'
ExecStartPre=/bin/bash -c 'docker rm nspawntest || true'
ExecStartPre=/bin/bash -c 'mv /home/core/containers/nspawntest_new /home/core/containers/nspawntest_running'

ExecStart=/bin/bash -c 'sudo systemd-nspawn --machine nspawntest --directory /home/core/containers/nspawntest_running /'

ExecStop=/bin/bash -c 'sudo machinectl terminate nspawntest'


In this unit file, we are basically doing everything that we did above by hand:

  • Pull in the latest version of the container
  • Create a directory to extract the container to
  • Create and export the container via docker, piping the contents through tar to unpack them
  • Do a little bit of Docker cleanup, removing the now-unneeded container
  • Move the extracted filesystem to the directory the container will run from
  • Run the container using systemd-nspawn
  • If systemd is told to stop the container, make a call to machinectl to terminate the container by the name that we gave it.
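
One caveat about the mv step in the unit file: if the destination directory already exists from a previous run, mv places the source inside it instead of replacing it. A hedged sketch of a helper that swaps the trees cleanly follows. The name swap_rootfs is hypothetical, and it assumes the old container has already been stopped, so deleting its filesystem is safe.

```shell
#!/bin/bash
# Hypothetical helper: swap_rootfs NEW RUNNING
# Replaces the directory RUNNING with the freshly extracted directory NEW.
swap_rootfs() {
    local new="$1" running="$2"
    # Refuse to continue without a fresh tree or a destination path.
    [ -d "$new" ] && [ -n "$running" ] || return 1
    # Drop the previous tree so mv renames NEW instead of nesting it inside.
    rm -rf -- "$running"
    mv -- "$new" "$running"
}
```

In the unit file this would take the place of the plain mv, called with the nspawntest_new and nspawntest_running paths.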

If all goes to plan, you should see the following output when you tail the journal for your unit:

Feb 26 19:24:12 coreoshost systemd[1]: Starting nspawntest...
Feb 26 19:24:12 coreoshost bash[4864]: Pulling repository
Feb 26 19:24:13 coreoshost bash[4864]: a22582cd26be: Pulling image (latest) from
Feb 26 19:24:13 coreoshost bash[4864]: a22582cd26be: Pulling image (latest) from, endpoint:
Feb 26 19:24:13 coreoshost bash[4864]: a22582cd26be: Pulling dependent layers
Feb 26 19:24:13 coreoshost bash[4864]: 511136ea3c5a: Download complete
Feb 26 19:24:13 coreoshost bash[4864]: 00a0c78eeb6d: Download complete
Feb 26 19:24:13 coreoshost bash[4864]: 834629358fe2: Download complete
Feb 26 19:24:13 coreoshost bash[4864]: 478c125478c6: Download complete
Feb 26 19:24:13 coreoshost bash[4864]: a22582cd26be: Download complete
Feb 26 19:24:13 coreoshost bash[4864]: a22582cd26be: Download complete
Feb 26 19:24:13 coreoshost bash[4864]: Status: Image is up to date for
Feb 26 19:24:46 coreoshost bash[4916]: nspawntest
Feb 26 19:24:46 coreoshost systemd[1]: Started nspawntest.
Feb 26 19:24:46 coreoshost sudo[4932]: core : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/systemd-nspawn --machine nspawntest --directory /home/core/containers/nspawntest_running /
Feb 26 19:24:46 coreoshost echo[4926]: Running systemd-nspawn
Feb 26 19:24:46 coreoshost sudo[4932]: Spawning container nspawntest on /home/core/containers/nspawntest_running.
Feb 26 19:24:46 coreoshost sudo[4932]: Press ^] three times within 1s to kill container.
Feb 26 19:24:46 coreoshost sudo[4932]: Thu Feb 26 19:24:46 UTC 2015
Feb 26 19:24:51 coreoshost sudo[4932]: Thu Feb 26 19:24:51 UTC 2015
Feb 26 19:24:56 coreoshost sudo[4932]: Thu Feb 26 19:24:56 UTC 2015
Feb 26 19:25:01 coreoshost sudo[4932]: Thu Feb 26 19:25:01 UTC 2015

Wrap up

This is just the very tip of the iceberg when it comes to running things with systemd-nspawn. There are a lot of other options you can configure when running your container, like permissions, network configurations, journal configurations, etc. I highly suggest taking a look at the docs to see what there is.

Now that we know how to run a container via systemd-nspawn, next time we'll look at running systemd within a container using systemd-nspawn so that we can manage multiple processes.