The Linux Tux (Penguin) and the Docker Whale

This post tells the story of some missing data and how it turned out to be related to how we built our Docker images.

To give you the proper context, it will talk about:

  • the problem
  • the root cause
  • how we fixed it
  • what we learned from this

Grab a coffee and make yourself comfortable, we’re going iiiiiiiiin!

The Mystery of the Missing Data

Recently a colleague reported some inconsistent data in two distinct parts of our system. Some data was missing that wasn’t supposed to be missing.

Without going into too much detail, this data is supposed to be synchronized through an eventing mechanism. One of our services emits events which another service picks up.

After digging a bit through the logs of both services we concluded that the receiving service never got the events in question, which means that the emitting service was misbehaving.

Where are my events!?

The emitting service is written in Ruby and emits the event in a Sidekiq job.

After some more log-digging, it turned out that these jobs were still running when the worker was shut down by Kubernetes. So shutting down the service’s worker while it’s performing jobs leads to data loss?

What are you doing, Sidekiq!?

[Image: Angry developer yells at open source software]

Surely that can’t be it, right? Surely a well maintained background processing library such as Sidekiq doesn’t lose jobs on shutdown?

Well, yes and no. The Deployment page in the Sidekiq wiki explains it quite well:

To safely shut down Sidekiq, you need to send it the TSTP signal as early as possible in your deploy process and the TERM signal as late as possible. TSTP tells Sidekiq to stop pulling new work and finish all current work. TERM tells Sidekiq to exit within N seconds, where N is set by the -t timeout option and defaults to 25. Using TSTP+TERM in your deploy process gives your jobs the maximum amount of time to finish before exiting.

If any jobs are still running when the timeout is up, Sidekiq will push those jobs back to Redis so they can be rerun later.

Your deploy scripts must give Sidekiq N+5 seconds to shutdown cleanly after the TERM signal. For example, if you send TERM and then send KILL 10 seconds later, you will lose jobs (if using Sidekiq) or duplicate jobs (if using Sidekiq Pro’s super_fetch).

The last sentence is especially interesting:

For example, if you send TERM and then send KILL 10 seconds later, you will lose jobs (if using Sidekiq) or duplicate jobs (if using Sidekiq Pro’s super_fetch).
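
For contrast, a deploy that follows the wiki’s advice would send the signals roughly in this order. This is only a sketch: the pidfile path is made up, and the exact signals depend on your Sidekiq version (TSTP applies to Sidekiq 5+).

# Hypothetical pidfile location - adjust to your setup.
SIDEKIQ_PID="$(cat tmp/pids/sidekiq.pid)"

kill -TSTP "$SIDEKIQ_PID"   # quiet: stop pulling new jobs, keep finishing current ones
# ... roll out the new code, start the new processes, etc ...
kill -TERM "$SIDEKIQ_PID"   # ask Sidekiq to exit within its -t timeout (default 25s)
sleep 30                    # give it N+5 seconds before doing anything harsher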

So it seems like Sidekiq is being killed with a SIGKILL, but why?

Docker, the ENTRYPOINT, and PID1

Let me skip a bit ahead.

Our Docker image for the service in question contains a docker-entrypoint.sh bash script which - as the name suggests - is the ENTRYPOINT of the image.

A (very simplified) version of this script looks like this:

#!/bin/bash -eu

# Each argument is a command; this allows chaining, e.g. ./docker-entrypoint.sh migrate web
for cmd in "$@"; do
  case "$cmd" in
    bash) bash -i;;
    console) bin/rails console;;
    migrate) bin/rails db:migrate;;
    web) bin/rails server -p "${PORT:-3000}";;
    worker) bundle exec sidekiq -q default -q mailers;;
  esac
done

The for loop allows us to pass multiple commands, for example docker-entrypoint.sh migrate web, which would first run the migrations and only then start Rails. It’s super handy, and it will become relevant at a later point, so I’ve decided to leave it in there.
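
In practice that means a single docker run can chain commands (using the same my-container image name as below):

# Run the migrations, then boot the web server, in that order:
docker run my-container migrate web

# Just start the Sidekiq worker:
docker run my-container worker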

Alright but what does this mean in practice? What’s actually going on in the container? Let’s take a peek at which processes are running when we invoke docker run my-container worker:

PID ... COMMAND
  1 ... /bin/bash -e ./docker-entrypoint.sh worker
  8 ... sidekiq 5.0.5 app [0 of 5 busy]

Interesting. We have bash running as PID1 and sidekiq as PID8. Nothing out of the ordinary, right?

Well, not quite. It turns out that PID1 has great power, and with great power comes great responsibility.

Linux and PID1

In Unix-based operating systems, PID1 gets some special love:

  1. PID1 is expected to reap zombie processes (which is super metal)
  2. PID1 doesn’t get the default signal handling, which means it won’t terminate on SIGTERM or SIGINT unless it explicitly registers handlers to do so (not so metal; there’s a quick demo right after this list)
  3. When PID1 dies all remaining processes are killed with SIGKILL, which cannot be trapped (very unmetal)
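
Point 2 is easy to see for yourself: run a container whose main process never registers a SIGTERM handler and try to stop it (container name made up, default Docker settings assumed):

docker run -d --name naptime alpine sleep 600

# Takes ~10 seconds: PID1 ignores the SIGTERM, so Docker falls back to SIGKILL.
time docker stop naptime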

Alright, so what implications does that have when bash is PID1, as in our case?

  1. bash actually does zombie reaping!
  2. bash also registers signal handlers! Even when it’s PID1, it reacts as expected to SIGTERM by shutting down.
  3. It turns out that while bash does handle SIGTERM, it neither forwards it to its children nor waits for them to exit, which means those children get brutally murdered by SIGKILL.

And now it makes sense that sidekiq doesn’t get a chance to shut down gracefully. So what can we do about it?

While we actually could teach bash to forward signals to its children, doing so is fairly brittle and a bit messy.
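
For reference, that kind of forwarding usually looks something like the following sketch - and the amount of ceremony it needs is exactly why we’d rather avoid it:

# Run sidekiq in the background, forward TERM/INT to it, then wait for it.
bundle exec sidekiq -q default -q mailers &
child=$!

trap 'kill -TERM "$child"' TERM INT

# wait returns early when a trapped signal arrives, so we have to wait a second
# time to pick up the child's real exit status - one of the reasons this gets messy.
wait "$child"
wait "$child"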

Instead let’s try to remove bash from the equation.

Bashing bash

How do we remove bash from the equation, you ask?

exec is our friend here! When we exec a command, it replaces the bash process instead of being started as a child of it.
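
You can see what exec does right in a local shell - the PID stays the same because no new process is created (just an illustration, nothing container-specific):

# Start a bash that prints its own PID and then execs into sleep:
bash -c 'echo "my pid is $$"; exec sleep 30' &

# Inspect that same PID: it now belongs to sleep, bash is gone.
ps -o pid,comm -p "$!"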

With this in mind, let’s update our docker-entrypoint.sh script!

#!/bin/bash -eu

for cmd in "$@"; do
  case "$cmd" in
    bash) exec bash -i;;
    console) exec bin/rails console;;
    migrate) bin/rails db:migrate;;
    web) exec bin/rails server -p "${PORT:-3000}";;
    worker) exec bundle exec sidekiq -q default -q mailers;;
  esac
done

But wait, we didn’t exec the migrate command? Yes, and for good reason.

Remember: exec replaces the bash process. And since bash is no more, it won’t continue executing our script either.

If we’d exec’d every single command, it would basically defeat the purpose of our for-loop since something like docker-entrypoint.sh migrate web would become impossible.

Assuming you use a script like the one above, a good rule of thumb is to only exec “long running” commands, that is, commands which you’d arguably put at the end of the “command chain”, such as web or worker.
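
To make that concrete, here’s how a chained invocation plays out with the updated script (annotated by hand):

./docker-entrypoint.sh migrate worker
# 1. "migrate": bin/rails db:migrate runs as a child of bash, and bash waits for it
# 2. "worker":  exec replaces bash with sidekiq - nothing after this would run,
#               which is why long-running commands belong at the end of the chain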

Alright, but does this solve our immediate issue? Does this ensure that sidekiq shuts down gracefully?

Actually, yes!

But it also means that sidekiq has now become PID1, which - as you might remember - comes with great responsibilities.
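
If we peek at the running processes again (same docker run my-container worker as before), we’d now expect something along these lines:

PID ... COMMAND
  1 ... sidekiq 5.0.5 app [0 of 5 busy]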

That’s Not My Responsibility

As you might remember from earlier, there are a few things which are special about PID1:

  1. it needs to reap zombies
  2. it needs to explicitly register signal handlers (e.g. for SIGTERM)
  3. when it dies, everything else will be killed with SIGKILL

And while sidekiq actually does register signal handlers, it doesn’t do zombie reaping, which may or may not become a problem down the road.

Luckily there exists a solution to this problem: tini.

tini is a super minimalistic init system, specifically written for Docker containers, and as such a perfect fit for the job. It runs as PID1, spawns our command as its child, forwards signals such as SIGTERM to that child, and reaps any zombies along the way.

Let’s put it to work, shall we?

tini in Action

To use tini we need to include it in our Docker image and set it as ENTRYPOINT of our container:

  FROM ruby:2.6

+ RUN apt-get update -qq \
+  && apt-get install -qq --no-install-recommends \
+       tini \
+  && apt-get clean \
+  && rm -rf /var/lib/apt/lists/*

  # Copy app, install gems etc ...

- ENTRYPOINT ["./docker-entrypoint.sh"]
+ ENTRYPOINT ["tini", "--", "./docker-entrypoint.sh"]

Note the usage of [...] (the exec form) in the ENTRYPOINT; it’s important because the shell form would wrap everything in /bin/sh -c and put a shell right back at PID1 - see this article for the why.
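
With tini in place, the process table should look roughly like this (PID numbers are illustrative):

PID ... COMMAND
  1 ... tini -- ./docker-entrypoint.sh worker
  7 ... sidekiq 5.0.5 app [0 of 5 busy]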

And that’s it. Nothing more, nothing less.

What We’ve Learned

With these changes, our app becomes a well-behaved citizen in Docker City. No more lost jobs, no more missing data.

Let’s revisit what we’ve learned:

  • whatever you put into your Docker ENTRYPOINT runs as PID1
  • PID1 is special
  • bash does not forward signals to its children before dying
  • use exec to let your app replace bash
  • let tini be a good PID1 citizen

Resources