Tiny Docker Images for Scala Native with Multi-Stage Builds

I've been really excited by the rapid progress of Scala Native since its initial release just a few months ago. As a systems-oriented Scala hacker, I'm eager to use my favorite language for small, standalone tools, without some of the downsides of the JVM.

In this post, I'll explain how to build tiny Docker images with Scala Native--in the example repo here, we can reduce the size of a running image from 680MB down to 16MB. To do this, we'll use Alpine Linux, multi-stage Docker builds, and some "fun" Linux binary and symbol-table hacking.

I'd also like to take this opportunity to thank my colleague Justin Nauman for pointing me toward the multi-stage build technique, and Alex Ellis for his excellent blog post on the topic.

Native Binaries and Dynamic Linking

A Scala Native build is relatively complex, and has several stages:

.scala source code is compiled into .nir (Native Intermediate Representation) files
.nir files are compiled into .ll (LLVM) files
.ll files are compiled into .o (native object) files
.o files are linked into a final executable

However, this process isn't necessarily enough to create a truly self-contained binary, because of dynamic linking. Essentially, the linker records in the output executable references to shared library files that must be present on the system for the program to run -- typically, these libraries have extensions like .so, .dylib, and the notorious .dll. Then, at run time, the dynamic loader loads the executable and links in the shared libraries. Unsurprisingly, this can be quite error-prone.

To see this in action, you can use the Linux ldd utility to print out the dynamic libraries of any binary program. For example, here are the dynamic libraries for git on Alpine 3.3:

This tells us that git links to the shared libraries for PCRE, zlib, crypto, and the standard libc, and if you look closely, you'll see that it also points at specific versions at concrete file paths. This is a huge obstacle to building portable UNIX binaries. Even worse, the version of libc in use here, musl, isn't quite compatible with the more common glibc.

In contrast, here's the output of the same command on Ubuntu 17.04:

We can see here that we have a different set of dependencies -- including pthread, librt, and the special vdso -- in different versions. As a result, if we try to run the Ubuntu executable on the Alpine system, this happens:

That is completely unhelpful, but if we use ldd we can see what's happening:

As you can see, it's failing to load the PCRE library, and then choking on all the missing symbols, i.e., functions.
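
When a binary links against many libraries, it can help to filter the ldd output down to just the entries that failed to resolve. Here is a minimal sketch; missing_libs is a hypothetical helper name, and it assumes the glibc ldd format, where an unresolved entry reads "libfoo.so.1 => not found". Other loaders, including musl's on Alpine, may phrase the failure differently, so treat the pattern as something to adapt.

```shell
# missing_libs: read ldd-style output on stdin and print the names of
# the libraries that the loader could not resolve.
# Assumption: glibc-style lines such as "libfoo.so.1 => not found".
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Typical use (the binary path is only an example):
#   ldd /usr/bin/git | missing_libs
```

An empty result means every dependency in the listing was resolved on the current system.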

Static Linking and Platform Lock-in

At this point, you may be wondering how it's possible to build portable binary software distributions for UNIX at all. The standard technique is static linking, which essentially copies the body of the shared library code into the output executable. The catch is that the process to perform this static linking is itself incredibly fiddly and platform-dependent, especially if you consider UNIX variants like Mac OS.

For this reason, very few language toolchains can do static linking in a platform-neutral way. Most, like Golang, Rust, and indeed Scala Native, simply give you a hook for passing platform-specific linker options. The catch, though, is that as soon as you start writing platform-specific linker flags into your build, you've either locked yourself into a single platform, or else committed yourself to maintaining N complex build files.
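
To make the lock-in concrete, here is a small sketch of the kind of per-platform switch that tends to creep into a build. static_link_flags is a hypothetical helper, not part of any real toolchain; the only facts it leans on are that GCC and Clang accept -static on Linux and that macOS does not support fully static executables.

```shell
# static_link_flags: print the linker flags a build might pass to get a
# fully static executable on the given OS (as reported by `uname -s`).
# Hypothetical helper for illustration only.
static_link_flags() {
  case "$1" in
    Linux)
      # GCC and Clang both accept -static on Linux.
      echo "-static"
      ;;
    Darwin)
      # macOS does not support fully static executables, so pass nothing.
      echo ""
      ;;
    *)
      echo "unsupported platform: $1" >&2
      return 1
      ;;
  esac
}

# Typical use:
#   static_link_flags "$(uname -s)"
```

Every new target adds another branch, and each branch usually needs its own testing, which is exactly the maintenance burden described above.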

The Minimum Viable Docker Image, and Build-time vs. Run-time Dependencies

So, to avoid the quagmire of platform-specific build config, we're going to try to create a minimal, reproducible environment that can build and run our program. We'll start with Alpine Linux, which is an awesome minimal base distribution that weighs in at just 2MB. We'll then install the tools we need and build our software. To achieve portability, we'll rely on Docker to package the filesystem and library dependencies together with the program.

Looking at Dockerfile.alpine.big in our repo, before we compile our program, we need:

java
scala
sbt
C build tools
LLVM
git
wget
all shared libraries
source code for RE2

These dependencies add up: even though the final executable is just 5.3MB, this image weighs in at 680MB(!), which is definitely in the undesirable range. How do we trim this down?

The traditional way to do this is to use two Dockerfiles, one to perform the build and a slimmer one for run-time, plus a shell script or two to stitch them together. This works, but it can be error-prone to maintain, especially for large projects with complex dependencies and multiple subsystems. However, recent updates to Docker give us an alternative approach: multi-stage builds. To use them, you'll need a very recent Docker distribution from the Edge channel.
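
For comparison, the stitching script in the traditional two-Dockerfile approach usually looks something like the sketch below. Every name in it (Dockerfile.build, Dockerfile.run, the myapp images, the /build/myapp path) is a placeholder rather than something from the example repo, and the DOCKER variable is only there so the steps can be dry-run with DOCKER=echo.

```shell
#!/bin/sh
# Sketch of the two-Dockerfile "builder pattern".
# All file, image, container, and path names are hypothetical.
# Override DOCKER (for example DOCKER=echo) to print the steps instead.
DOCKER="${DOCKER:-docker}"

build_with_two_dockerfiles() {
  # 1. Build the heavyweight image that contains the whole toolchain.
  "$DOCKER" build -t myapp-build -f Dockerfile.build . || return 1
  # 2. Create (but do not start) a container so files can be copied out.
  "$DOCKER" create --name myapp-extract myapp-build || return 1
  # 3. Copy the compiled binary from the container to the host.
  "$DOCKER" cp myapp-extract:/build/myapp ./myapp || return 1
  # 4. Remove the temporary container.
  "$DOCKER" rm -f myapp-extract || return 1
  # 5. Build the slim run-time image, which copies ./myapp from the host.
  "$DOCKER" build -t myapp -f Dockerfile.run .
}
```

That is five steps across three files just to move one binary between images, which is where the maintenance pain comes from.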

Multi-stage Docker Builds

Multi-stage builds are a new Docker feature that lets you fully automate complex build workflows in a single Dockerfile. The linked article demonstrates how to build a super-slim Go executable like this:

Essentially, the second FROM directive lets your build start over from a clean slate, while the COPY --from=builder directive copies files in from a previous stage. This is basically what we'll do for our app, just with more steps.
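
To show the shape of the technique in isolation, here is a minimal two-stage Dockerfile, written out from the shell so it can be tried in a scratch directory. It is a stand-in rather than the Dockerfile from the example repo: it compiles a placeholder hello.c instead of a Scala Native project, and the alpine:3.6 tag, the /build path, and the write_multistage_dockerfile helper are all assumptions made for illustration.

```shell
# Write a minimal two-stage Dockerfile into the current directory.
# Hypothetical example: image tags, paths, and hello.c are placeholders.
write_multistage_dockerfile() {
  cat > Dockerfile <<'EOF'
# Stage 1: install the toolchain and build; "builder" names this stage.
FROM alpine:3.6 AS builder
RUN apk add --no-cache build-base
WORKDIR /build
COPY hello.c .
RUN gcc -static -o hello hello.c

# Stage 2: start again from a clean base and copy in only the binary.
FROM alpine:3.6
COPY --from=builder /build/hello /usr/local/bin/hello
CMD ["hello"]
EOF
}
```

Only the last stage ends up in the tagged image, so the compiler and everything else installed in the builder stage are left behind.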

Whew! Once we run this with docker build -t scala-native-alpine ., we can check out the resulting image:

Just 16MB, which is roughly a 42x reduction in size from 680MB!
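
As a quick sanity check on that ratio, using the rounded sizes quoted in this post (size_ratio is just a throwaway helper):

```shell
# size_ratio: divide the first size by the second, to one decimal place.
size_ratio() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

size_ratio 680 16   # prints 42.5
```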

Finally, we're ready to run the binary, which should take about a minute:

You can look at the code for the example Scala Native project, but this is essentially doing a bunch of C-level math operations to render an 800 x 600 image file.

Next Steps

If you've followed along this far, you've learned how to build tiny Scala Native images with Docker and Alpine Linux. The big benefit of this approach is that the whole build lifecycle is encapsulated in a single Dockerfile. This will help us out immensely when building more complex applications.

In a future post, I'd like to write about Dinosaur, a simple CGI-based web framework for Scala Native, and some of the quirks of multi-process web programming from a systems perspective. In the meantime, we'd love to hear from you on Twitter if you've found this post useful--reach us at @spantreellc and @RichardWhaling!