This is part 1 of a two-part series. This post will define the problem we’re trying to solve, and part 2 will go into some details on a potential storage mechanism to make this a reality.
Suppose you’re working on a highly regulated piece of software. For example, something on a defense contract, or a medical device, or the space shuttle. One goal that most regulators will have is that we can fully determine how the software was built at any point in time. The gold standard for this is fully reproducible builds, where you get byte-identical artifacts from rerunning the build system at different times.
Unfortunately, not all of our build tools support that, for a variety of reasons which I’m not going to go into. The Debian project has been making great strides in that direction, as has NixOS. But let’s talk about a slightly weaker guarantee: reproducible build plans.
The idea here is simple: for a given set of source files, I can deterministically know exactly which versions of its dependencies will be used. Usually, there’s some kind of boundary to how deeply this determinism goes. For example, which of the following are determined:
NOTE: Other things, like filesystem state, also apply. For that matter, in a crazy build system, GPS location could matter too, if it somehow affected the build. But these are some of the most common cases.
Building with something like Nix will guarantee determinism in the first two bullets. Docker can be (ab)used to give the same guarantee. Virtual machines can give guarantees about the kernel as well.
The rest of this blog post will talk about just that first bullet: language-specific source file determinism. That’s not because the other points are unimportant, but because:
I’ll primarily be talking about how this affects the Haskell world, and in particular the Stack build tool, but the ideas hopefully generalize well to other languages too.
A primary design goal in Stack is reproducible build plans, usually (but not exclusively) provided via Stackage Snapshots. These snapshots define a compiler version, a set of packages and their versions*, and various configuration like build flags. These snapshots are also immutable. Most users use the Long Term Support (LTS) flavor of snapshots, and end up with a stack.yaml configuration file like the following:
resolver: lts-10.3
Stack knows where to download the lts-10.3.yaml configuration file from (specifically, from a GitHub repo), and takes care of that for you automatically. This looks perfectly reproducible: LTS 10.3 is immutable, fully determines the exact content of all of its packages, and the flags to provide to build it. Given the same OS and same executable of the Stack build tool, you should be able to make a very strong argument to a regulator that this is a fully reproducible build plan… right?
* And for those familiar: also specifies Hackage revisions of the cabal file.
How do you know that LTS 10.3 is immutable? Easy: I just told you! And I am clearly:
lts-10.3.yaml file.
There are clearly no other people with push access to the repo, or someone at GitHub with the ability to override our access controls.
leftpad being used.
Obviously, my goal as one of the Stackage Curators is to strive to deliver on the guarantees we’re claiming. We want snapshots to remain immutable for all time. But we can’t ignore the fact that some things are completely outside of our control. And a good regulator will notice and challenge this.
OK, let’s pretend for just a moment that you could convince your regulator that snapshots are totally immutable and awesome. Next, she’s going to open up that lts-10.3.yaml file and see something along the lines of:
compiler: ghc-8.2.2
packages:
- name: foobar
  version: 1.2.3
  flags:
    be-awesome: true
# And lots and lots and lots more
# Note that our config files in practice look
# nothing like this :)
I imagine a conversation going something like this:
Regulator: Alright, how do you know what foobar-1.2.3 is?
Developer: Well, obviously you go to your package index… which isn’t specified in the snapshot file, of course. It’s specified in Stack’s global config.
Regulator: Why?
Developer: Well, it allows people to more easily host mirrors.
Regulator: So you mean if you change some other config file, it can totally change which foobar-1.2.3 is used?
Developer: Yeah, but that’s totally a feature, not a bug. And anyway, we guarantee in our build process that this doesn’t happen.
Regulator: OK. Fine. And how do you know that when you download foobar-1.2.3 that it contains the exact same content at all times?
Developer: Oh, remember how I told you that Michael’s a real trustworthy guy and runs Stackage Snapshots? Yeah, same for the Hackage package index.
Notice the pattern here. Even taken as a given that everyone wants to work towards immutability, if my job is to make guarantees to a regulator, everyone’s best intentions are irrelevant.
The solution to this is relatively straightforward. Instead of trusting some arbitrary identifier which gives no guarantees of the file contents, let’s consider this reality instead:
resolver:
  name: lts-10.3 # display purposes only
  sha256: bd7a6cbf8bce34086aff452c03ae1f3d8e0bbe9427f753936fabcdd797848d06
  bytes: 6693345 # byte count, avoid an overflow attack
Now the conversation with the regulator:
Regulator: How do we know what lts-10.3 is?
Developer: We don’t, and we don’t care.
Regulator: What’s it there for?
Developer: Documentation purposes only.
Regulator: OK, and how do we know that we have the right snapshot content?
Developer: We perform a cryptographic hash on the file contents and ensure it matches the hash we placed in our config file.
This depends on trusting cryptographic hashes (which most regulators are willing to do in my experience), and on having some way of finding the config file based on the cryptographic hash (more on that in a bit). In exchange, we have a guarantee that the snapshot cannot be changed without detection, which is not the case with lts-10.3 as the only identifier.
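To make the check concrete, here is a minimal Python sketch of verifying downloaded content against a pinned hash and byte count. This is not how Stack implements anything; the names `read_verified` and `VerificationError` are made up for illustration. It also shows why the byte count is useful: we can stop reading as soon as the content grows past the pinned size, rather than buffering whatever a server decides to send.

```python
import hashlib
import io


class VerificationError(Exception):
    """Raised when content does not match its pinned hash or size."""


def read_verified(stream, expected_sha256, expected_bytes, chunk_size=65536):
    """Read a byte stream, rejecting content with the wrong length or hash."""
    hasher = hashlib.sha256()
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # Bail out early instead of buffering an unbounded download.
        if total > expected_bytes:
            raise VerificationError("content exceeds pinned byte count")
        hasher.update(chunk)
        chunks.append(chunk)
    if total != expected_bytes:
        raise VerificationError("content shorter than pinned byte count")
    if hasher.hexdigest() != expected_sha256:
        raise VerificationError("SHA256 mismatch")
    return b"".join(chunks)


# Hypothetical usage with an in-memory "download".
snapshot = b"compiler: ghc-8.2.2\n"
pinned_hash = hashlib.sha256(snapshot).hexdigest()
content = read_verified(io.BytesIO(snapshot), pinned_hash, len(snapshot))
```

Note that nothing here depends on where the bytes came from: a mirror, a cache, or an untrusted server are all equally acceptable once the hash matches.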
Similarly, we would want to extend the snapshot format itself to retain this metadata:
packages:
- name: foobar
  version: 1.2.3
  flags:
    be-awesome: true
  sha256: 7cbb4101018cdc92244c321db8c99b709681f4b7c3ea2ce24ceea0d9ae1595ce
  bytes: 1234
There’s no way for an attacker to slip in a nefarious foobar-1.2.3 without breaking SHA256 security. And once again, we can use the hash for performing downloads, and then simply verify the contents actually contain foobar-1.2.3 by inspecting metadata (in Haskell land: the cabal package file).
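As a rough illustration of that last step, the Python sketch below pulls the top-level name and version fields out of cabal file text and compares them to what the snapshot claims. It is deliberately naive and is not a real cabal parser (it ignores conditionals, multi-line values, and so on); the function names `cabal_identity` and `check_identity` are hypothetical.

```python
def cabal_identity(cabal_text):
    """Naively extract the top-level name and version from cabal file text."""
    fields = {}
    for line in cabal_text.splitlines():
        # Top-level fields start in column 0; skip blanks, indented lines, comments.
        if not line or line[0].isspace() or line.startswith("--"):
            continue
        key, sep, value = line.partition(":")
        key = key.strip().lower()  # cabal field names are case-insensitive
        if sep and key in ("name", "version"):
            fields.setdefault(key, value.strip())
    return fields.get("name"), fields.get("version")


def check_identity(cabal_text, expected_name, expected_version):
    """True if the metadata really describes the package the snapshot names."""
    return cabal_identity(cabal_text) == (expected_name, expected_version)
```

The point of the sketch is the ordering: the hash decides what gets downloaded and trusted, and the name and version are only checked afterwards as a sanity check on the snapshot’s labeling.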
I love typing in resolver: lts-10.3: it’s easy to remember, quick, and explains exactly what I want. But easy and quick are not the cornerstones of regulated software. To make this story more palatable, we could easily add some tooling support, e.g.:
stack add-hashes, which adds the cryptographic hashes to a stack.yaml file
--verified mode (or similar), which refuses to download anything that doesn’t have a cryptographic hash to back it up
These could even be provided outside of the build tool itself; there’s no necessity for it being in Stack.
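Neither of these tools exists as described, so the following Python sketch is purely hypothetical. It models a stack.yaml as a plain dictionary, and `fetch` stands in for whatever mechanism retrieves the snapshot bytes for a given name. `add_hashes` pins a bare resolver name to a hash and byte count, and `is_verified` is the check a verified mode would apply before downloading anything.

```python
import hashlib


def add_hashes(config, fetch):
    """Return a copy of config with a bare resolver name pinned by hash and size.

    `fetch` is a hypothetical callable mapping a snapshot name to its raw bytes.
    """
    resolver = config["resolver"]
    if isinstance(resolver, dict):
        return dict(config)  # already pinned, nothing to do
    content = fetch(resolver)
    pinned = {
        "name": resolver,  # display purposes only
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
    }
    return {**config, "resolver": pinned}


def is_verified(config):
    """True only if the resolver carries both a sha256 and a byte count."""
    resolver = config.get("resolver")
    return isinstance(resolver, dict) and "sha256" in resolver and "bytes" in resolver
```

One caveat with any tool like this: the hashes are only as trustworthy as the content that was fetched at the moment they were recorded, so the pinned file is what needs to be reviewed and kept under version control.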
This may be a specific quirk of Haskell, but I’ll spell it out here anyway. It’s common in Haskell build tools to want to analyze the build metadata files (cabal package files) to determine dependency trees. Therefore, we’d want to support downloading them separately, e.g.:
packages:
- name: foobar
  version: 1.2.3
  flags:
    be-awesome: true
  sha256: 7cbb4101018cdc92244c321db8c99b709681f4b7c3ea2ce24ceea0d9ae1595ce
  bytes: 1234
  cabal-file:
    sha256: 7cbb4101018cdc92244c321db8c99b709681f4b7c3ea2ce24ceea0d9ae1595ce
    bytes: 10685
This allows us to download the metadata without downloading the entire package. Also, for those familiar with it, this provides a robust way to handle Hackage file revisions.
In the next post, we’ll discuss how to create a storage system that can provide downloads of packages, package metadata, and snapshot definitions. Stay tuned!