Back in January, I published a two-part blog post on hash-based package downloads. Some project needs at FP Complete have pushed this to the forefront recently, and as a result I’ve gotten started on implementing these ideas. I’m hoping to publish regular blog posts on the topic as I continue implementation.
There are a few major goals in the refactoring I’m working on:
Today’s post won’t hit on all of these points, as I’m only going to discuss the first bit of the rewrite I’ve completed: package index management. This work is occurring on the pantry branch of Stack, though you should be well aware that the branch is currently totally unusable outside of the stack update command.
A package index is a term that comes from the Cabal and Hackage worlds. Hackage itself provides a package index, and cabal-install and Stack both download this index to discover packages. The index itself is a tarball (the 01-index.tar file) containing a single cabal file for each revision of a package/version combination. It also contains some other metadata files, like JSON files providing cryptographic hash information on package tarballs. The 01-index.tar file is intended to be downloaded by hackage-security, which provides both security (signature checking and other protections) and resumable downloads.
In its common use case, Stack discovers available packages via a snapshot configuration (e.g., lts-12.0), which tells it the name, version, and Hackage revision of any package available. As a result, it may seem like Stack doesn’t really need the package index. However, it’s still necessary for a few things:
cabal-install, we must have an index available so that the solver can discover new packages
Stack will automatically download the index today when needed (e.g., a snapshot refers to a revision not yet downloaded locally), and can be told to explicitly download a new index via stack update. Because it is highly inefficient to traverse the tarball each time a lookup needs to occur, Stack will also create a cache file mapping package name/version/revision to the offset inside the tarball where the corresponding cabal file is located.
Stack, like cabal-install, allows alternative package indices to be specified. One use case for this is the “corporate firewall” situation (though it applies to other cases too). Some companies have restrictive firewalls in place which block outgoing connections. Or, alternatively, bandwidth may be throttled, and a local mirror would be preferable. Either way, configuring an alternative location to download the Hackage package index from is case 1. To get ahead of myself a bit: there’s no problem with this use case, and Stack will continue to support a configurable mirror location.
The second case is for providing access to packages which are not on Hackage. I’ve used this approach in the past myself. It was one of the original ways you could configure cabal-install to use Stackage. With such an alternative index in place, foo-1.2.3 could mean something different on your machine than on mine. (Epic foreshadowment right there.)
Let’s start with the easy one: building up the offset indexing is slow and memory hungry today. I’ve tried optimizing this in the past, but this is really a pessimal case for Haskell’s memory management: lots of binary blobs getting inserted into a HashMap. Chris Done recently reported to me that this can take over 1GB of memory, discovered due to a build failure on a VPS with swap space disabled.
But there’s a more fundamental problem with indices. I raised an issue two weeks back about a long-standing concern I’ve had with package indices. Remember that epic foreshadowment above? Allowing alternative, non-Hackage package indices means that foo-1.2.3 is now ambiguous. And worse yet, because package index configuration can live in a user-wide configuration file, looking at your project’s stack.yaml may not reveal this at all.
This kind of trade-off made sense in the past. However, we’ve got two things in Stack pushing against such behavior:
Since we’ll allow overriding the package index location for mirroring, there’s obviously no way to stop a user from providing a location that doesn’t mirror Hackage itself. However, we can discourage this by allowing just one package index location instead of the current cascading fallback. We can also drop support for legacy pre-hackage-security 00-index.tar indices, which do not provide security guarantees or access to revision information.
The second change we can make is to be much more thorough about referencing packages via cryptographic hashes instead of by name/version information. This is already necessary for proper reproducibility in a world of Hackage revisions. Part of the ongoing Pantry work will be to automate the process of rewriting configuration files to use cryptographic hashes, which currently is a pain.
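As a small illustration of why a hash removes the ambiguity, here is a hedged sketch in Python. The reference string format (name-version@sha256:digest,size) and the function names cabal_file_ref and matches_ref are assumptions made for this example, not a statement of what the rewritten configuration files will contain.

```python
import hashlib


def cabal_file_ref(name, version, contents):
    """Build a reference that pins the exact bytes of a cabal file.

    The string layout here is illustrative only.
    """
    digest = hashlib.sha256(contents).hexdigest()
    return f"{name}-{version}@sha256:{digest},{len(contents)}"


def matches_ref(ref, contents):
    """Check that some cabal file contents are the ones the reference pins."""
    _, _, tail = ref.partition("@sha256:")
    digest, _, size = tail.partition(",")
    return (
        hashlib.sha256(contents).hexdigest() == digest
        and int(size) == len(contents)
    )
```

Two indices can both claim to provide foo-1.2.3, but only contents with the matching digest and size will satisfy the reference, so a different index or a later Hackage revision cannot silently change what gets built.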
Alright, so that’s change one: you only get one package index in Stack, and it should be a Hackage mirror.
The overarching Pantry plans involve referencing many different kinds of files via their cryptographic hashes. We’ll be able to query them over the network securely, and cache them locally. For that local cache, we’re going to use SQLite, which is a great choice for lots of small files.
The pantry branch of Stack no longer creates that cache with tarball offsets. Instead, when it downloads a new 01-index.tar file from Hackage, it populates an SQLite database with the raw file contents, as well as a table which is essentially Map (PackageName, Version, RevisionNumber) HashOfCabalFile.
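To give a feel for that layout, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (blobs, hackage_cabal, and so on) and the helper functions are assumptions for illustration; the real schema on the pantry branch may well differ.

```python
import hashlib
import sqlite3

# Hypothetical schema: raw contents keyed by hash, plus the
# (name, version, revision) -> hash-of-cabal-file mapping.
SCHEMA = """
CREATE TABLE blobs (
    id INTEGER PRIMARY KEY,
    sha256 TEXT NOT NULL UNIQUE,
    contents BLOB NOT NULL
);
CREATE TABLE hackage_cabal (
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    revision INTEGER NOT NULL,
    blob_id INTEGER NOT NULL REFERENCES blobs(id),
    PRIMARY KEY (name, version, revision)
);
"""


def open_cache(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_cabal(conn, name, version, revision, contents):
    """Insert one cabal file; identical contents are stored only once."""
    digest = hashlib.sha256(contents).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO blobs (sha256, contents) VALUES (?, ?)",
        (digest, contents),
    )
    (blob_id,) = conn.execute(
        "SELECT id FROM blobs WHERE sha256 = ?", (digest,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO hackage_cabal VALUES (?, ?, ?, ?)",
        (name, version, revision, blob_id),
    )
    return digest


def load_cabal(conn, name, version, revision):
    """Return the cabal file contents, or None if the revision is unknown."""
    row = conn.execute(
        "SELECT blobs.contents FROM hackage_cabal "
        "JOIN blobs ON blobs.id = hackage_cabal.blob_id "
        "WHERE name = ? AND version = ? AND revision = ?",
        (name, version, revision),
    ).fetchone()
    return None if row is None else row[0]
```

Because each file is written to the database as it is read from the tarball, nothing forces the whole index to sit in memory at once, which is the property that matters for the memory usage discussed here.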
As I was bragging about a bit on Twitter, this completely solves the high memory usage for cache creation I mentioned above. Now, updating all ~111,000 cabal files from Hackage takes less than 4MB of resident memory.
At first, it seemed that, because we cannot detect Hackage rebases (where the 01-index.tar gets updated), we’d need to totally recalculate the cache each time stack update runs. This is the slow behavior we already have today. Fortunately, thanks to some insight from Oleg Grenrus, this turns out not to be necessary, and we can instead track hashes of the tarball. See Hackage issue #779 for the full discussion, as well as potential alternative implementations like parsing the x-revision info.
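One way tarball hash tracking could work, sketched in Python under assumptions: remember the size and SHA256 of the index that was last processed, and on the next update check whether the new index still begins with exactly those bytes. If so, only the newly appended part needs processing; if not, the cache is rebuilt from scratch. The names summarize and plan_update are hypothetical, and the sketch works on plain byte strings. A real tar file ends with blocks of zero bytes that are overwritten by later appends, so the tracked prefix would have to stop before that trailer.

```python
import hashlib


def summarize(index_bytes):
    """Record what we need to remember about a processed index: size and hash."""
    return len(index_bytes), hashlib.sha256(index_bytes).hexdigest()


def plan_update(previous, new_index):
    """Decide between an incremental update and a full rebuild.

    previous is the (size, sha256) pair from the last run, or None.
    Returns ("incremental", appended_bytes) or ("full", new_index).
    """
    if previous is not None:
        old_size, old_hash = previous
        prefix_unchanged = (
            old_size <= len(new_index)
            and hashlib.sha256(new_index[:old_size]).hexdigest() == old_hash
        )
        if prefix_unchanged:
            return "incremental", new_index[old_size:]
    return "full", new_index
```

The appeal of this kind of check is that it needs no cooperation from the server: a rebase shows up as a changed prefix hash, and everything else is treated as an append.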
There is a downside to this approach, namely we will end up storing all of the cabal files twice. Fortunately, the SQLite storage format with proper table normalization turns out to be pretty good, resulting in about 0.5GB of storage (around the same as the 01-index.tar file itself). However, when we get to Pantry’s network layer in later posts, we’ll see that in many common cases, we won’t need to download the full index at all, saving both bandwidth and disk space. For now, we’re treating disk space as a cheap commodity, which is basically in line with how all of Haskell tooling behaves.
Besides the advantages above, some other nice outcomes of this are:
Now that we’re caching the contents of the cabal files themselves in the SQLite database, the next thing will be caching the contents of the tarballs as well. This raises some interesting design questions regarding whether we cache the full original tarballs as they are, or normalize to a more compact format to allow for more data sharing. After weighing the options, we’re going to go with the latter. I’ve already implemented a proof-of-concept for this which works quite well. Now I need to integrate that with the Stack code base.
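The normalized format isn't pinned down here, so the following Python sketch shows only one possible shape of the idea: store each file from a tarball as a blob keyed by its SHA256, and describe the package as a mapping from path to hash. The names store_tree and make_tarball, and the in-memory dict standing in for the database, are assumptions for illustration rather than the proof-of-concept itself.

```python
import hashlib
import io
import tarfile


def store_tree(blobs, tarball_bytes):
    """Add a tarball's files to a shared blob store; return its path -> hash tree."""
    tree = {}
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            digest = hashlib.sha256(data).hexdigest()
            # Identical files across packages or versions are stored once.
            blobs.setdefault(digest, data)
            tree[member.name] = digest
    return tree


def make_tarball(files):
    """Build a small uncompressed tarball in memory from (path, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in files:
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With a layout like this, two versions of a package that share an unchanged LICENSE or module contribute only one copy of that file to the store, which is the kind of data sharing that keeping the original tarballs intact would not give us.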
If you’re interested in the work going on here and would like to discuss, come hit me up on Stack’s Gitter channel.