As most of us know, performance isn’t a one-dimensional spectrum. There are in fact multiple different ways to judge the performance of a program. A commonly recognized tradeoff is that between CPU and memory usage. Oftentimes, a program can be sped up by caching more data, for example.
conduit is a streaming data library. In that sense, it has two very specific performance criteria it aims for: constant memory usage and minimal CPU overhead.
While CPU performance is always a nice goal, it has never been my top priority in the library’s design, especially given that in the main use case for conduit (streaming data in an I/O context), the I/O cost almost always far outweighs any CPU overhead from conduit.
However, for our upcoming Integrated Analysis Platform (IAP) release, this is no longer the case. conduit will be used in tight loops, where we do need to optimize for the lowest CPU overhead possible.
This blog post covers the first set of optimizations I’ve applied to conduit. There is still more work to be done, and throughout this blog post I’ll be describing some of the upcoming changes I am attempting.
I’ll give a brief summary up front: all of the significant changes are contained in the Data.Conduit.Internal module. If you’ve just been using the public API, your code will be unaffected, besides getting an automatic speedup.

Note that this blog post follows the actual steps I went through (more or less) in identifying the performance issues I wanted to solve. If you want to skip ahead to the solution itself, you may want to skip to the discussion on difference lists, or even straight to continuation passing style, church-encoding, and codensity.
By the way, after I originally wrote this blog post, I continued working on the optimizations I describe as possible future enhancements. Those are actually working out far better than I expected, and it looks like conduit 1.2.0 will be able to ship with them. I’ll be writing a separate blog post detailing those changes. A bit of a teaser is: for vector-equivalent code, conduit now generates core identical to vector itself.
Before embarking on any kind of serious optimizations, it’s important to have some benchmarks. I defined three benchmarks for the work I was going to be doing:
A simple sum: adding up the numbers from 1 to 10000. This is to get a baseline of the overhead coming from conduit.
A Monte Carlo analysis: This was based on a previous IAP blog post. I noticed when working on that benchmark that, while the conduit solution was highly memory efficient, there was still room to speed it up.
Sliding vectors: Naren Sundar recently sent a sliding windows pull request, which gives us fixed-size views into a stream of values. This feature is very useful for a number of financial analyses, especially those involving time series.
Naren’s pull request was based on immutable data structures, and for those cases it is highly efficient. However, it’s possible to be far more memory efficient by writing to a mutable vector instead, and then taking immutable slices of that vector. Mihaly Barasz sent a pull request for this feature and, much to our disappointment, for small window sizes it performed worse than the immutable sliding window implementation. We wanted to understand why.
You can see the benchmark code, which stays mostly unchanged for the rest of this blog post (a few new cases are added to demonstrate extra points). The benchmarks always contain a low-level base case representing the optimal performance we can expect from hand-written Haskell (without resorting to any kind of FFI tricks or the like).
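To make the playing field concrete, here is a minimal sketch of the sum benchmark’s two extremes, in the spirit of (but not identical to) the actual benchmark code:

{-# LANGUAGE BangPatterns #-}
import Data.Conduit
import qualified Data.Conduit.List as CL

-- The conduit version: stream the numbers and fold them together.
sumConduit :: Monad m => m Int
sumConduit = CL.enumFromTo 1 10000 $$ CL.fold (+) 0

-- The hand-written "low level" base case: a tight accumulator loop.
sumLowLevel :: Int
sumLowLevel = go 0 1
  where
    go !acc i
        | i > 10000 = acc
        | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = do
    s <- sumConduit
    print (s == sumLowLevel)

The gap between these two is exactly the conduit overhead the first benchmark measures.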
You can see the first run results, which reflect conduit 1.1.7 plus inlining of a few functions. Some initial analysis:
That hopefully sets the scene enough for us to begin to dive in.
GHC offers a very powerful optimization technique: rewrite rules. This allows you to tell the compiler that a certain expression can be rewritten to a more efficient one. A common example of a rewrite rule would be to state that map f . map g is the same as map (f . g). This can be expressed as:
{-# RULES "map f . map g" forall f g. map f . map g = map (f . g) #-}
Note that GHC’s list rewrite rules are actually more complicated than this, and revolve around a concept called build/foldr fusion.
Let’s look at the implementation of the yield function in conduit (with some newtypes stripped away):
yield :: Monad m => o -> ConduitM i o m ()
yield o = HaveOutput (Done ()) (return ()) o
{-# INLINE [1] yield #-}
{-# RULES "yield o >> p" forall o (p :: ConduitM i o m r).
    yield o >> p = HaveOutput p (return ()) o
  #-}
The core datatype of conduit is recursive. The HaveOutput constructor contains a field for “what to do next.” In the case of yield, there isn’t anything to do next, so we fill that with Done (). However, creating that Done () value just to throw it away after a monadic bind is wasteful. So we have a rewrite rule to fuse those two steps together.
But no such rewrite rule exists for lift! My first step was to add such a rule, and check the results. Unfortunately, the rule didn’t have any real impact, because it wasn’t firing. Let’s put that issue to the side; we’ll come back to it later.
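For reference, the rule I added looked roughly like the following (again with newtypes stripped away; this is a sketch rather than the exact code from the commit). It is valid by the monad laws: lift mr builds PipeM (liftM Done mr), and binding against that Done can be fused away:

-- Sketch: fuse away the intermediate Done constructed by lift.
{-# RULES "lift mr >>= f" forall mr f.
      lift mr >>= f = PipeM (liftM f mr)
  #-}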
One of the nice features introduced in (I believe) GHC 7.8 is that the compiler will now warn you when a rewrite rule may not fire. When compiling conduit, I saw messages like:
Data/Conduit/List.hs:274:11: Warning:
    Rule "source/map fusion $=" may never fire
      because ‘$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘$=’

Data/Conduit/List.hs:275:11: Warning:
    Rule "source/map fusion =$=" may never fire
      because ‘=$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘=$=’

Data/Conduit/List.hs:542:11: Warning:
    Rule "source/filter fusion $=" may never fire
      because ‘$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘$=’

Data/Conduit/List.hs:543:11: Warning:
    Rule "source/filter fusion =$=" may never fire
      because ‘=$=’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘=$=’

Data/Conduit/List.hs:552:11: Warning:
    Rule "connect to sinkNull" may never fire
      because ‘$$’ might inline first
    Probable fix: add an INLINE[n] or NOINLINE[n] pragma on ‘$$’
This demonstrates an important interaction between inlining and rewrite rules. We need to make sure that expressions that need to be rewritten are not inlined first. If they are first inlined, then GHC won’t be able to rewrite them to our more optimized version.
A common approach to this is to delay inlining of functions until a later simplification phase. The GHC simplification process runs in multiple steps, and we can state that rules and inlining should only happen before or after a certain phase. The phases count down from 2 to 0, so we commonly want to delay inlining of functions until phase 0, if they may be subject to rewriting.
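As a hypothetical example of the idiom (the name mapC is mine, not conduit’s actual code):

-- Delaying inlining: the INLINE [0] pragma says "do not inline mapC
-- before simplifier phase 0", leaving phases 2 and 1 free for any
-- rewrite rules that mention mapC to fire.
mapC :: (a -> b) -> [a] -> [b]
mapC = map
{-# INLINE [0] mapC #-}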
Conversely, some functions need to be inlined before a rewrite rule can fire. In stream fusion, for example, the fusion framework depends on the following sequencing to get good performance:
map f . map g
-- inline map
unstream . mapS f . stream . unstream . mapS g . stream
-- rewrite stream . unstream
unstream . mapS f . mapS g . stream
-- rewrite mapS . mapS
unstream . mapS (f . g) . stream
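For readers unfamiliar with that vocabulary, here is a minimal, self-contained sketch of the stream fusion machinery the sequence above assumes (simplified from the stream-fusion and vector libraries, not taken from either verbatim):

{-# LANGUAGE ExistentialQuantification #-}

-- A step of a stream: stop, skip, or yield a value plus a new state.
data Step s a = Done | Skip s | Yield a s

-- A stream is a step function together with an initial state.
data Stream a = forall s. Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream = Stream next
  where
    next []     = Done
    next (x:xs) = Yield x xs

unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
        Done       -> []
        Skip s'    -> go s'
        Yield x s' -> x : go s'

-- mapS is non-recursive, so GHC can inline and fuse it freely.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
        Done       -> Done
        Skip s'    -> Skip s'
        Yield x s' -> Yield (f x) s'

-- The key rule: adjacent stream/unstream pairs cancel.
{-# RULES "stream/unstream" forall s. stream (unstream s) = s #-}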
In conduit, we need to make sure that all of this is happening in the correct order. There was one particular complexity that made it difficult to ensure this happened. conduit in fact has two core datatypes: Pipe and ConduitM, with the latter being a more friendly newtype wrapper around the first. Up until this point, the code for the two was jumbled into a single internal module, making it difficult to track which things were being written in which version of the API.
My next step was to split things into .Pipe and .Conduit internal modules, and then clean up GHC’s warnings to get rules to fire more reliably. This gave a modest performance boost to the sliding vector benchmarks, but not much else. But it does pave the way for future improvements.
The results so far have been uninspiring. We’ve identified a core problem (too many of those Done data constructors being used), and noticed that the rewrite rules that should fix that don’t seem to be doing their job. Now let’s take our first stab at really improving performance: with aggressive rewrite rules.
Our sum benchmark is really simple: use enumFromTo to create a stream of values, and fold (or foldM) to consume that. The thing that slows us down is that, in between these two simple functions, we end up allocating a bunch of temporary data structures. Let’s get rid of them with rewrite rules!
This certainly did the trick. The conduit implementation jumped from 185us to just 8.63us. For comparison, the low level approach (or vector’s stream fusion) clocks in at 5.77us, whereas foldl' on a list is 80.6us. This is a huge win!
But it’s also misleading. All we’ve done here is sneakily rewritten our conduit algorithm into a low-level format. This solves the specific problem on the table (connecting enumFromTo with fold), but won’t fully generalize to other cases. A more representative demonstration of this improvement is the speedup for foldM, which went from 1180us to 81us. The reason this is more realistic is that the rewrite rule is not specialized to enumFromTo, but rather works on any Source.
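To give a flavor of what such a Source-generic rule can look like, here is a sketch (simplified, assuming the pre-1.2 internal Pipe constructors are in scope, and not necessarily the exact code in conduit): a rule that replaces the general connect machinery with a tight loop walking the Pipe constructors directly.

-- Sketch: connect any Source directly to a strict left fold.
connectFold :: Monad m => Source m a -> (b -> a -> b) -> b -> m b
connectFold (ConduitM src) f =
    go src
  where
    go (Done ()) b             = return b
    go (HaveOutput src' _ a) b = go src' $! f b a
    go (NeedInput _ c) b       = go (c ()) b
    go (PipeM msrc) b          = msrc >>= \src' -> go src' b
    go (Leftover src' ()) b    = go src' b

{-# RULES "$$ fold" forall src f b.
      src $$ CL.fold f b = connectFold src f b
  #-}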
I took a big detour at this point, and ended up writing an initial implementation of stream fusion in conduit. Unfortunately, I ran into a dead end on that branch, and had to put that work to the side temporarily. However, the improvements discussed in the rest of this blog post will hopefully reopen the door to stream fusion, which I hope to investigate next.
Now that I’d made the results of the sum benchmark thoroughly useless, I decided to focus on the results of Monte Carlo, where the low level implementation still won by a considerable margin (3.42ms vs 10.6ms). The question was: why was this happening? To understand, let’s start by looking at the code:
analysis = do
    successes <- sourceRandomN count $$ CL.fold
        (\t (x, y) ->
            if (x*x + y*(y :: Double) < 1)
                then t + 1
                else t)
        (0 :: Int)
    return $ fromIntegral successes / fromIntegral count * 4

sourceRandomN :: (MWC.Variate a, MonadIO m) => Int -> Source m a
sourceRandomN cnt0 = do
    gen <- liftIO MWC.createSystemRandom
    let loop 0 = return ()
        loop cnt = liftIO (MWC.uniform gen) >>= yield >> loop (cnt - 1)
    loop cnt0
The analysis function is not very interesting: it simply connects sourceRandomN with a fold. Given that we now have a well behaved and consistently-firing rewrite rule for connecting to folds, it’s safe to say that was not the source of our slowdown. So our slowdown must be coming from:
liftIO (MWC.uniform gen) >>= yield >> loop (cnt - 1)
This should in theory generate really efficient code. yield >> loop (cnt - 1) should be rewritten to \x -> HaveOutput (loop (cnt - 1)) (return ()) x, and then liftIO should get rewritten to generate:

PipeM $ do
    x <- MWC.uniform gen
    return $ HaveOutput (loop $ cnt - 1) (return ()) x
I added another commit to include a few more versions of the Monte Carlo benchmark (results here). The two most interesting are:
Explicit usage of the Pipe constructors:
sourceRandomNConstr :: (MWC.Variate a, MonadIO m) => Int -> Source m a
sourceRandomNConstr cnt0 = ConduitM $ PipeM $ do
    gen <- liftIO MWC.createSystemRandom
    let loop 0 = return $ Done ()
        loop cnt = do
            x <- liftIO (MWC.uniform gen)
            return $ HaveOutput (PipeM $ loop (cnt - 1)) (return ()) x
    loop cnt0
This version ran in 4.84ms, vs the original conduit version which ran in 15.8ms. So this is definitely the problem!
Explicitly force right-associated binding order:
sourceRandomNBind :: (MWC.Variate a, MonadIO m) => Int -> Source m a
sourceRandomNBind cnt0 =
    lift (liftIO MWC.createSystemRandom) >>= \gen ->
        let loop 0 = return ()
            loop cnt =
                lift (liftIO $ MWC.uniform gen) >>=
                (\o -> yield o >> loop (cnt - 1))
         in loop cnt0
Or to zoom in on the important bit:
lift (liftIO $ MWC.uniform gen) >>= (\o -> yield o >> loop (cnt - 1))
By the monad laws, this code is identical to the original. However, instead of standard left-associativity, we have right-associativity of the monadic binds. This code ran in 5.19ms, an approximate threefold speedup vs the left-associated code!
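To make the distinction concrete, here are the two shapes in isolation (illustrative, names mine). By the monad laws they always produce the same result, but for a structure whose bind must traverse its left argument they have very different costs:

leftAssoc, rightAssoc :: Monad m => m a -> (a -> m b) -> (b -> m c) -> m c
leftAssoc  m f g = (m >>= f) >>= g            -- re-traverses the result of (m >>= f)
rightAssoc m f g = m >>= (\x -> f x >>= g)    -- each bind sees a small left side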
This issue of associativity was something Roman Cheplyaka told me about back in April, so I wasn’t surprised to see it here. Back then, I’d looked into using Codensity together with ConduitM, but didn’t get immediate results, and therefore postponed further research until I had more time.
OK, so why exactly does left-associativity hurt us so much? There are actually two reasons:

1. Monadic bind for the Pipe datatype needs to traverse its entire left hand side, so a left-associated chain of binds repeatedly re-traverses structure it has already built (more on this in the next section).

2. Left association prevented the lift and yield rewrite rules from firing, which introduced extra, unnecessary monadic bind operations. Forcing right associativity allows these rules to fire, avoiding a lot of unnecessary data constructor allocation and analysis.

At this point, it became obvious that the main slowdown I was seeing was driven by this problem. The question is: how should we solve it?
To pave the way for the next step, I want to take a quick detour and talk about something simpler: difference lists. Consider the following code:
(((w ++ x) ++ y) ++ z)
Most experienced Haskellers will cringe upon reading that. The append operation for a list needs to traverse every cons cell in its left value. When we left-associate append operations like this, we will need to traverse every cell in w, then every cell in w ++ x, then every cell in w ++ x ++ y. This is highly inefficient, and would clearly be better done in a right-associated style (sound familiar?).
But forcing programmers to ensure that their code is always right-associated isn’t always practical. So instead, we have two common alternatives. The first is: use a better datastructure. In particular, Data.Sequence has far cheaper append operations than lists.
The other approach is to use difference lists. Difference lists are functions instead of actual list values. They are instructions for adding values to the beginning of the list. In order to append, you use normal function composition. And to convert them to a list, you apply the resulting function to an empty list. As an example:
type DList a = [a] -> [a]
dlist1 :: DList Int
dlist1 rest = 1 : 2 : rest
dlist2 :: DList Int
dlist2 rest = 3 : 4 : rest
final :: [Int]
final = dlist1 . dlist2 $ []
main :: IO ()
main = print final
Both difference lists and sequences have advantages. Probably the simplest summary is:

- Difference lists have cheaper construction, with essentially no constant-factor overhead.
- Sequences allow you to analyze the values as you build them, whereas a difference list must first be converted back to a regular list before it can be inspected.
That second point is important. If you need to regularly analyze your list and then continue to append, the performance of a difference list will be abysmal. You will constantly be swapping representations, and converting from a list to a difference list is an O(n) operation. But if you will simply be constructing a list once without any analysis, odds are difference lists will be faster.
This situation is almost identical to our problems with conduit. Our monadic composition operator, like list’s append operator, needs to traverse its entire left hand side. This connection is more clearly spelled out in Reflection without Remorse by Atze van der Ploeg and Oleg Kiselyov (brought to my attention by Roman).
Alright, with that out of the way, let’s finally fix conduit!
There are essentially two things we need to do with conduits:

- Monadically compose them, sequencing one conduit after another.
- Categorically compose them, connecting one conduit to the next.

The latter requires that we be able to case analyze our datatypes, while theoretically the former does not: something like difference lists for simple appending would be ideal. In the past, I’ve tried out a number of different alternative implementations of conduit, none of which worked well enough. The problem I always ran into was that either monadic bind became too expensive, or categorical composition became too expensive.
Roman, Mihaly, Edward and I discussed these issues a bit on Github, and based on Roman’s advice, I went ahead with writing a benchmark of different conduit implementations. I currently have four implementations in this benchmark (and hope to add more).
You can see the benchmark results, which clearly show the codensity version to be the winner. Though it would be interesting, I think I’ll avoid going into depth on the other three implementations for now (this blog post is long enough already).
Implementing Codensity in conduit just means changing the ConduitM newtype wrapper to look like this:
newtype ConduitM i o m r = ConduitM
    { unConduitM :: forall b.
                    (r -> Pipe i i o () m b) -> Pipe i i o () m b
    }
What this says is “I’m going to provide an r value. If you give me a function that needs an r value, I’ll give it that r value and then continue with the resulting Pipe.” Notice how similar this looks to the type signature of monadic bind itself:
(>>=) :: Pipe i i o () m r -> (r -> Pipe i i o () m b) -> Pipe i i o () m b
This isn’t by chance, it’s by construction. More information is available in the Haddocks of the kan-extensions package, or in the above-linked paper and blog posts by Janis and Edward.
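For comparison, the general-purpose version of this pattern from kan-extensions looks like the following; our new ConduitM is exactly this shape, specialized to Pipe:

newtype Codensity m a = Codensity
    { runCodensity :: forall b. (a -> m b) -> m b }

To see why this change is important, let’s look at the new implementations of some of the core conduit functions and type classes: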
yield o = ConduitM $ \rest -> HaveOutput (rest ()) (return ()) o

await = ConduitM $ \f -> NeedInput (f . Just) (const $ f Nothing)

instance Monad (ConduitM i o m) where
    return x = ConduitM ($ x)
    ConduitM f >>= g = ConduitM $ \h -> f $ \a -> unConduitM (g a) h

instance MonadTrans (ConduitM i o) where
    lift mr = ConduitM $ \rest -> PipeM (liftM rest mr)
Instead of having explicit Done constructors in yield, await, and lift, we use the continuation rest. This is the exact same transformation we were previously relying on rewrite rules to provide. However, our rewrite rules couldn’t fire properly in a left-associated monadic binding. Now we’ve avoided the whole problem!
Our Monad instance also became much smaller. Notice that in order to monadically compose, there is no longer any need to case-analyze the left hand side, which avoids the high penalty of left association.
Another interesting quirk is that our Monad instance on ConduitM no longer requires that the base m type constructor itself be a Monad. This is a nice feature of Codensity.
So that’s half the story. What about categorical composition? That certainly does require analyzing both the left and right hand structures. So don’t we lose all of our speed gains of Codensity with this? Actually, I think not. Let’s look at the code for categorical composition:
ConduitM left0 =$= ConduitM right0 = ConduitM $ \rest ->
    let goRight final left right =
            case right of
                HaveOutput p c o  -> HaveOutput (recurse p) (c >> final) o
                NeedInput rp rc   -> goLeft rp rc final left
                Done r2           -> PipeM (final >> return (rest r2))
                PipeM mp          -> PipeM (liftM recurse mp)
                Leftover right' i -> goRight final (HaveOutput left final i) right'
          where
            recurse = goRight final left

        goLeft rp rc final left =
            case left of
                HaveOutput left' final' o -> goRight final' left' (rp o)
                NeedInput left' lc        -> NeedInput (recurse . left') (recurse . lc)
                Done r1                   -> goRight (return ()) (Done r1) (rc r1)
                PipeM mp                  -> PipeM (liftM recurse mp)
                Leftover left' i          -> Leftover (recurse left') i
          where
            recurse = goLeft rp rc final
     in goRight (return ()) (left0 Done) (right0 Done)
In the last line, we apply left0 and right0 to Done, which is how we convert our Codensity version into something we can actually analyze. (This is equivalent to applying a difference list to an empty list.) We then traverse these values in the same way that we did in conduit 1.1 and earlier.
The important difference is how we ultimately finish. The code in question is the Done clause of goRight’s case analysis, namely:
Done r2 -> PipeM (final >> return (rest r2))
Notice the usage of rest, instead of what we would have previously done: used the Done constructor. By doing this, we’re immediately recreating a Codensity version of our resulting Pipe, which allows us to only traverse our incoming Pipe values once each, and not need to retraverse the outgoing Pipe for future monadic binding.
This trick doesn’t just work for composition. There are a large number of functions in conduit that need to analyze a Pipe, such as addCleanup and catchC. All of them are now implemented in this same style.
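As an example of the pattern, here is a simplified sketch of what addCleanup looks like in this style (close in spirit to, but not necessarily identical with, the released code): unwrap by applying to Done, transform the resulting Pipe, and finish with rest so the result is immediately back in Codensity form.

addCleanup :: Monad m
           => (Bool -> m ())   -- True if the conduit ran to completion
           -> ConduitM i o m r
           -> ConduitM i o m r
addCleanup cleanup (ConduitM c0) = ConduitM $ \rest ->
    let go (Done r)                 = PipeM (cleanup True >> return (rest r))
        go (HaveOutput src close x) = HaveOutput (go src) (cleanup False >> close) x
        go (NeedInput p c)          = NeedInput (go . p) (go . c)
        go (PipeM mp)               = PipeM (liftM go mp)
        go (Leftover p i)           = Leftover (go p) i
     in go (c0 Done)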
After implementing this change, the resulting benchmarks look much better. The naive implementation of Monte Carlo is now quite close to the low-level version (5.28ms vs 3.44ms, as opposed to the original 15ms). Sliding vector is also much better: the unboxed, 1000-size window benchmark went from 7.96ms to 4.05ms, vs a low-level implementation at 1.87ms.
One approach that I haven’t tried yet is the type-indexed sequence approach from Reflection without Remorse. I still intend to add it to my conduit benchmark, but I’m not optimistic about it beating out Codensity. My guess is that a sequence data type will have a higher constant factor overhead, and based on the way composition is implemented in conduit, we won’t get any benefit from avoiding the need to transition between two representations.
Edward said he’s hoping to get an implementation of such a data structure into the free package, at which point I’ll update my benchmark to see how it performs.
While this round of benchmarking produced some very nice results, we’re clearly not yet at the same level as low-level code. My goal is to focus on that next. I have some experiments going already relating to getting conduit to expose stream fusion rules. In simple cases, I’ve generated a conduit-compatible API with the same performance as vector.
The sticking point is getting something which is efficient not just for functions explicitly written in stream style, but also provides decent performance when composed with the await/yield approach. While the latter approach will almost certainly be slower than stream fusion, I’m hoping we can get it to degrade to current-conduit performance levels, and allow stream fusion to provide a significant speedup when categorically composing two Conduits written in that style.
The code discussed in this post is now available on the next-cps branch of conduit. conduit-extra, conduit-combinators, and a number of other packages either compile out-of-the-box with these changes, or require minor tweaks (already implemented), so I’m hoping that this API change does not affect too many people.
As I mentioned initially, I’d like to have some time for community discussion on this before I make this next release.