A common analytic task is the Monte Carlo simulation. This involves sampling large numbers of random values in order to draw some kind of conclusion.
Random numbers are often thought of as a weak point in purely functional languages. In a purely functional language like Haskell, we keep a strict separation between pure and impure functions. Since by its very nature, a function generating a random value generates different values each time it’s called, we’re forced to move it into the impure world. At that point, it may seem like we’ve lost a lot of our claimed benefits from purity.
This blog post is intended to demonstrate that, with the right choice of libraries, we’re able to have our cake and eat it too: write pure code for the pure parts of our program and keep side-effects contained. We’ll also explore how Haskell’s type system allows us to write even more concise code than dynamically typed languages, not to mention most statically typed languages, and how we can move from using gigabytes of memory to just a few tens of kilobytes.
For demonstration purposes, we want to start with a simple Monte Carlo simulation: estimating the value of pi. To do this, we’ll focus on a 1-by-1 square, and imagine that the bottom-left corner (i.e., the origin) is the center of a unit circle (i.e., radius of 1). For any point in that square, we can determine whether it falls inside the unit circle by calculating its distance from that bottom-left corner. In other words, for a point (x, y), if x*x + y*y < 1, we’re inside the circle. Otherwise we’re not.
Now comes the Monte Carlo simulation. We need to generate random points in that square, and check whether they’re in the circle. The fraction of points that land inside estimates the area of the quarter of the unit circle that intersects our square. Since the full unit circle has area pi, that quarter has area pi/4, and so with a square of area 1 our probability actually estimates pi divided by 4.
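As a minimal pure sketch of that idea (no libraries yet; the helper name is hypothetical), the whole estimate is just a ratio times 4:

estimatePi :: [(Double, Double)] -> Double
estimatePi points = 4 * fromIntegral hits / fromIntegral (length points)
  where
    -- a point is inside the circle when its squared distance from the origin is below 1
    hits = length [() | (x, y) <- points, x*x + y*y < 1]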
We want to start off with a baseline approach to the problem. A common tool in quant research is Python, together with the Numpy and Scipy libraries. Let’s see how they might be used to come up with a solution:
from numpy import average
from scipy import rand
print average([x[0]*x[0]+x[1]*x[1]<1 for x in rand(10000000,2)])*4
Note: While I’m not an expert in Numpy and Scipy, this example was provided to me by someone who uses those tools regularly. If there’s a problem in the code or my explanation of it, please let me know.
The real magic here is the rand function. It will generate a matrix with the given dimensions, filled with random numbers. We pass in two parameters: 10,000,000 to be the number of experiments to perform, and 2, since each experiment needs two random values (an x and y coordinate). Our program then uses a list comprehension to iterate over the 10 million two-length vectors, and for each vector, determines whether its distance from the origin is less than one. Our boolean result is implicitly converted to a 0 or 1, we take the average, and multiply by 4. When run on my system, I get the result: 3.1413928.
The problem with this approach is memory usage: the entire 20 million member matrix is kept in memory at the same time! My system ended up using 250MB of memory to perform this calculation. But the problem is worse than that, since the memory use will grow linearly as we increase our sample size!
We can fairly easily solve our memory problem by switching to an explicit for loop:
import random
count = 100000000
successes = 0
for i in xrange(count):
    x = random.random()
    y = random.random()
    if x*x + y*y < 1:
        successes += 1
print 4.*successes/count
In comparison to the previous code, this program runs in about 3MB of memory, but has twice the run time. (Recommendations for improvement are welcome.)
While the code isn’t complicated, it is pretty tedious. And maybe this is just the Haskeller in me, but I got seriously tripped up by the need to use 1.0 instead of just 1 when summing up successes. Initially, I just used 1, which resulted in my answer always being 0, since Python performed integer division instead of floating point division. It’s certainly a strange feeling to get a type error at runtime!
So we now have two Pythonic approaches to the problem: a concise, fast, memory-inefficient version, and a longer, slower, memory-efficient version.
UPDATE There are some great comments on the Python code below. I’m going to leave the original content of this post unmodified, but recommend readers look at the comments for clarifications.
Let’s see how Haskell, and in particular FP Complete’s Integrated Analysis Platform (IAP), would address this problem. One of the most common complaints we hear from people about their current analysis tools is their inability to deal well with large data sets. As such, we’ve designed our tool with constant-memory analyses as a primary goal. In order to do that, we use the conduit library, which is one of Haskell’s streaming data libraries.
The nice thing about a streaming data library like conduit is that not only does it make it easy to keep only the bare minimum of data in memory at a time, but it’s also designed around separating pure and impure code. This was one of the areas of concern we raised at the beginning of this blog post. So let’s see how IAP and conduit allow us to solve the problem:
(Note: If you’re viewing this on fpcomplete.com, you can run the code in the page. Since the School of Haskell servers do not perform optimized builds, the run time is longer than what you can expect from a proper compiled build.)
{-# LANGUAGE ExtendedDefaultRules, NoMonomorphismRestriction #-}
import Conduit
main = do
    let cnt = 10000000
    successes <- sourceRandomN cnt $$ lengthIfC (\(x, y) -> x*x + y*y < 1)
    print $ successes / cnt * 4
NOTE If you want to play with this code in the IDE, please click on the “Open in IDE” button above, go to the settings page, and change the environment from Stable to Unstable. We’re using some bleeding-edge packages in this example.
We start off with some language extensions. These give us the same environment as the IAP itself works in, essentially turning off some stricter type rules to allow for more experimentation. We then import the Conduit library, and start off our main function.
The first and third lines in our main function should be pretty straightforward: we define how many simulations we want to perform on line one, and print out our estimate of pi on line three. All of the work happens on line two. Since this is the first time we’re looking at conduit in this blog, we’ll take it slowly.
When we use conduit, we create a data pipeline. In a pipeline, we need some data source (also known as a producer), a data sink (also known as a consumer), and some optional transformers (not used in this example). Our source is the function:
sourceRandomN cnt
This function allocates a new random number generator, and then generates a stream of cnt random values. Unlike Scipy’s rand function, it does not create an in-memory vector. Instead, each new random value is produced on demand.
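To make that on-demand behaviour concrete, here’s a hedged sketch using takeC and sinkList from the same Conduit module: even though the source is asked for up to a million values, the downstream takeC 3 finishes the pipeline after three, so only three are ever generated:

import Conduit

main = do
    -- downstream finishes after three values, so the source stops early
    xs <- sourceRandomN 1000000 $$ takeC 3 =$ sinkList
    print (xs :: [Double])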
Our sink is a little bit more complicated:
lengthIfC (\(x, y) -> x*x + y*y < 1)
lengthIfC is a function which counts how many values in its input stream satisfy some test. (If you’re familiar with Excel, think of COUNTIF.) Its parameter is a function which returns a Bool. And thanks to Haskell’s lightweight lambda syntax, we can easily define our function inline.
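As a quick, hedged spot check of lengthIfC on known input (using yieldMany to turn a plain list into a stream), counting the even numbers in 1 through 10 gives 5:

import Conduit

main = do
    -- count how many of the ten values satisfy the predicate
    n <- yieldMany [1..10 :: Int] $$ lengthIfC even
    print (n :: Int)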
Once we have our source and sink, we use conduit’s connect operator ($$) to plug them together and produce a final result: in this case, the number of random points that fall inside our unit circle.
When I run this program on my system, it takes 44KB of memory. If I increase the sample size to 100 million or higher, it still takes just 44KB of memory, living up to the promise of constant memory usage. And as an added bonus, our program also runs significantly faster: in my test, the Scipy program took 19.620s, and the Haskell program took 0.812s.
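Transformers weren’t needed in the pipeline above, but to show where they would fit, here’s a hedged variant of the same program with a mapC stage fused onto the sink with =$. The transformer computes each point’s squared distance from the origin, so the sink only has to count values below 1:

{-# LANGUAGE ExtendedDefaultRules, NoMonomorphismRestriction #-}
import Conduit

main = do
    let cnt = 10000000
    -- source, transformer, and sink: mapC reshapes the stream before counting
    successes <- sourceRandomN cnt $$ mapC (\(x, y) -> x*x + y*y) =$ lengthIfC (< 1)
    print $ successes / cnt * 4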
In our Scipy version of the program, we had to explicitly state that we wanted two values generated for each simulation via the second parameter to the rand function. In the Python explicit iteration version, we called random.random() twice in each iteration of the loop. How does the Haskell program know to generate two values, and not just one? The answer is embedded in our lambda function:
(\(x, y) -> x*x + y*y < 1)
We declared that our lambda function takes a pair of values. That’s all Haskell needs to know: the type inferencer automatically determines from here that the input to the data sink must be a pair of values, and if the input must be a pair of values, the output of the data source must also be a pair of values. And here’s the really cool part: sourceRandomN is a polymorphic function, and can generate many different types of random values. So if we modified our program to instead take three random values per simulation, it would still work, as the sketch below shows.
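Here’s a hedged sketch of that three-value variant. The only real change is the 3-tuple pattern; and since an octant of the unit sphere has volume pi/6 (the full sphere has volume 4/3 pi), we multiply by 6 instead of 4 to estimate pi:

{-# LANGUAGE ExtendedDefaultRules, NoMonomorphismRestriction #-}
import Conduit

main = do
    let cnt = 10000000
    -- the 3-tuple pattern is enough to make sourceRandomN produce triples
    successes <- sourceRandomN cnt $$ lengthIfC (\(x, y, z) -> x*x + y*y + z*z < 1)
    print $ successes / cnt * 6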
If you’re wondering how this happens, it’s all powered by the Variate typeclass, which has an instance for each type of random value that can be generated. There are instances for Double, Int, and a number of other numeric types, plus instances for pairs, 3-tuples, and 4-tuples. So if you wanted to write a program that required a random Int, Double, and Bool, it would work, as sketched below. (And if you need more than 4 values, you can use nested tuples.)
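As a hedged illustration of that mixed-type case, the sketch below pins the element type with an annotation instead of relying on defaulting, and just prints a handful of (Int, Double, Bool) samples:

import Conduit

main = do
    -- the annotation on samples tells the source which Variate instance to use
    samples <- sourceRandomN 5 $$ sinkList
    mapM_ print (samples :: [(Int, Double, Bool)])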
This raises another question: how does Haskell know to use Doubles in our analysis? This is once again type inference helping us out, determining that we needed a floating point type. The default floating point type is Double, so that’s what Haskell uses.
Compare all that to Python, where accidentally replacing 1.0 with 1 results in a garbage result from the program.
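In Haskell, the analogous mistake simply refuses to compile: (/) only works on Fractional types, so dividing two integer counts forces you to convert explicitly. A small hedged sketch (with hypothetical names):

-- Explicit conversion compiles and does the right thing:
estimate :: Int -> Int -> Double
estimate successes count = 4 * fromIntegral successes / fromIntegral count

-- Whereas this version is rejected at compile time, since Int has no
-- Fractional instance:
-- estimate successes count = 4 * successes / count

main :: IO ()
main = print (estimate 7853982 10000000)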
One other powerful feature that plays into our performance here is rewrite rules. These are instructions we can give to the Haskell compiler to say: “you’re about to do X, but it would be faster to do it a different way.” A simple example of a rewrite rule would be to replace something like:
sum [x..y]
with the mathematically identical:
(x+y)*(y-x+1)/2
Many Haskell libraries, conduit included, use rewrite rules to get rid of intermediate data structures, and thereby let you as a user write high-level code while generating more efficient machine code.
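Here’s a hedged sketch of what the mechanism looks like for that toy example, using a local helper (a hypothetical name) marked NOINLINE so the rule has something stable to match against; real libraries like conduit write their rules against their own combinators, and whether a rule fires depends on optimization settings:

-- naive definition; the RULES pragma lets GHC substitute the closed form
sumTo :: Int -> Int -> Int
sumTo x y = sum [x .. y]
{-# NOINLINE sumTo #-}

{-# RULES
"sumTo/closedForm" forall x y. sumTo x y = (x + y) * (y - x + 1) `div` 2
  #-}

main :: IO ()
main = print (sumTo 1 1000000)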
Part of our upcoming IAP work will be researching even more ways to fuse away data structures in conduit. The great advantage of that is that you’ll be able to start writing code today, and as the rewrite rules improve, your code will speed up automatically.
For a problem as simple as the one we’ve described, it’s not really that difficult to write a solution in most any language. Even a straight C solution can easily be done in under 15 lines of code. The question is: how does it handle large problems?
We’re designing IAP to make it easy and natural to compose smaller programs into larger solutions. By basing everything on constant-memory operations, we’re hoping to help our users avoid running up against a brick wall when their data sets suddenly jump in size by three orders of magnitude. In other words, we believe the first solution you write will be able to be extended to more complicated problems, and much larger data sets.
Stay tuned to this blog for more demonstrations of our upcoming IAP product. And if you have questions about how IAP could fit into your research and analysis platforms, please feel free to contact our support team.