A common analytic task is the Monte Carlo simulation. This
involves sampling large numbers of random values in order to come
to some kind of conclusion.
Random numbers are often thought of as a weak point in purely
functional languages. In a purely functional language like Haskell,
we keep a strict separation between pure and impure functions.
Since by its very nature, a function generating a random value
generates different values each time it's called, we're forced to
move it into the impure world. At that point, it may seem like
we've lost a lot of our claimed benefits from purity.
This blog post is intended to demonstrate that, with the right
choice of libraries, we're able to have our cake and eat it too:
write pure code for the pure parts of our program and keep
side-effects contained. We'll also explore how Haskell's type
system allows us to write even more concise code than dynamically
typed languages, not to mention most statically typed languages,
and how we can move from using gigabytes of memory to just a few
tens of kilobytes.
Calculating pi
For demonstration purposes, we want to start with a simple Monte Carlo simulation: estimating the value of pi. To do this, we'll
focus on a 1-by-1 square, and imagine that the bottom-left corner
(i.e., the origin) is the center of a unit circle (i.e., radius of
1). For any point in that square, we can determine if it falls
inside the unit circle by calculating its distance from that
bottom-left corner. In other words, for a point (x, y), if x*x +
y*y < 1, we're inside the circle. Otherwise we're not.
Now comes the Monte Carlo simulation. We generate random points in that square and check whether they fall in the circle. The fraction of points that land inside estimates the area of the part of the unit circle that intersects our square. That region is one quarter of the entire circle, and the whole circle has area pi, so our fraction actually estimates pi divided by 4.
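To make that concrete, here's the whole estimate as a couple of pure Haskell functions. This is a minimal sketch with names of our own choosing, not from any library; the streaming version later in this post avoids materializing the list of points:

-- True when the point falls inside the unit circle centered at the origin.
inCircle :: (Double, Double) -> Bool
inCircle (x, y) = x*x + y*y < 1

-- Multiply the observed fraction by 4 to recover the estimate of pi.
estimatePi :: [(Double, Double)] -> Double
estimatePi pts = 4 * fromIntegral hits / fromIntegral (length pts)
  where hits = length (filter inCircle pts)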
Let's get imperative
We want to start off with a baseline approach to the problem. A
common tool in quant research is Python, together with the Numpy
and Scipy libraries. Let's see how they might be used to come up
with a solution:
from numpy import average
from scipy import rand
print average([x[0]*x[0]+x[1]*x[1]<1 for x in rand(10000000,2)])*4
Note: While I'm not an expert in Numpy and Scipy, this example
was provided to me by someone who uses those tools regularly. If
there's a problem in the code or my explanation of it, please let
me know.
The real magic here is the rand
function. It will
generate a matrix with the given dimensions, filled with random
numbers. We pass in two parameters: 10,000,000 to be the number of
experiments to perform, and 2, since each experiment needs two
random values (an x and a y coordinate). Our program then uses a list comprehension to iterate over the 10 million two-element vectors and, for each vector, determines whether the distance from the origin is less than one. Each boolean result is implicitly converted to 0 or 1; we take the average and multiply by 4. When run on my system, I get the result: 3.1413928.
The problem
The problem with this approach is memory usage: the entire 20-million-element matrix is kept in memory at the same time! My system ended up using 250MB of memory to perform this calculation. But the problem is worse than that: the memory use grows linearly as we increase our sample size!
We can fairly easily solve our memory problem, by going to an
explicit for loop:
import random
count = 100000000
successes = 0
for i in xrange(count):
    x = random.random()
    y = random.random()
    if x*x + y*y < 1:
        successes += 1
print 4.*successes/count
In comparison to the previous code, this program runs in about
3MB of memory, but has twice the run time. (Recommendations for
improvement are welcome.)
While the code isn't complicated, it is pretty tedious. And maybe this is just the Haskeller in me, but I got seriously tripped up by the need to use 1.0 instead of just 1 when summing up successes. Initially, I just used 1, which resulted in my answer always being 0, since Python performed integer division instead of floating point division. It's certainly a strange feeling to get a type error at runtime!
So we now have two Pythonic approaches to the problem: a
concise, fast, memory-inefficient version, and a longer, slower,
memory-efficient version.
UPDATE There are some great comments on the Python code
below. I'm going to leave the original content of this post
unmodified, but recommend readers look at the comments for
clarifications.
Enter Haskell, and IAP
Let's see how Haskell, and in particular FP Complete's
Integrated Analysis Platform (IAP), would address this problem. One
of the most common complaints we hear from people about their
current analysis tools is their inability to deal well with large
data sets. As such, we've designed our tool with constant memory
analyses as a primary goal. In order to do that, we use the
conduit
library, which is one of Haskell's streaming
data libraries.
The nice thing about a streaming data library like conduit is that not only does it make it easy to keep only the bare minimum of data in memory at a time, but it's also designed around separating pure and impure code. This was one of the areas of concern we raised at the beginning of our blog post. So let's see how IAP and conduit allow us to solve the problem:
(Note: If you're viewing this on fpcomplete.com, you can run the
code in the page. Since the School of Haskell servers do not
perform optimized builds, the run time is longer than what you can
expect from a proper compiled build.)
{-# LANGUAGE ExtendedDefaultRules, NoMonomorphismRestriction #-}
import Conduit
main = do
    let cnt = 10000000
    successes <- sourceRandomN cnt $$ lengthIfC (\(x, y) -> x*x + y*y < 1)
    print $ successes / cnt * 4
NOTE If you want to play with this code in the IDE,
please click on the "Open in IDE" button above, go to the settings
page, and change the environment from Stable to Unstable. We're
using some bleeding-edge packages in this example.
We start off with some language extensions. These give us the same environment as the IAP itself works in, essentially turning off some stricter type rules to allow for more experimentation. We then import the Conduit library, and start off our main function.
The first and third lines in our main function should be pretty straightforward: we define how many simulations we want to perform on line one, and print out our estimate of pi on line three. All of the work happens on line two. Since this is the first time we're looking at conduit in this blog, we'll take it slowly.
When we use conduit, we create a data pipeline. In a
pipeline, we need some data source (also known as a
producer), a data sink (also known as a consumer), and some
optional transformers (not used in this example). Our source is the
function:
sourceRandomN cnt
This function allocates a new random number generator, and then
generates a stream of cnt
random values. Unlike
Scipy's rand
function, it does not create an
in-memory vector. Instead, each new random value is produced on
demand.
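For intuition, here's roughly how such a source could be hand-rolled on top of the mwc-random package. This is only a sketch (using current conduit type names), not the library's actual definition:

import Control.Monad (replicateM_)
import Conduit
import qualified System.Random.MWC as MWC

-- A hand-rolled stand-in for sourceRandomN: allocate one generator,
-- then yield n uniform random values one at a time, never building
-- an in-memory vector.
sourceRandomN' :: (MonadIO m, MWC.Variate a) => Int -> ConduitT i a m ()
sourceRandomN' n = do
    gen <- liftIO MWC.createSystemRandom
    replicateM_ n (liftIO (MWC.uniform gen) >>= yield)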
Our sink is a little bit more complicated:
lengthIfC (\(x, y) -> x*x + y*y < 1)
lengthIfC is a function which counts how many values in its input stream satisfy some test. (If you're familiar with Excel, think of countif.) Its parameter is a function which returns a Bool. And thanks to Haskell's lightweight lambda syntax, we can easily define our function inline.
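A counting sink like this can be expressed as a strict left fold over the stream. Here's one way it might look, sketched in terms of conduit's foldlC combinator rather than the library's actual definition:

-- Count the elements satisfying the predicate, folding over the
-- stream without ever accumulating it in memory.
lengthIfC' :: (Monad m, Num len) => (a -> Bool) -> ConduitT a o m len
lengthIfC' f = foldlC (\n x -> if f x then n + 1 else n) 0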
Once we have our source and sink, we use conduit's connect operator ($$) to plug them together and produce a final result: in this case, the number of random points that fall inside our unit circle.
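The same connect pattern works for any matching source and sink pair. As a tiny standalone illustration, using the standard yieldMany and sumC combinators:

import Conduit

main :: IO ()
main = do
    -- Connect a source of the numbers 1..10 to a summing sink.
    total <- yieldMany [1 .. 10 :: Int] $$ sumC
    print total  -- prints 55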
When I run this program on my system, it takes 44KB of memory.
If I increase the sample size to 100 million or higher, it
still takes just 44KB of memory, living up to the promise of
constant memory usage. And as an added bonus, our program also runs
significantly faster: in my test, the Scipy program took 19.620s,
and the Haskell program took 0.812s.
The benefits of a type system
In our Scipy version of the program, we had to explicitly state
that we wanted to have two values generated for each simulation via
the second parameter to the rand
function. In the
Python explicit iteration version, we called random
twice in each iteration of the loop. How does the Haskell program
know to generate two values, and not just one? The answer is
embedded in our lambda function:
(\(x, y) -> x*x + y*y < 1)
We declared that our lambda function takes a pair of values.
That's all Haskell needs to know: the type inferencer automatically
determines from here that the input to the data sink must be a pair
of values, and if the input must be a pair of values, the output of
the data source must also be a pair of values. And here's
the really cool part: sourceRandomN
is a polymorphic
function, and can generate many different types of random values.
So if we modified our program to instead draw three random values per experiment, it would still work.
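For instance, the three-value variant might look like this; only the lambda and the scaling factor change, and type inference propagates the 3-tuple type back to the source. (The sphere math here is our addition: an octant of the unit ball has volume pi/6, so we scale by 6 instead of 4.)

-- Each experiment now draws three Doubles; a point lies inside the
-- unit sphere when x*x + y*y + z*z < 1.
successes <- sourceRandomN cnt $$ lengthIfC (\(x, y, z) -> x*x + y*y + z*z < 1)
print $ successes / cnt * 6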
If you're wondering how this happens, it's all powered by the Variate typeclass, which has an instance for each type of random value that can be generated. There are instances for Double, Int, and a number of other numeric types, plus instances for pairs, 3-tuples, and 4-tuples. So if you wanted to write a program that required a random Int, Double, and Bool, it would work. (And if you need more than 4 values, you can use nested tuples.)
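Here's a quick sketch of that mixed-type case; the printing consumer is just for illustration:

import Conduit

main :: IO ()
main = sourceRandomN 10 $$ mapM_C (\(i, d, b) ->
    -- Each component comes from its own Variate instance.
    print (i :: Int, d :: Double, b :: Bool))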
This raises another question: how does Haskell know to use Doubles in our analysis? This is once again type inference helping us out, determining that we needed a floating point type. The default floating point type is Double, so that's what Haskell uses.
Compare all that to Python, where accidentally replacing 1.0 with 1 results in a garbage result from the program.
Rewrite rules
One other powerful feature that plays into our performance here
is rewrite rules. These are instructions we can give to the
Haskell compiler to say: "you're about to do X, but it would be
faster to do it a different way." A simple example of a rewrite
rule would be to replace something like sum [x..y] with the mathematically identical (x+y)*(y-x+1)/2.
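In GHC, such a rule is written with a RULES pragma. Here's a sketch of what one could look like, specialized to Int (illustrative only: a real rule must preserve semantics for every input, and this identity assumes a non-empty range):

{-# RULES
"sumInt/enumFromTo" forall (x :: Int) (y :: Int).
    sum (enumFromTo x y) = (x + y) * (y - x + 1) `div` 2
  #-}

Note that sum [x..y] desugars to sum (enumFromTo x y), which is the form the rule matches.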
Many Haskell libraries, conduit included, use rewrite rules to get rid of intermediate data structures, and thereby let you as a user write high-level code while generating more efficient machine code.
Part of our upcoming IAP work will be researching even more ways
to fuse away data structures in conduit. The great advantage of
that is that you'll be able to start writing code today, and as the
rewrite rules improve, your code will speed up automatically.
A solution that scales
For a problem as simple as the one we've described, it's not really that difficult to write a solution in almost any language. Even a straight C solution can easily be done in under 15 lines of code. The question is: how does it handle large problems?
We're designing IAP to make it easy and natural to compose
smaller programs into larger solutions. By basing everything on
constant-memory operations, we're hoping to help our users avoid
running up against a brick wall when their data sets suddenly jump
in size by three orders of magnitude. In other words, we believe
the first solution you write will be able to be extended to more
complicated problems, and much larger data sets.
Stay tuned to this blog for more demonstrations of our upcoming
IAP product. And if you have questions about how IAP could fit into
your research and analysis platforms, please feel free to contact our support team.