Working with data in Haskell

In data mining or general data exploration, it's common to need to access data efficiently and without ceremony. Typically, either a programming language is designed specifically for this case, like R, or a library is written for it, like pandas for Python.

By implementing this in Haskell, we can improve upon this area with all the benefits that come from using Haskell rather than Python or R.

Let's look at an example of doing this in Haskell, and compare with how this is done in Python's pandas. The steps are:

  1. Download a zip file containing a CSV file.
  2. Unzip the file.
  3. Read through the CSV file.
  4. Do some manipulation of the data from the file.

In Haskell we already have all the libraries needed (streaming HTTP, CSV parsing, etc.) to achieve this goal, so specifically for this post I've made a wrapper package that brings them together, much as pandas does.

Python example

This example code was taken from Modern Pandas. In Python we request the web URL in chunks, which we then write to a file. Next, we unzip the file, and then the data is available as df, with column names downcased.

import zipfile
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

r = requests.get('https://chrisdone.com/ontime.csv.zip', stream=True)
with open("flights.csv", 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)

zf = zipfile.ZipFile("flights.csv.zip")
filename = zf.filelist[0].filename
fp = zf.extract(filename)
df = pd.read_csv(fp, parse_dates=["FL_DATE"]).rename(columns=str.lower)

Finally, we can look at the 5 rows starting at row 10, for the columns fl_date and tail_num, like this:

df.ix[10:14, ['fl_date', 'tail_num']]

=>

    fl_date     tail_num
10  2014-01-01  N002AA
11  2014-01-01  N3FXAA
12  2014-01-01  N906EV
13  2014-01-01  N903EV
14  2014-01-01  N903EV

Python: good and bad

Good parts of the Python code:

Bad parts of the Python code:

Let's compare with the solution I prepared in Haskell. While reading, you can also clone the repository that I put together:

$ git clone git@github.com:chrisdone/labels.git --recursive

The wrapper library created for this post is under labels-explore, and all the code samples are under labels-explore/app/Main.hs.

Haskell example

I prepared the module Labels.Explore which provides us with some data manipulation functionality: web requests, unzipping, CSV parsing, etc.

{-# LANGUAGE TypeApplications, OverloadedStrings, OverloadedLabels,
    TypeOperators, DataKinds, FlexibleContexts #-}

import Labels.Explore

main =
  runResourceT $
  httpSource "https://chrisdone.com/ontime.csv.zip" responseBody .|
  zipEntryConduit "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    (set #downcase True csv) .|
  dropConduit 10 .|
  takeConduit 5 .>
  tableSink

Output:

fl_date     tail_num
2014-01-01  N002AA
2014-01-01  N3FXAA
2014-01-01  N906EV
2014-01-01  N903EV
2014-01-01  N903EV

Breaking this down: a pipeline of the form src .| c .| c .> sink can be read like the UNIX pipe src | c | c > sink.
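
If the operators are unfamiliar, the same source-transform-sink shape can be written directly with the conduit library, which (an assumption on my part) is what this wrapper builds on. For instance, here is a minimal plain-conduit pipeline that copies the first 200 characters of a file:

import Conduit

-- src | c | c | c > sink: read a file, decode it, keep the first 200
-- characters, and write them back out to another file.
main :: IO ()
main =
  runConduitRes $
    sourceFile "ontime.csv"      -- src  (like `cat ontime.csv`)
    .| decodeUtf8C               -- c    (bytes -> text)
    .| takeCE 200                -- c    (first 200 characters)
    .| encodeUtf8C               -- c    (text -> bytes)
    .| sinkFile "sample.txt"     -- sink (like `> sample.txt`)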

The steps in the flights pipeline are:

  1. Make an HTTP request and stream the response body.
  2. Extract the ontime.csv entry from the zip stream.
  3. Parse the CSV into records with a fl_date field of type Day and a tail_num field of type String, downcasing the header names.
  4. Drop the first 10 rows, take the next 5, and print them as a table.

In this library the naming convention for parts of the pipeline is: names ending in Source produce values, names ending in Conduit transform the stream, and names ending in Sink consume it.
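
To make that convention concrete, here is a hedged illustration of the kinds of types it implies, written in conduit vocabulary. These signatures are my own guesses for illustration, not the package's actual exports:

-- Illustrative (guessed) shapes only; the real labels-explore types may differ.
fileSource  :: MonadResource m => FilePath -> ConduitT () ByteString m ()  -- produces a stream
dropConduit :: Monad m => Int -> ConduitT row row m ()                     -- transforms a stream
tableSink   :: MonadIO m => ConduitT row Void m ()                         -- consumes the stream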

Haskell: good parts

What's good about the Haskell version: it's about as concise as the Python version, it streams the data in constant memory, and it's statically typed.

How is it statically typed? Here:

fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    csv

We've statically told fromCsvConduit the exact type of record to construct: a record of two fields fl_date and tail_num with types Day and String. Below, we'll look at accessing those fields in an algorithm and demonstrate the safety aspect of this.
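
To make the shape of that record concrete, here is a small standalone sketch. It assumes the underlying labels package exposes a Labels module providing the (:=), get and #field machinery used above; the values are made up purely for illustration:

{-# LANGUAGE OverloadedLabels, TypeOperators, DataKinds #-}

import Labels                    -- (:=), get, set, modify, #field labels (assumed module name)
import Data.Time (Day, fromGregorian)

-- A value of exactly the record type we asked fromCsvConduit to build:
row :: ("fl_date" := Day, "tail_num" := String)
row = (#fl_date := fromGregorian 2014 1 1, #tail_num := "N002AA")

-- Field access is statically checked: get #tail_num compiles only because
-- the record type really contains a "tail_num" field of type String.
main :: IO ()
main = putStrLn (show (get #fl_date row) ++ " " ++ get #tail_num row)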

Swapping out pipeline parts

We can also easily switch to reading from a file. First, let's write the CSV from that URL to disk, uncompressed:

main =
  runResourceT
    (httpSource "https://chrisdone.com/ontime.csv.zip" responseBody .|
     zipEntryConduit "ontime.csv" .>
     fileSink "ontime.csv")

Now our reading becomes:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    (set #downcase True csv) .|
  dropConduit 10 .|
  takeConduit 5 .>
  tableSink

Data crunching

It's easy to perform more detailed calculations. For example, to display the total number of flights and the total distance travelled, we can write:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit @("distance" := Double) (set #downcase True csv) .|
  sinkConduit
    (foldSink
       (\table row ->
          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
       (#flights := (0 :: Int), #distance := 0)) .>
  tableSink

The output is:

flights  distance
471949   372072490.0

Above, we made our own sink that consumes all the rows and then yields the result downstream to the table sink, so that we get the nice table display at the end.
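
As a mental model for that fold-then-yield step, here is a hedged sketch of the same idea written directly against the conduit library; this is my approximation, not the labels-explore implementation:

import Conduit

-- Consume every upstream row into an accumulator, then emit the single
-- summary value downstream so a later sink (such as a table printer) can
-- display it.
foldThenYield :: Monad m => (acc -> row -> acc) -> acc -> ConduitT row acc m ()
foldThenYield step z = do
  acc <- foldlC step z
  yield acc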

Type correctness

Returning to our safety point, imagine we had made some mistakes above.

First mistake: suppose I wrote modify #flights twice by accident:

-          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
+          modify #flights (+ 1) (modify #flights (+ get #distance row) table))

Before the program can even run, the Haskell type checker raises the following message:

• Couldn't match type ‘Int’ with ‘Double’
  arising from a functional dependency between:
  constraint ‘Has "flights" Double ("flights" := Int, "distance" := value0)’
  arising from a use of ‘modify’

See below for where this information comes from in the code:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  --
  --              The distance field is actually a double
  --                             ↓
  --
  fromCsvConduit @("distance" := Double) (set #downcase True csv) .|
  sinkConduit
    (foldSink
       (\table row ->
          modify #flights (+ 1) (modify #flights (+ get #distance row) table))
  --
  -- But we're trying to modify `#flights`, which is an `Int`.
  --                      ↓
  --
       (#flights := (0 :: Int), #distance := 0)) .>
  tableSink

Likewise, if we misspelled #distance as #distant in our algorithm:

-          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
+          modify #flights (+ 1) (modify #distance (+ get #distant row) table))

We would get this error message:

No instance for (Has "distant" value0 ("distance" := Double))
arising from a use of ‘get’
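
Both errors are driven by a single type class. Here is a hedged sketch of roughly what that class looks like; the labels package's real definition may differ in details (for instance in how the label proxy is passed), so treat this as illustration only:

{-# LANGUAGE DataKinds, KindSignatures, FunctionalDependencies #-}

import GHC.TypeLits (Symbol)
import Data.Proxy (Proxy)

-- A record "has" a field when there is an instance of this class for it.
class Has (label :: Symbol) value record | label record -> value where
  get :: Proxy label -> record -> value
  set :: Proxy label -> value -> record -> record

The functional dependency label record -> value is what produces the first error: given the record type and the label #flights, the field's type is already fixed to Int, so adding a Double to it can't type-check. The second error is simply the absence of any Has "distant" instance for the record.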

Summarizing: treating a field as the wrong type is caught as a type mismatch, and misspelling a field name is caught as a missing Has instance, both before the program ever runs.

All this adds up to more maintainable software, and yet we didn't have to state any more than necessary!

Grouping

If instead we'd like to group by a field, in pandas it's like this:

first = df.groupby('airline_id')[['fl_date', 'unique_carrier']].first()
first.head()

In our version, we simply update the type annotation to include the additional fields we want to parse:

csv :: Csv ("fl_date" := Day, "tail_num" := String
           ,"airline_id" := Int, "unique_carrier" := String)

And then our pipeline instead becomes:

fromCsvConduit
  @("fl_date" := Day, "tail_num" := String,
    "airline_id" := Int, "unique_carrier" := String)
  (set #downcase True csv) .|
groupConduit #airline_id .|
explodeConduit .|
projectConduit @("fl_date" := _, "unique_carrier" := _) .|
takeConduit 5 .>
tableSink

Output:

unique_carrier  fl_date
AA              2014-01-01
AA              2014-01-01
EV              2014-01-01
EV              2014-01-01
EV              2014-01-01

The Python blog post states that a further query upon that result,

first.ix[10:15, ['fl_date', 'tail_num']]

yields an unexpectedly empty data frame, due to pandas' strange indexing behaviour. Ours works out fine: we just drop 10 elements from the input stream and project tail_num instead:

dropConduit 10 .|
projectConduit @("fl_date" := _, "tail_num" := _) .|
takeConduit 5 .>
tableSink

And we get:

fl_date     tail_num
2014-01-01  N002AA
2014-01-01  N3FXAA
2014-01-01  N906EV
2014-01-01  N903EV
2014-01-01  N903EV
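
For completeness, here is a sketch of how those fragments might be assembled into a full program, using only functions already shown in this post; the exact program in the repository may differ:

main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String,
      "airline_id" := Int, "unique_carrier" := String)
    (set #downcase True csv) .|
  groupConduit #airline_id .|
  explodeConduit .|
  dropConduit 10 .|
  projectConduit @("fl_date" := _, "tail_num" := _) .|
  takeConduit 5 .>
  tableSink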

Conclusion

In this post we've demonstrated:

  1. Concise handling of a chain of problems, as smoothly as a bash script.
  2. Doing all of the above in constant memory.
  3. Doing so with a type-safe parser, specifying our types statically, without having to declare or name any record type ahead of time.

This has been a demonstration, not a finished product. Haskell still needs work in this area, and the examples in this post are not yet performant (though they could be made so), but such work would be very fruitful.

Are the advantages of using Haskell something you're interested in? If so, contact us at FP Complete.
