In data mining or general exploration, it’s common to need quick, low-ceremony access to data. Typically, either a programming language is designed specifically for this use case, like R, or a library is written for it, like pandas for Python.
By implementing this in Haskell, we can improve on this area and gain all the benefits that come with using Haskell over Python or R, such as:
Let’s look at an example of doing this in Haskell, and compare with how this is done in Python’s pandas. The steps are:
In Haskell we have all the libraries needed (streaming HTTP, CSV parsing, etc.) to achieve this goal, so specifically for this post I’ve made a wrapper package that brings them together like pandas does. We have some goals:
This example code was taken from Modern Pandas.
In Python we request the web URL in chunks, which we then write to a file. Next, we unzip the file, and then the data is available as df, with column names downcased.
import zipfile

import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stream the zipped CSV to disk in 1 KB chunks.
r = requests.get('https://chrisdone.com/ontime.csv.zip', stream=True)
with open("flights.csv.zip", 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)

# Extract the first archive member and load it, lower-casing the column names.
zf = zipfile.ZipFile("flights.csv.zip")
filename = zf.filelist[0].filename
fp = zf.extract(filename)
df = pd.read_csv(fp, parse_dates=["FL_DATE"]).rename(columns=str.lower)
Finally, we can look at the 5 rows starting at row 10, for the columns fl_date and tail_num, like this:
df.ix[10:14, ['fl_date', 'tail_num']]
=>
       fl_date tail_num
10  2014-01-01   N002AA
11  2014-01-01   N3FXAA
12  2014-01-01   N906EV
13  2014-01-01   N903EV
14  2014-01-01   N903EV
Good parts of the Python code:
parse_dates).

Bad parts of the Python code:
fl_date and tail_num, we can’t be certain down the line if they still exist, or are of the right type.

Let’s compare with the solution I prepared in Haskell. While reading, you can also clone the repository that I put together:
$ git clone git@github.com:chrisdone/labels.git --recursive
The wrapper library created for this post is under labels-explore, and all the code samples are under labels-explore/app/Main.hs.
I prepared the module Labels.Explore, which provides us with some data manipulation functionality: web requests, unzipping, CSV parsing, etc.
{-# LANGUAGE TypeApplications, OverloadedStrings, OverloadedLabels,
    TypeOperators, DataKinds, FlexibleContexts #-}

import Labels.Explore

main =
  runResourceT $
  httpSource "https://chrisdone.com/ontime.csv.zip" responseBody .|
  zipEntryConduit "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    (set #downcase True csv) .|
  dropConduit 10 .|
  takeConduit 5 .>
  tableSink
Output:
fl_date     tail_num
2014-01-01  N002AA
2014-01-01  N3FXAA
2014-01-01  N906EV
2014-01-01  N903EV
2014-01-01  N903EV
Breaking this down, the src .| c .| c .> sink can be read like a UNIX pipe src | c | c > sink.
The steps are:
("fl_date" := Day, "tail_num" := String).
downcase option so we can deal with lower-case names.

In this library the naming convention for parts of the pipeline is:
What’s good about the Haskell version:
fl_date as a number, for example, or mistakenly write fl_daet, I’ll get a compile error before ever running the program.

How is it statically typed? Here:
fromCsvConduit @("fl_date" := Day, "tail_num" := String) csv
We’ve statically told fromCsvConduit the exact type of record to construct: a record of two fields fl_date and tail_num with types Day and String. Below, we’ll look at accessing those fields in an algorithm and demonstrate the safety aspect of this.
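To get a feel for how a type application can steer parsing, here is a minimal sketch that uses only base. It is not how fromCsvConduit is implemented: readMaybe from Text.Read merely stands in for a field parser, and the cell values and the names cells, asInts and asDoubles are made up for illustration.

```haskell
{-# LANGUAGE TypeApplications #-}

import Text.Read (readMaybe)

-- Illustrative cell values, not taken from the real dataset.
cells :: [String]
cells = ["1947", "N002AA"]

-- The type application selects which parser is used for each cell.
asInts :: [Maybe Int]
asInts = map (readMaybe @Int) cells

-- The same text, parsed at a different type.
asDoubles :: [Maybe Double]
asDoubles = map (readMaybe @Double) cells

main :: IO ()
main = do
  print asInts     -- [Just 1947,Nothing]
  print asDoubles  -- [Just 1947.0,Nothing]
```

The point of the sketch is only that the target type is chosen at the call site with @, which is the same mechanism the record type above relies on.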
We can also easily switch to reading from a file. Let’s write the contents of that URL to disk, uncompressed:
main =
  runResourceT
    (httpSource "https://chrisdone.com/ontime.csv.zip" responseBody .|
     zipEntryConduit "ontime.csv" .>
     fileSink "ontime.csv")
Now our reading becomes:
main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit
    @("fl_date" := Day, "tail_num" := String)
    (set #downcase True csv) .|
  dropConduit 10 .|
  takeConduit 5 .>
  tableSink
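The drop-then-take windowing can be tried in isolation with the conduit package’s own combinators. This is a sketch under the assumption that dropConduit and takeConduit behave like conduit’s dropC and takeC; the wrapper’s source is not shown in this post, and the integer stream below is a stand-in for parsed rows.

```haskell
import Conduit

-- Stand-in for parsed CSV rows: just the row numbers 0..20.
window :: [Int]
window =
  runConduitPure $
    yieldMany [0 .. 20]
      .| (dropC 10 >> takeC 5) -- skip ten rows, then pass five through
      .| sinkList

main :: IO ()
main = print window -- [10,11,12,13,14]
```

Because the stream is consumed incrementally, rows after the fifteenth are never demanded from the source.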
It’s easy to perform more detailed calculations. For example, to display the total number of flights, and the total distance that would be travelled, we can write:
main =
  runResourceT $
  fileSource "ontime.csv" .|
  fromCsvConduit @("distance" := Double) (set #downcase True csv) .|
  sinkConduit
    (foldSink
       (\table row ->
          modify #flights (+ 1) (modify #distance (+ get #distance row) table))
       (#flights := (0 :: Int), #distance := 0)) .>
  tableSink
The output is:
flights  distance
471949   372072490.0
Above we made our own sink, which consumes all the rows and then yields the result downstream to the table sink, so that we get the nice table display at the end.
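The fold itself does not depend on the streaming machinery. As a rough sketch, the same accumulation can be written over a plain list with an ordinary record; Totals, step and the three distances are hypothetical and only illustrate the shape of the computation.

```haskell
import Data.List (foldl')

-- Running totals, playing the role of the (#flights, #distance) record.
data Totals = Totals
  { flights  :: !Int
  , distance :: !Double
  } deriving (Show, Eq)

-- One fold step: count the row and add its distance.
step :: Totals -> Double -> Totals
step (Totals n d) rowDistance = Totals (n + 1) (d + rowDistance)

-- Made-up distances standing in for the parsed rows.
totals :: Totals
totals = foldl' step (Totals 0 0) [100, 250.5, 49.5]

main :: IO ()
main = print totals -- Totals {flights = 3, distance = 400.0}
```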
Returning to our safety point, imagine above we made some mistakes.
First mistake, I wrote modify #flights twice by accident:
- modify #flights (+ 1) (modify #distance (+ get #distance row) table))
+ modify #flights (+ 1) (modify #flights (+ get #distance row) table))
Before running the program, the following message would be raised by the Haskell type checker:
• Couldn't match type ‘Int’ with ‘Double’
arising from a functional dependency between:
constraint ‘Has "flights" Double ("flights" := Int, "distance" := value0)’
arising from a use of ‘modify’
See below for where this information comes from in the code:
main =
  runResourceT $
  fileSource "ontime.csv" .|
  --
  -- The distance field is actually a double
  --                               ↓
  --
  fromCsvConduit @("distance" := Double) (set #downcase True csv) .|
  sinkConduit
    (foldSink
       (\table row ->
          modify #flights (+ 1) (modify #flights (+ get #distance row) table))
       --
       -- But we're trying to modify `#flights`, which is an `Int`.
       --                  ↓
       --
       (#flights := (0 :: Int), #distance := 0)) .>
  tableSink
Likewise, if we misspelled #distance as #distant, in our algorithm:
- modify #flights (+ 1) (modify #distance (+ get #distance row) table))
+ modify #flights (+ 1) (modify #distance (+ get #distant row) table))
We would get this error message:
No instance for (Has "distant" value0 ("distance" := Double))
arising from a use of ‘get’
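To see where a “No instance” error of this kind can come from, here is a deliberately tiny sketch of a label-indexed field with a Has class and a functional dependency. It is an assumption-laden simplification, not the labels library’s actual definitions: Field, this Has class and row are hypothetical, and it handles only a one-field record, using an explicit type application instead of the #label syntax.

```haskell
{-# LANGUAGE AllowAmbiguousTypes    #-}
{-# LANGUAGE DataKinds              #-}
{-# LANGUAGE FlexibleInstances      #-}
{-# LANGUAGE FunctionalDependencies #-}
{-# LANGUAGE KindSignatures         #-}
{-# LANGUAGE TypeApplications       #-}

import GHC.TypeLits (Symbol)

-- A value tagged with its field name at the type level.
newtype Field (label :: Symbol) a = Field a

-- The label and the record type together determine the field's type.
class Has (label :: Symbol) a r | label r -> a where
  get :: r -> a

-- A one-field record has exactly the field it is labelled with.
instance Has label a (Field label a) where
  get (Field x) = x

row :: Field "distance" Double
row = Field 372.5

total :: Double
total = get @"distance" row

-- Uncommenting the next line makes compilation fail, because no
-- instance of Has exists for the label "distant" on this record:
-- oops = get @"distant" row

main :: IO ()
main = print total -- 372.5
```

The misspelling is caught because instance resolution is driven by the label in the type, so there is simply nothing to select at compile time.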
Summarizing:
All this adds up to more maintainable software, and yet we didn’t have to state any more than necessary!
If instead we’d like to group by a field, in pandas it’s like this:
first = df.groupby('airline_id')[['fl_date', 'unique_carrier']].first()
first.head()
We simply update the code with the type, putting the additional fields we want to parse:
csv :: Csv ("fl_date" := Day, "tail_num" := String, "airline_id" := Int, "unique_carrier" := String)
And then our pipeline instead becomes:
fromCsvConduit
  @("fl_date" := Day, "tail_num" := String, "airline_id" := Int, "unique_carrier" := String)
  (set #downcase True csv) .|
groupConduit #airline_id .|
explodeConduit .|
projectConduit @("fl_date" := _, "unique_carrier" := _) .|
takeConduit 5 .>
tableSink
groupConduit groups the stream by the #airline_id field into a stream of lists of rows. That groups the stream [x,y,z,a,b,c] into e.g. [[x,y],[z,a],[b,c]].
explodeConduit explodes the stream of groups [[x,y],[z,a],[b,c],...] into a stream of each group’s parts: [x,y,z,a,b,c,...].
projectConduit projects each row down to the fields fl_date and unique_carrier. The types are to be left as-is, so we use _ to mean “you know what I mean”. This is like SELECT fl_date, unique_carrier in SQL.

Output:
unique_carrier  fl_date
AA              2014-01-01
AA              2014-01-01
EV              2014-01-01
EV              2014-01-01
EV              2014-01-01
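The group, explode and project steps can be mimicked on plain lists with base alone. This is only a sketch: the (airline_id, unique_carrier) pairs are made up, and Data.List.groupBy groups adjacent elements only, which may or may not match how groupConduit treats unsorted input.

```haskell
import Data.Function (on)
import Data.List (groupBy)

-- Made-up (airline_id, unique_carrier) rows, already ordered by airline_id.
rows :: [(Int, String)]
rows = [(1, "AA"), (1, "AA"), (2, "EV"), (2, "EV"), (2, "EV")]

-- Group adjacent rows that share a key: [x,y,z,...] -> [[x,y],[z,...]].
grouped :: [[(Int, String)]]
grouped = groupBy ((==) `on` fst) rows

-- Flatten the groups back into a single stream of rows.
exploded :: [(Int, String)]
exploded = concat grouped

-- Keep only the carrier column, like SELECT unique_carrier in SQL.
projected :: [String]
projected = map snd exploded

main :: IO ()
main = do
  print (map length grouped) -- [2,3]
  print projected            -- ["AA","AA","EV","EV","EV"]
```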
The Python blog post states that a further query upon that result,
first.ix[10:15, ['fl_date', 'tail_num']]
yields an unexpected empty data frame, due to strange indexing behaviour of pandas. But ours works out fine: we just drop 10 elements from the input stream and project tail_num instead:
dropConduit 10 .|
projectConduit @("fl_date" := _, "tail_num" := _) .|
takeConduit 5 .>
tableSink
And we get:
fl_date     tail_num
2014-01-01  N002AA
2014-01-01  N3FXAA
2014-01-01  N906EV
2014-01-01  N903EV
2014-01-01  N903EV
In this post we’ve demonstrated:
This has been a demonstration, not a finished product. Haskell needs work in this area, and the examples in this post are not performant (though they could be), but such work would be very fruitful.
Are the advantages of using Haskell something you’re interested in? If so, contact us at FP Complete.