The cryptonite library is the de facto standard in Haskell today for cryptography. It provides support for secure random number generation, shared key and public key encryption, message authentication codes (MACs), and (for our purposes today) cryptographic hashes.
For those unfamiliar: a hash function is a function that maps data of arbitrary size to a fixed size. A cryptographic hash function is a hash function with properties suitable for cryptography (see the Wikipedia article for more details). A common example of cryptographic hash usage is providing a checksum on a file download to ensure it has not been tampered with. Common algorithms in use today include SHA256, Skein512, and (the slightly outdated) MD5.
The cryptonite library is built on top of the memory library, which provides type classes and convenience functions for dealing with reading and creating byte arrays. You may initially think “shouldn’t it all be ByteStrings?” We’ll get to why the type classes are so helpful later.
Once you’re familiar with these two libraries, they are straightforward to use. However, seeing how all the pieces fit together is difficult from just the API docs, especially understanding where an explicit type signature will be necessary. This post will give a quick overview of the pieces you’ll want to interact with, using simple, runnable examples. By the end, the goal is that you’ll be able to easily understand the API docs themselves.
The runnable examples below will all use the Stack script interpreter support. Make sure you have Stack installed and then, for each example:
Copy the code into a file called Main.hs
Run stack Main.hs
You’re used to dealing with a number of different string-like things, between strict and lazy bytestrings and text, plus Strings. However, if I asked you to tell me how to represent a strict sequence of bytes, you’d likely refer to Data.ByteString.ByteString. However, as you’ll see through this tutorial, there are multiple data types we’ll want to treat as a sequence of bytes.
The memory library defines two typeclasses to help out with this:
ByteArrayAccess gives you read-only access to the bytes within a data type.
ByteArray gives you read/write access, and is a child class of ByteArrayAccess.
To demonstrate, let’s do a pointless conversion between ByteString and Bytes (which we’ll explain in a second).
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

main :: IO ()
main = do
  B.writeFile "test.txt" "This is a test"
  byteString <- B.readFile "test.txt"
  let bytes :: BA.Bytes
      bytes = BA.convert byteString
  print bytes
We’re starting off with some file I/O using the bytestring library (because you should really do I/O with bytestring). Then the convert function can turn that into a Bytes value.
EXERCISE What do you think the type signature of convert is, given the description of the two typeclasses above? You can check if you’re right.
Did you notice that explicit type signature I put on the bytes value? Well, that’s your next lesson with memory and cryptonite: since so many functions work on type classes instead of concrete types, you’ll often end up needing to give GHC some assistance on type inference.
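To make the split between the two classes a little more concrete, here is a minimal sketch. The helper names byteCount and firstFour are made up for this illustration; BA.length and BA.take are functions from Data.ByteArray. Reading a length only needs ByteArrayAccess, while building a new value needs the stronger ByteArray constraint.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)

-- Hypothetical helper: reading the length only needs ByteArrayAccess.
byteCount :: BA.ByteArrayAccess ba => ba -> Int
byteCount = BA.length

-- Hypothetical helper: building a new value needs ByteArray.
firstFour :: BA.ByteArray ba => ba -> ba
firstFour = BA.take 4

main :: IO ()
main = do
  let bs = "This is a test" :: ByteString
      bytes = BA.convert bs :: BA.Bytes
  print (byteCount bs)    -- works on a ByteString
  print (byteCount bytes) -- and on Bytes
  print (firstFour bs)    -- the result type follows the argument type
  -- Here the (==) against a ByteString tells GHC what convert should produce.
  print (BA.convert (firstFour bytes) == firstFour bs)
```

Note that the last line needs no annotation: the comparison with a ByteString is enough for GHC to pick the output type of convert.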
I could show you an example of a data type which is a ByteArrayAccess but not a ByteArray, but it will ring hollow right now. When we get to actual hashing, the distinction in type classes will make a lot more sense. So let’s just wait.
You may be legitimately wondering why there’s a Bytes datatype in memory, when it seems identical to ByteString. In fact: it’s not. Bytes has less memory overhead, which it gets by not tracking the offset and length of its slice. In exchange for that: a Bytes value doesn’t allow for any slicing. In other words, the drop function on a Bytes would have to create a new copy of the buffer.
In other words: this is all performance stuff. And a library dealing with cryptography generally needs to be more concerned with performance.
Another interesting data type in memory is ScrubbedBytes. This type has three special properties (as called out in its Haddocks):
It is scrubbed after it goes out of scope
A Show instance that doesn’t actually show any content
An Eq instance that is constant time
In other words: it automatically prevents a number of common security holes when dealing with sensitive data.
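As a rough illustration of those properties, the sketch below copies a ByteString into a ScrubbedBytes and exercises its Show and Eq instances. The helper name toSecret and the sample password are made up for this example. Keep in mind that the ByteString literal it starts from is still an ordinary, unscrubbed value, so real code would want to read secrets directly into a ScrubbedBytes.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)

-- Hypothetical helper: copy a ByteString into a ScrubbedBytes buffer.
toSecret :: ByteString -> BA.ScrubbedBytes
toSecret = BA.convert

main :: IO ()
main = do
  let password = toSecret "hunter2"
      attempt = toSecret "hunter2"
  -- The Show instance prints a placeholder rather than the bytes.
  print password
  -- (==) on ScrubbedBytes uses the constant-time comparison.
  print (password == attempt)
  -- The contents are still readable through ByteArrayAccess.
  print (BA.length password)
```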
OK, not much code to look at here, let’s get to more fun stuff!
Let’s convert some user input into base 16 (aka hexadecimal):
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import qualified Data.ByteArray as BA
import Data.ByteArray.Encoding (convertToBase, Base (Base16))
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
  putStr "Enter some text: "
  hFlush stdout
  text <- TIO.getLine
  let bs = encodeUtf8 text
  putStrLn $ "You entered: " ++ show bs
  let encoded = convertToBase Base16 bs :: ByteString
  putStrLn $ "Converted to base 16: " ++ show encoded
The convertToBase function will convert the contents of any ByteArrayAccess into a ByteArray using the given base. Other options here besides Base16 include Base64 and others (just check out the docs).
As you can see, I had to put in an explicit ByteString type signature, since otherwise GHC wouldn’t know which instance of ByteArray to use for the output.
As you may guess, there is also a convertFromBase to do the opposite conversion. It returns an Either String byteArray value in case the input is not in the correct format.
EXERCISE Write a program to base 16 decode its input. (Solution follows.)
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import qualified Data.ByteArray as BA
import Data.ByteArray.Encoding (convertFromBase, Base (Base16))
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
  putStr "Enter some hexadecimal text: "
  hFlush stdout
  text <- TIO.getLine
  let bs = encodeUtf8 text
  putStrLn $ "You entered: " ++ show bs
  case convertFromBase Base16 bs of
    Left e -> error $ "Invalid input: " ++ e
    Right decoded ->
      putStrLn $ "Converted from base 16: " ++ show (decoded :: ByteString)
EXERCISE Write a program to convert input from base 16 to base 64 encoding.
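If you get stuck on that exercise, here is one possible approach, offered as a sketch rather than the definitive answer. The helper name hexToBase64 is made up for this example, and it reads its input with Data.ByteString.Char8 instead of going through Text, purely to keep things short.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Data.ByteArray.Encoding (convertFromBase, convertToBase, Base (Base16, Base64))
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B8

-- Hypothetical helper: decode base 16, then re-encode as base 64.
-- The annotation on raw tells GHC which ByteArray to use in between.
hexToBase64 :: ByteString -> Either String ByteString
hexToBase64 hex = do
  raw <- convertFromBase Base16 hex
  return (convertToBase Base64 (raw :: ByteString))

main :: IO ()
main = do
  hex <- B8.getLine
  case hexToBase64 hex of
    Left e -> putStrLn ("Invalid input: " ++ e)
    Right b64 -> B8.putStrLn b64
```

For example, the input 48656c6c6f (the bytes of "Hello") should come back as SGVsbG8=.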
Alright, that’s enough of the memory library. Time to do some real crypto stuff. We’re going to get the SHA256 hash (aka digest) of some user input:
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash (hash, SHA256 (..), Digest)
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
  putStr "Enter some text: "
  hFlush stdout
  text <- TIO.getLine
  let bs = encodeUtf8 text
  putStrLn $ "You entered: " ++ show bs
  let digest :: Digest SHA256
      digest = hash bs
  putStrLn $ "SHA256 hash: " ++ show digest
We’ve used the hash function to convert a ByteString (or any instance of ByteArrayAccess) into a Digest SHA256. If you’re already wondering: yes, you could replace SHA256 with one of the other hash algorithms available.
As before: it’s important to use a type signature of Digest SHA256 to let GHC know what kind of hash you want to perform. However, in this case, there’s an alternative function you could choose instead:
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash (hashWith, SHA256 (..))
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
  putStr "Enter some text: "
  hFlush stdout
  text <- TIO.getLine
  let bs = encodeUtf8 text
  putStrLn $ "You entered: " ++ show bs
  let digest = hashWith SHA256 bs
  putStrLn $ "SHA256 hash: " ++ show digest
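Because hashWith takes the algorithm as an ordinary value, swapping algorithms is just a matter of passing a different constructor. The following sketch hashes the same input with three algorithms exported by Crypto.Hash. The helper name hexDigest is made up for this example.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import Crypto.Hash (hashWith, HashAlgorithm, MD5 (..), SHA256 (..), SHA512 (..))
import Data.ByteString (ByteString)

-- Hypothetical helper: hash with whichever algorithm is passed in
-- and render the digest as a String via its Show instance.
hexDigest :: HashAlgorithm alg => alg -> ByteString -> String
hexDigest alg = show . hashWith alg

main :: IO ()
main = do
  let input = "abc" :: ByteString
  putStrLn $ "MD5:    " ++ hexDigest MD5 input
  putStrLn $ "SHA256: " ++ hexDigest SHA256 input
  putStrLn $ "SHA512: " ++ hexDigest SHA512 input
```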
The Show instance of Digest will display the digest in hexadecimal/base 16. That’s pretty nice. But let’s suppose we want to display it in base 64 instead. Get ready for this: Digest is an instance of ByteArrayAccess, so you can use convertToBase. (It’s not an instance of ByteArray, though. Consider why such an instance would be problematic. If you’re stumped: read the docs for this function for the answer.)
EXERCISE Display the digest as a base 64 encoded string (solution follows).
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash (hashWith, SHA256 (..))
import Data.ByteString (ByteString)
import Data.ByteArray.Encoding (convertToBase, Base (Base64))
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
  putStr "Enter some text: "
  hFlush stdout
  text <- TIO.getLine
  let bs = encodeUtf8 text
  putStrLn $ "You entered: " ++ show bs
  let digest = convertToBase Base64 (hashWith SHA256 bs)
  putStrLn $ "SHA256 hash: " ++ show (digest :: ByteString)
Notice how we needed the type signature on digest to make it clear that it’s a ByteString.
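Since Digest only offers read access, getting raw bytes out is easy, but going the other way needs a function that is allowed to fail. The sketch below (which gives away part of the puzzle above, so skip it if you want to work that out yourself) uses BA.convert to extract the bytes and digestFromByteString from Crypto.Hash to parse them back. The helper names rawDigest and parseDigest are made up for this example.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import Crypto.Hash (Digest, SHA256 (..), digestFromByteString, hashWith)
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)

-- Hypothetical helper: the raw (not hex-encoded) bytes of a digest.
rawDigest :: Digest SHA256 -> ByteString
rawDigest = BA.convert

-- Hypothetical helper: try to read raw bytes back as a SHA256 digest.
parseDigest :: ByteString -> Maybe (Digest SHA256)
parseDigest = digestFromByteString

main :: IO ()
main = do
  let digest = hashWith SHA256 ("abc" :: ByteString)
      raw = rawDigest digest
  print (BA.length digest)               -- size in bytes, via ByteArrayAccess
  print (parseDigest raw == Just digest) -- round trip through raw bytes
  print (parseDigest "too short")        -- input of the wrong size is rejected
```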
Here’s a neat little program. The user will provide a number of files as command line arguments. Then we’ll print out lists of all the files with identical content (or, at least, matching SHA256s). (Try to notice something memory-inefficient in this implementation; we’ll address it later.)
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash (Digest, SHA256, hash)
import qualified Data.ByteString as B
import Data.Foldable (forM_)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)

readFile' :: FilePath -> IO (Map (Digest SHA256) [FilePath])
readFile' fp = do
  bs <- B.readFile fp
  let digest = hash bs -- notice lack of type signature :)
  return $ Map.singleton digest [fp]

main :: IO ()
main = do
  args <- getArgs
  m <- Map.unionsWith (++) <$> mapM readFile' args
  forM_ (Map.toList m) $ \(digest, files) ->
    case files of
      [] -> error "can never happen"
      [_] -> return () -- only one file
      _ -> putStrLn $ show digest ++ ": " ++ unwords (map show files)
EXERCISE Write a program that will print out the SHA256 for every file name passed in on the command line.
QUESTION What’s the inefficiency in the code above? You’ll see in the next section.
If we tried implementing our program from above without hashing, we’d either have to hold the entire file contents of each file in memory at once, or do some weird O(n^2) pairwise comparisons. Our hash-based implementation is better. But it’s still a problem: it uses Data.ByteString.readFile, causing possibly unbounded memory usage. There’s a more efficient way to hash entire files, using cryptonite-conduit:
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash (Digest, SHA256)
import Crypto.Hash.Conduit (hashFile)
import Data.Foldable (forM_)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)

readFile' :: FilePath -> IO (Map (Digest SHA256) [FilePath])
readFile' fp = do
  digest <- hashFile fp
  return $ Map.singleton digest [fp]

main :: IO ()
main = do
  args <- getArgs
  m <- Map.unionsWith (++) <$> mapM readFile' args
  forM_ (Map.toList m) $ \(digest, files) ->
    case files of
      [] -> error "can never happen"
      [_] -> return () -- only one file
      _ -> putStrLn $ show digest ++ ": " ++ unwords (map show files)
Pretty simple change (in fact, I’d argue this code is just slightly easier to read), and we get far better memory performance (linear in the number of files being compared, constant in the size of those files).
Perhaps your ears (or eyes? you’re probably reading this) perked up at the mention of conduit. To answer the question I’m going to pretend you’re asking: yes, you can do streaming computation of a hash. Here’s a program that will take a URL and file path, write the contents of the URL’s response body to a file path, and print out the SHA256 digest. And the cool part: it will only look at each chunk of data once.
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Conduit
import Crypto.Hash (Digest, SHA256, hash)
import Crypto.Hash.Conduit (sinkHash)
import Network.HTTP.Simple
import System.Environment (getArgs)

main :: IO ()
main = do
  args <- getArgs
  (url, fp) <-
    case args of
      [x, y] -> return (x, y)
      _ -> error "Expected: URL FILEPATH"
  req <- parseRequest url
  -- ZipSink feeds each chunk to both sinks: one writes the file, one hashes.
  digest <- runResourceT $ httpSink req $ \_res -> getZipSink $
    ZipSink (sinkFile fp) *> ZipSink sinkHash
  print (digest :: Digest SHA256)
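If you want to play with sinkHash without involving the network, a smaller sketch along these lines may help. It streams an in-memory list of chunks through sinkHash and checks that the result matches hashing the whole input in one go. The helper name hashChunks is made up for this example, and it assumes runConduitPure and yieldMany are available from the Conduit module in your snapshot.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Crypto.Hash (Digest, SHA256, hash)
import Crypto.Hash.Conduit (sinkHash)
import Data.ByteString (ByteString)

-- Hypothetical helper: stream a list of chunks into sinkHash.
hashChunks :: [ByteString] -> Digest SHA256
hashChunks chunks = runConduitPure $ yieldMany chunks .| sinkHash

main :: IO ()
main = do
  let streamed = hashChunks ["This is ", "a ", "test"]
      whole = hash ("This is a test" :: ByteString) :: Digest SHA256
  print streamed
  -- How the input is split into chunks does not affect the digest.
  print (streamed == whole)
```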
Of course, if conduit can do it, you can do it too. Let’s write a hashFile implementation ourselves without conduit to get some exposure to the raw hashing API:
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash
import System.Environment (getArgs)
import System.IO (withBinaryFile, IOMode (ReadMode))
import Data.Foldable (forM_)
import qualified Data.ByteString as B

hashFile :: HashAlgorithm ha => FilePath -> IO (Digest ha)
hashFile fp = withBinaryFile fp ReadMode $ \h ->
  let loop context = do
        chunk <- B.hGetSome h 4096
        if B.null chunk
          then return $ hashFinalize context
          else loop $! hashUpdate context chunk
   in loop hashInit

main :: IO ()
main = do
  args <- getArgs
  forM_ args $ \fp -> do
    digest <- hashFile fp
    putStrLn $ show (digest :: Digest SHA256) ++ "  " ++ fp
This uses the pure update API provided by Crypto.Hash. We can also use a mutating API in this case, which is slightly more efficient by bypassing some buffer copies:
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash
import Crypto.Hash.IO
import System.Environment (getArgs)
import System.IO (withBinaryFile, IOMode (ReadMode))
import Data.Foldable (forM_)
import qualified Data.ByteString as B

hashFile :: HashAlgorithm ha => FilePath -> IO (Digest ha)
hashFile fp = withBinaryFile fp ReadMode $ \h -> do
  context <- hashMutableInit
  let loop = do
        chunk <- B.hGetSome h 4096
        if B.null chunk
          then hashMutableFinalize context
          else do
            hashMutableUpdate context chunk
            loop
  loop

main :: IO ()
main = do
  args <- getArgs
  forM_ args $ \fp -> do
    digest <- hashFile fp
    putStrLn $ show (digest :: Digest SHA256) ++ "  " ++ fp
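Both hashFile versions above rely on the same property: feeding the data in chunk by chunk gives the same digest as hashing it all at once. Here is a minimal sketch of that property using the pure API with no file I/O. The helper name hashIncremental is made up for this example.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
{-# LANGUAGE OverloadedStrings #-}
import Crypto.Hash (Digest, SHA256, hash, hashFinalize, hashInit, hashUpdate)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.List (foldl')

-- Hypothetical helper: fold hashUpdate over the chunks, then finalize.
hashIncremental :: [ByteString] -> Digest SHA256
hashIncremental = hashFinalize . foldl' hashUpdate hashInit

main :: IO ()
main = do
  let chunks = ["This is ", "a ", "test"] :: [ByteString]
  print (hashIncremental chunks)
  -- The chunk boundaries do not matter, only the concatenated bytes.
  print (hashIncremental chunks == hash (B.concat chunks))
```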
EXERCISE Use lazy I/O and the hashlazy function to implement hashFile.
(NOTE: I am not condoning lazy I/O here.)
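For comparison, one possible shape for that exercise is sketched below; treat it as an illustration under the usual lazy I/O caveats rather than a recommended approach. It forces the digest with $! before returning, on the assumption that this is enough to consume the whole file while we are still inside hashFile.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.3 script
import Crypto.Hash (Digest, HashAlgorithm, SHA256, hashlazy)
import qualified Data.ByteString.Lazy as BL
import Data.Foldable (forM_)
import System.Environment (getArgs)

-- Lazy I/O sketch: the file is read chunk by chunk as hashlazy demands it.
hashFile :: HashAlgorithm ha => FilePath -> IO (Digest ha)
hashFile fp = do
  lbs <- BL.readFile fp
  -- Force the digest here so the file is read before we return.
  return $! hashlazy lbs

main :: IO ()
main = do
  args <- getArgs
  forM_ args $ \fp -> do
    digest <- hashFile fp
    putStrLn $ show (digest :: Digest SHA256) ++ "  " ++ fp
```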