The cryptonite library is the de facto standard in Haskell today for
cryptography. It provides support for secure random number generation,
shared key and public key encryption, message authentication
codes (MACs), and (for our purposes today) cryptographic hashes.
For those unfamiliar: a hash function is a function that maps
data of arbitrary size to a fixed size. A cryptographic hash
function is a hash function with properties suitable for
cryptography (see the Wikipedia article for more details). A common
example of cryptographic hash usage is providing a checksum on a file
download to ensure it has not been tampered with. Common algorithms in
use today include SHA256, Skein512, and (the slightly outdated) MD5.
The cryptonite library is built on top of the memory library, which
provides type classes and convenience functions for reading and
creating byte arrays. You may initially think "shouldn't it all be
ByteStrings?" We'll get to why the type classes are so helpful later.
Once you're familiar with these two libraries, they are
straightforward to use. However, seeing how all the pieces fit
together is difficult from just the API docs, especially
understanding where an explicit type signature will be necessary.
This post gives a quick overview of the pieces you'll want to
interact with, using simple, runnable examples. By the end, the goal
is that you'll be able to trivially understand the API docs
themselves.
The runnable examples below will all use the Stack script
interpreter support. Make sure you have Stack installed and
then, for each example:
- Copy the contents into a file called Main.hs
- Run stack Main.hs
Basic typeclasses
You're used to dealing with a number of different string-like
things, between strict and lazy bytestrings and text, plus
Strings. However, if I asked you how to represent a strict sequence
of bytes, you'd likely refer to Data.ByteString.ByteString. As you'll
see through this tutorial, though, there are multiple data types
we'll want to treat as a sequence of bytes.
The memory library defines two typeclasses to help out with this:
- ByteArrayAccess gives you read-only access to the bytes within a
  data type.
- ByteArray gives you read/write access, and is a child class of
  ByteArrayAccess.
To demonstrate, let's do a pointless conversion between ByteString
and Bytes (which we'll explain in a second).
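#!/usr/bin/env stack
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

main :: IO ()
main = do
    -- The file contents literal is a ByteString, hence OverloadedStrings
    B.writeFile "test.txt" "This is a test"
    byteString <- B.readFile "test.txt"
    -- convert is polymorphic in its result, so we name the target type
    let bytes :: BA.Bytes
        bytes = BA.convert byteString
    print bytes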
We're starting off with some file I/O using the bytestring
library (because you should really do I/O with bytestring). Then the
convert function can turn that into a Bytes value.
EXERCISE What do you think the type signature of convert is, given
the description of the two typeclasses above? You can check if
you're right.
Did you notice that explicit type signature I put on the bytes
value? Well, that's your next lesson with memory and cryptonite:
since so many functions work on type classes instead of concrete
types, you'll often end up needing to give GHC some assistance on
type inference.
I could show you an example of a data type which is a
ByteArrayAccess but not a ByteArray, but it will ring hollow right
now. When we get to actual hashing, the distinction in type classes
will make a lot more sense. So let's just wait.
Why the different data types?
You may be legitimately wondering why there's a Bytes datatype in
memory, when it seems identical to ByteString. In fact: it's not.
Bytes has less memory overhead, which it gets by not tracking the
offset and length of its slice. In exchange for that: a Bytes value
doesn't allow for any slicing. In other words, the drop function on
a Bytes would have to create a new copy of the buffer.
In other words: this is all performance stuff. And a library
dealing with cryptography generally needs to be more concerned with
performance.
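To make the trade-off a little more concrete, here is a minimal sketch that drops a prefix from both representations. The helper name dropBoth and the sample input are made up for illustration; the calls themselves (Data.ByteString.drop, Data.ByteArray.drop, Data.ByteArray.convert) are the regular library functions.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- Hypothetical helper: drop the first n bytes from both representations.
dropBoth :: Int -> ByteString -> (ByteString, BA.Bytes)
dropBoth n bs =
    ( B.drop n bs                            -- O(1) slice sharing the buffer
    , BA.drop n (BA.convert bs :: BA.Bytes)  -- Bytes has no offset to adjust
    )

main :: IO ()
main = do
    let (sliced, copied) = dropBoth 4 "HEADpayload"
    print sliced
    print copied
    -- The contents agree; only the memory layout behind them differs.
    print (BA.convert copied == sliced)
```

Both results hold the same bytes, so which type to use is purely a question of memory layout and how often you need to slice.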
Another interesting data type in memory is ScrubbedBytes. This type
has three special properties (as called out in its Haddocks):
- Being scrubbed after it goes out of scope
- A Show instance that doesn't actually show any content
- An Eq instance that is constant time
In other words: it automatically prevents a number of common
security holes when dealing with sensitive data.
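As a rough illustration of the last two properties, the sketch below converts a pretend password into ScrubbedBytes and then tries to print and compare it. The helper toSecret and the sample values are hypothetical; constEq is the comparison function exported by Data.ByteArray.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteArray as BA
import Data.ByteString (ByteString)
import Data.List (isInfixOf)

-- Hypothetical helper: copy sensitive bytes into a ScrubbedBytes value.
toSecret :: ByteString -> BA.ScrubbedBytes
toSecret = BA.convert

main :: IO ()
main = do
    let secret = toSecret "hunter2"
        guess  = toSecret "hunter3"
    -- Show prints a placeholder, so accidental logging does not leak.
    putStrLn $ "Logged value: " ++ show secret
    print ("hunter2" `isInfixOf` show secret)
    -- Eq on ScrubbedBytes is documented as constant time.
    print (secret == guess)
    -- constEq compares any two ByteArrayAccess values in constant time.
    print (BA.constEq secret ("hunter2" :: ByteString))
```

Note that the ByteString the secret was converted from is an ordinary value and is not scrubbed, so in real code you would want sensitive data to land in ScrubbedBytes as early as possible.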
OK, not much code to look at here, let's get to more fun
stuff!
Base 16 encoding/decoding
Let's convert some user input into base 16 (aka
hexadecimal):
#!/usr/bin/env stack
import qualified Data.ByteArray as BA
import Data.ByteArray.Encoding (convertToBase, Base (Base16))
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
    putStr "Enter some text: "
    hFlush stdout
    text <- TIO.getLine
    let bs = encodeUtf8 text
    putStrLn $ "You entered: " ++ show bs
    let encoded = convertToBase Base16 bs :: ByteString
    putStrLn $ "Converted to base 16: " ++ show encoded
The convertToBase function will convert the contents of any
ByteArrayAccess into a ByteArray using the given base. Other options
here besides Base16 include Base64 and others (just check out the
docs).
As you can see, I had to put in an explicit ByteString type
signature, since otherwise GHC wouldn't know which instance of
ByteArray to use for the result.
As you may guess, there is also a convertFromBase to do the opposite
conversion. It returns an Either String byteArray value in case the
input is not in the correct format.
EXERCISE Write a program to base 16 decode its
input. (Solution follows.)
#!/usr/bin/env stack
import qualified Data.ByteArray as BA
import Data.ByteArray.Encoding (convertFromBase, Base (Base16))
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
    putStr "Enter some hexadecimal text: "
    hFlush stdout
    text <- TIO.getLine
    let bs = encodeUtf8 text
    putStrLn $ "You entered: " ++ show bs
    case convertFromBase Base16 bs of
        Left e -> error $ "Invalid input: " ++ e
        Right decoded ->
            putStrLn $ "Converted from base 16: " ++ show (decoded :: ByteString)
EXERCISE Write a program to convert input from
base 16 to base 64 encoding.
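If you want to compare notes, here is one possible shape for the core of that exercise, written as a pure function instead of a full interactive program. The name hexToBase64 is hypothetical; it only chains the convertFromBase and convertToBase functions shown above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteArray.Encoding (Base (Base16, Base64), convertFromBase, convertToBase)
import Data.ByteString (ByteString)

-- Hypothetical helper: decode hex, then re-encode the raw bytes as base 64.
-- The intermediate type annotation tells GHC which ByteArray to decode into.
hexToBase64 :: ByteString -> Either String ByteString
hexToBase64 hex = do
    raw <- convertFromBase Base16 hex :: Either String ByteString
    return (convertToBase Base64 raw)

main :: IO ()
main = do
    print (hexToBase64 "48656c6c6f")  -- hex for the ASCII bytes of "Hello"
    print (hexToBase64 "zz")          -- not hex, so this is a Left
```

Wiring this up to user input works the same way as in the decoding solution above: read a line, encodeUtf8 it, and pattern match on the Either.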
Hashing a strict bytestring
Alright, that's enough of the memory library. Time to do some real
crypto stuff. We're going to get the SHA256 hash (aka digest) of
some user input:
#!/usr/bin/env stack
import Crypto.Hash (hash, SHA256 (..), Digest)
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
    putStr "Enter some text: "
    hFlush stdout
    text <- TIO.getLine
    let bs = encodeUtf8 text
    putStrLn $ "You entered: " ++ show bs
    let digest :: Digest SHA256
        digest = hash bs
    putStrLn $ "SHA256 hash: " ++ show digest
We've used the hash function to convert a ByteString (or any
instance of ByteArrayAccess) into a Digest SHA256. If you're already
wondering: yes, you could replace SHA256 with one of the other hash
algorithms available.
As before: it's important to use a type signature of Digest SHA256
to let GHC know what kind of hash you want to perform. However, in
this case, there's an alternative function you could choose instead:
#!/usr/bin/env stack
import Crypto.Hash (hashWith, SHA256 (..))
import Data.ByteString (ByteString)
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
    putStr "Enter some text: "
    hFlush stdout
    text <- TIO.getLine
    let bs = encodeUtf8 text
    putStrLn $ "You entered: " ++ show bs
    let digest = hashWith SHA256 bs
    putStrLn $ "SHA256 hash: " ++ show digest
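Since hashWith takes the algorithm as an ordinary value, swapping algorithms is just a matter of passing a different constructor. The following sketch runs the same input through three of the algorithms exported by Crypto.Hash; the helper name digests is made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Crypto.Hash (MD5 (..), SHA256 (..), SHA512 (..), hashWith)
import Data.ByteString (ByteString)

-- Hypothetical helper: label each digest with the algorithm that produced it.
-- Each digest has a different type (Digest MD5, Digest SHA256, ...), so we
-- render them with show to be able to keep them in a single list.
digests :: ByteString -> [(String, String)]
digests bs =
    [ ("MD5",    show (hashWith MD5 bs))
    , ("SHA256", show (hashWith SHA256 bs))
    , ("SHA512", show (hashWith SHA512 bs))
    ]

main :: IO ()
main = mapM_ (\(name, hex) -> putStrLn (name ++ ": " ++ hex)) (digests "abc")
```

The algorithm is part of the Digest type, which is why the compiler will not let you accidentally compare an MD5 digest against a SHA256 one.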
The Show instance of Digest will display the digest in
hexadecimal/base 16. That's pretty nice. But let's suppose we want
to display it in base 64 instead. Get ready for this: Digest is an
instance of ByteArrayAccess, so you can use convertToBase. (And it's
not an instance of ByteArray; consider why such an instance would be
problematic. If you're stumped: read the docs for this function for
the answer.)
EXERCISE Display the digest as a base 64
encoded string (solution follows).
#!/usr/bin/env stack
import Crypto.Hash (hashWith, SHA256 (..))
import Data.ByteString (ByteString)
import Data.ByteArray.Encoding (convertToBase, Base (Base64))
import Data.Text.Encoding (encodeUtf8)
import qualified Data.Text.IO as TIO
import System.IO (hFlush, stdout)

main :: IO ()
main = do
    putStr "Enter some text: "
    hFlush stdout
    text <- TIO.getLine
    let bs = encodeUtf8 text
    putStrLn $ "You entered: " ++ show bs
    let digest = convertToBase Base64 (hashWith SHA256 bs)
    putStrLn $ "SHA256 hash: " ++ show (digest :: ByteString)
Notice how we needed the type signature on digest to make it clear
that it's a ByteString.
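Going the other direction ties back to the checksum use case from the introduction: given a published hexadecimal checksum, we can turn it back into a typed Digest and compare. This is only a sketch; parseSha256 and matchesChecksum are hypothetical names, built on convertFromBase and the digestFromByteString function from Crypto.Hash.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Crypto.Hash (Digest, SHA256, digestFromByteString, hash)
import Data.ByteArray.Encoding (Base (Base16), convertFromBase)
import Data.ByteString (ByteString)

-- Hypothetical helper: parse a hex checksum into a typed digest.
-- Gives Nothing for invalid hex or a byte length that is wrong for SHA256.
parseSha256 :: ByteString -> Maybe (Digest SHA256)
parseSha256 hex =
    case convertFromBase Base16 hex :: Either String ByteString of
        Left _    -> Nothing
        Right raw -> digestFromByteString raw

-- Hypothetical helper: does the content hash to the published checksum?
matchesChecksum :: ByteString -> ByteString -> Bool
matchesChecksum expectedHex content =
    parseSha256 expectedHex == Just (hash content)

main :: IO ()
main = do
    -- SHA256 of the three bytes "abc"
    let published = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    print (matchesChecksum published "abc")
    print (matchesChecksum published "abd")
    print (matchesChecksum "deadbeef" "abc")  -- valid hex, wrong length
```

Comparing Digest values rather than hex strings sidesteps questions such as upper versus lower case in the published checksum's formatting.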
Check if any files match
Here's a neat little program. The user will provide a number of
files as command line arguments. Then we'll print out lists of all
the files with identical content (or, at least, matching SHA256s).
(Try to notice something memory-inefficient in this implementation;
we'll address it later.)
#!/usr/bin/env stack
import Crypto.Hash (Digest, SHA256, hash)
import qualified Data.ByteString as B
import Data.Foldable (forM_)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)

readFile' :: FilePath -> IO (Map (Digest SHA256) [FilePath])
readFile' fp = do
    bs <- B.readFile fp
    let digest = hash bs
    return $ Map.singleton digest [fp]

main :: IO ()
main = do
    args <- getArgs
    m <- Map.unionsWith (++) <$> mapM readFile' args
    forM_ (Map.toList m) $ \(digest, files) ->
        case files of
            [] -> error "can never happen"
            [_] -> return ()
            _ -> putStrLn $ show digest ++ ": " ++ unwords (map show files)
EXERCISE Write a program that will print out
the SHA256 for every file name passed in on the command line.
QUESTION What's the inefficiency in the code
above? You'll see in the next section.
More efficient file hashing
If we tried implementing our program from above without hashing,
we'd either have to hold the entire contents of each file in memory
at once, or do some weird O(n^2) pairwise comparisons. Our
hash-based implementation is better. But it still has a problem: it
uses Data.ByteString.readFile, which reads each whole file into
memory, causing possibly unbounded memory usage. There's a more
efficient way to hash entire files, using cryptonite-conduit:
#!/usr/bin/env stack
import Crypto.Hash (Digest, SHA256, hash)
import Crypto.Hash.Conduit (hashFile)
import Data.Foldable (forM_)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)

readFile' :: FilePath -> IO (Map (Digest SHA256) [FilePath])
readFile' fp = do
    digest <- hashFile fp
    return $ Map.singleton digest [fp]

main :: IO ()
main = do
    args <- getArgs
    m <- Map.unionsWith (++) <$> mapM readFile' args
    forM_ (Map.toList m) $ \(digest, files) ->
        case files of
            [] -> error "can never happen"
            [_] -> return ()
            _ -> putStrLn $ show digest ++ ": " ++ unwords (map show files)
Pretty simple change (in fact, I'd argue this code is just
slightly easier to read), and we get far better memory performance
(linear in the number of files being compared, constant in the size
of those files).
Streaming hashing
Perhaps your ears (or eyes? you're probably reading this) perked
up at the mention of conduit. To answer the question I'm going to
pretend you're asking: yes, you can do streaming computation of a
hash. Here's a program that will take a URL and file path, write
the contents of the URL's response body to a file path, and print
out the SHA256 digest. And the cool part: it will only look at each
chunk of data once.
#!/usr/bin/env stack
import Conduit
import Crypto.Hash (Digest, SHA256, hash)
import Crypto.Hash.Conduit (sinkHash)
import Network.HTTP.Simple
import System.Environment (getArgs)

main :: IO ()
main = do
    args <- getArgs
    (url, fp) <-
        case args of
            [x, y] -> return (x, y)
            _ -> error $ "Expected: URL FILEPATH"
    req <- parseRequest url
    digest <- runResourceT $ httpSink req $ \_res -> getZipSink $
        ZipSink (sinkFile fp) *>
        ZipSink sinkHash
    print (digest :: Digest SHA256)
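If you'd like to play with sinkHash without a network connection, the sketch below feeds it an in-memory list of chunks instead. The helper hashChunks and the sample chunks are hypothetical; yieldMany, runConduitPure and .| come from the Conduit module.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit (runConduitPure, yieldMany, (.|))
import Crypto.Hash (Digest, SHA256, hash)
import Crypto.Hash.Conduit (sinkHash)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B

-- Hypothetical helper: hash a stream of chunks without concatenating them.
hashChunks :: [ByteString] -> Digest SHA256
hashChunks chunks = runConduitPure (yieldMany chunks .| sinkHash)

main :: IO ()
main = do
    let chunks = ["This is ", "a ", "test"]
    print (hashChunks chunks)
    -- Chunk boundaries do not matter: hashing the concatenated input
    -- in one call gives the same digest.
    print (hashChunks chunks == hash (B.concat chunks))
```

The digest depends only on the bytes, not on how they were chunked, which is what makes the streaming download above safe.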
Of course, if conduit can do it, you can do it too. Let's write a
hashFile implementation ourselves without conduit to get some
exposure to the raw hashing API:
#!/usr/bin/env stack
import Crypto.Hash
import System.Environment (getArgs)
import System.IO (withBinaryFile, IOMode (ReadMode))
import Data.Foldable (forM_)
import qualified Data.ByteString as B

hashFile :: HashAlgorithm ha => FilePath -> IO (Digest ha)
hashFile fp = withBinaryFile fp ReadMode $ \h ->
    let loop context = do
            chunk <- B.hGetSome h 4096
            if B.null chunk
                then return $ hashFinalize context
                else loop $! hashUpdate context chunk
     in loop hashInit

main :: IO ()
main = do
    args <- getArgs
    forM_ args $ \fp -> do
        digest <- hashFile fp
        putStrLn $ show (digest :: Digest SHA256) ++ " " ++ fp
This uses the pure update API provided by Crypto.Hash. We can also
use a mutating API in this case, which is slightly more efficient by
bypassing some buffer copies:
#!/usr/bin/env stack
import Crypto.Hash
import Crypto.Hash.IO
import System.Environment (getArgs)
import System.IO (withBinaryFile, IOMode (ReadMode))
import Data.Foldable (forM_)
import qualified Data.ByteString as B

hashFile :: HashAlgorithm ha => FilePath -> IO (Digest ha)
hashFile fp = withBinaryFile fp ReadMode $ \h -> do
    context <- hashMutableInit
    let loop = do
            chunk <- B.hGetSome h 4096
            if B.null chunk
                then hashMutableFinalize context
                else do
                    hashMutableUpdate context chunk
                    loop
    loop

main :: IO ()
main = do
    args <- getArgs
    forM_ args $ \fp -> do
        digest <- hashFile fp
        putStrLn $ show (digest :: Digest SHA256) ++ " " ++ fp
EXERCISE Use lazy I/O and the hashlazy function to implement
hashFile. (NOTE: I am not condoning lazy I/O here.)
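For comparison, one possible answer might look like the following. It is only a sketch: the name hashFileLazy is hypothetical, and the usual caveats about lazy I/O (such as the file handle staying open until the contents are fully consumed) apply.

```haskell
import Crypto.Hash (Digest, HashAlgorithm, SHA256, hashlazy)
import qualified Data.ByteString.Lazy as BL
import Data.Foldable (forM_)
import System.Environment (getArgs)

-- Hypothetical solution: BL.readFile reads the file lazily, chunk by
-- chunk, and hashlazy consumes those chunks as they are produced.
hashFileLazy :: HashAlgorithm ha => FilePath -> IO (Digest ha)
hashFileLazy fp = do
    contents <- BL.readFile fp
    -- Force the digest here so the whole file is consumed before we return.
    return $! hashlazy contents

main :: IO ()
main = do
    args <- getArgs
    forM_ args $ \fp -> do
        digest <- hashFileLazy fp
        putStrLn $ show (digest :: Digest SHA256) ++ " " ++ fp
```

Forcing the digest with $! is a deliberate choice in this sketch: it keeps the point at which the file is read predictable, which is the part of lazy I/O that tends to surprise people.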