Like many other Haskellers even after considering myself an expert in the language I still get that amazing “wow” moment, when I learn another elegant thing about its basic concepts, from time to time.
Recently I was working on a project, which required a primitive parsing functionality, and I discovered that amazing results in terms of performance can be achieved using the good old State monad. This post is about what I discovered and how.
Contents
Introduction
So I was working on a library of codecs for the PostgreSQL’s Binary Format as part of the Hasql’s ecosystem. The first thing I discovered was that Postgres’ authors are very bad at documentation. Google all day, and you won’t find even a reminder of specs of the Binary Format. Nada. Though very frustrated by that fact I was still determined, so I found a solution.
After digging around I found the “libpqtypes” library, which implemented the format in C. Luckily it was open source, so I decided to use it as a sort of a spec. My discoveries concerning the State monad hit me, when I was translating that C code.
About binary encodings
Prior to going further I have to say a little about binary encodings. An important thing about them is that unlike human-readable formats they are strictly positional. This means that contexts are always unambiguous and are always declared ahead. One typical example is the following:
byte1, byte2, byte3, byte4, byte5, byte6, ..., byteN
HEAD | BODY
Where the first four bytes are the head, which encodes a 32-bit integer (8 * 4 = 32). This integer specifies the amount of the following bytes, which make up the body. While this isn’t the only possible scenario, it proves that it is possible to encode a data of an arbitrary length without any delimiters. So no quotes, braces, commas, dashes or anything like that to distinguish contexts and no escaping is required.
Here’s another example:
byte1, byte2, byte3, ..., byteN
HEAD | BODY
The single byte of the head is a Word8, which determines the alternative scenarios of how the body should be decoded. E.g., if it is 0
, then the body is a text, if it is 1
, then it is a list of ints and etc.
Turns out, mixing the aforementioned strategies together gives you all you need to encode pretty much anything. This means that a parser for this encoding will require no backtracking or alternatives, or any other complex strategies needed to parse human-readable formats, and this is what makes the State monad sufficient for implementing a binary parser.
The parser
Let’s start with a code snippet from the “libpqtypes” library. It is extracted from a function, which is supposed to decode a value of Postgres’ interval type from a binary format.
So what’s going on in this piece of code? It’s easy to spot that in the first step they must be doing something with the first 8 bytes, the following step somehow associates the next 4 bytes with days and the next step associates another 4 bytes with months. I already know that the first 8-byte section actually encodes an amount of microseconds as an integer.
Let’s start translating it by extracting the sections from a byte string:
Now we have an input byte string split into 3 useful sections, which we can later process. When talking about such things as sections or parts though, a seasoned Haskeller might instinctively start thinking about composition, but hold that thought for now.
In our code there’s already an easily spottable pattern and an annoying and error-prone noise caused by the explicit updating of byteString
. Imagine having more sections to split into and accidentally confusing byteString'
with byteString''
- that’s a bug the compiler won’t spot for you, since the types are the same.
Turns out, this pattern is abstracted from long ago, and it’s the place where the State monad steps in. Let’s refactor.
So much better. Sure, it’s one extra line of code, but it’s so much easier to reason about. But wait, there’s a new pattern right there, let’s refactor again.
And here comes the composition! A smaller independent piece is used to make up a bigger one of the same type.
Okay, so we have our three sections now, but it’s still all binary data, which we need to decode. I already know that months and days are integers encoded in the Big Endian format. A decoder for it is pretty simple:
I won’t dive into details of the implementation here, since it’s not related to the subject. Just accept it as a given.
Using this function we can now decode each section into a number:
Another pattern. Refactoring:
Still a pattern. Refactoring:
And now it seriously starts to look like a typical parser library in action. You might have spotted though that during our last refactoring we’ve introduced a little overhead by redundantly splitting the bytestring remainder in the months parser, but it’s a little price to pay for a general solution.
Let’s finish our parser:
There’s one stylistic problem left: we’re mixing a monadic code with a pure computation. It’s always better to isolate those things. So here comes the Applicative style and one last refactoring:
Look at the where section. There’s a whole library right there! Let’s export it.
Neat! Our own library of composable components. We could of course spend our day shifting indexes, constantly repeating ourselves and hoping we don’t mess up, like they do in C, but that’s not how we roll in Haskell! We make reusable libraries out of nowhere, without even intending to do so.
Do you feel how intuitive the code we’ve just produced is? I mean, an interval is a computation performed on ints of sizes 8, 4 and 4. An int is a certain decoding performed on a bytestring of some length. A bytestring of some length is simply a splitting of a part from the beginning of the input bytestring. It’s all there in the code stated almost just as I said. That’s what the Declarative Programming is all about. A program is no longer a set of index shifts, loops and temporary variables, and all that kinda noise, it is a composition of clear and simple definitions.
You might be wondering now about the parsers that can fail. Turns out, the following updates are all it takes to implement the support for that:
Want your parser to be a transformer? Use EitherT instead.
Now, as I understand, the mysterious Zepto parser from the “attoparsec” library is essentially the same thing as the one we’ve just implemented. In the benchmarks I ran it was performing on par. However the performance of both Zepto and our own parser for the described task is an order of magnitude better compared to Attoparsec, forget about Parsec.
An extra taste
Aside from everything related to parsers I want to give another example I extracted during that project of how nicely the State monad abstracts from other typical imperative patterns.
C (from libpqtypes):
Haskell:
And another more practical function:
You might have noticed a certain similarity between the splitAt and the divMod functions - it’s that they both result in a pair. So the next time you’ll have to deal with a function returning a pair, you will know that the State monad might well turn out to be not any less useful for its composition.
Cheers!