TL;DR Writing parser combinator library in Clojure from scratch
Recently I’ve had an fortunate opportunity to participate in a commercial project that was heavy on parsing and I’ve decided to use Clojure as my main tool for doing it. As you might all know Clojure is not new to the game and is quite mature functional programming language with a hidden power of Lisp. One of the distinct features of Clojure which I personally adore is nrepl and Interactive Programming incremental development style which it cultivates. This is by no means something new, Common Lisp has it via slime/swank, some of Scheme implementations have geiser support and almost all Smalltalk variants could be considered as full interactive programming environments, one notable example being Pharo. There are some more specialized interactive programming environments like Processing and SuperCollider which deal with either sound or visual or even both but today we’ll be talking about parsing and all things related instead. Before I’ll start explaining what we are actually trying to build I would like to introduce you to a really nice open source library that I’ve successfully used in a commercial project I’ve mentioned before, called kern. But as all things good there is always some value in trying to build your own even better, at least for me it was an insightful journey. So yeah as title says we’ll be building parser combinator library in Clojure just to learn a thing or two but solely just for fun. The full source for impatient ones is in my ccp repo. So let’s get started then.
I’m no good with long explanations and to keep things simple I’ll start with two references that I’ve heavily used while building ccp. One is being kern as you’ve guessed it and the second one is parsec. Both are excellent parser combinator libraries and are good references for self-study. Before we’ve delve into what is parser combinator I would like to talk about a simple concept of a parser. Now what is a parser? Something that parses a thing and returns result as you’ve guessed it, right?! Wait, isn’t that a function of a –> b. Remarkable similarity, indeed. A function takes some arbitrary data a, parsers it and returns result b. Wait, what if it can’t parse data and return result? Then we’ll return an error instead of parsing result. In Haskell our type definition for parser would be something like this
1
|
|
As some you of you may have noticed our Parser is actually a ReaderT a Either String b monad transformer in disguise. As we are parsing a we’ve also need to constantly mutate it as it’s being parsed so probably we are mistaken and we actually need an StateT a Either String b monad transformer instead, but wait, aren’t we going to build everything in Clojure? Yeah, right so we don’t need any strict type definitions. We’ll just use simple function that takes data as input and returns parsing result and will mutate state of input data as we go. And for errors will have some special deftype ParserError. Any parser that can’t parse data will return ParserError with detailed error message description.
1
|
|
What is this ps and state? Patience my friend I’ll explain it later. First let’s talk about data we are going to parse. What should it be? Well it can be literally anything but we need some contract in order to parse it bit by bit right? So let’s introduce a protocol called ParseStream. Any type that implements our ParseStream protocol is good enough to be an input data for our parser. We can go back and forth element by element traversing this data stream using next-elem, prev-elem and current-elem methods. Our ParseStream implementors need to track state and mutate it as we go over the data with parsers. This is actually a data + state concept. As we are writing everything in Clojure, we’ll just use deftype with mutable state variable and data. I know about dangers of using mutable variables but as we are only giving limited access to our type via strict protocol there is no way our parser can harm our inner ParseStream state in any way, so we are kinda safe. In order to be able to produce a ParseStream out of any data we’ll introduce another protocol called ParseStreamFactory. Any type/object that implements this protocol will be convertible into a valid ParseStream. As any ParseStream tracks state it can reset itself to initial position using reset method, and record and restore it’s state using get-state and restore-state methods. We can also check if it’s state is initial or ending using start-state? and end-state? methods. Actually our ParseStream is just some good old Iterator pattern from Object-Orient world. We are combining Functional and Object-Oriented programming using Clojure in order to produce the most elegant solution to the given problem at hand. This method of programming is quite powerful, best of both worlds indeed. Now you know what ps and state in ParserError is, it’s ParseStream data and it’s state when the ParserError occurred. Also we would like to have a human readable description of parser error so our ParseStream will have describe-error method too. We’ll also define some helper methods to work with ParserError.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
|
Now let’s define String implementation of ParseStream. Our StringParseStream will go character by character incrementing/decrementing indexed position p. The code is pretty straightforward. We just save state of parse stream into index position starting at -1 using unsynchronized-mutable variable p. We don’t need any thread-safety as our parse stream is guaranteed to be used only by one thread. Also in order to describe error we need to correctly inform our parsing library user about exact location in String where that error occurred. For this we use private function line-col which counts lines and columns and returns exact line and column number of that index position.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
|
Now let’s implement the same but for any arbitrary Clojure sequence. Note how describe-error method differs compared to StringParseStream implementation. Rest of the code is almost exactly the same except that we are using sequence related functions instead of String related ones.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Now in order for most types to be convertible into ParseStream we need to extend them to implement ParseStreamFactory protocol.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
|
Before we’ll start writing our sample parsers I would like to introduce an macro that will help us in minimizing our code needed for writing almost any parser. This macro is called parsed and is simply a let + if combination that gives us ability to define a variable e initialized to some parsed value p, check if this value is parser error or not and depending on that return either t or f expression.
1 2 3 |
|
Let’s define our first parser generating function called satisfy. Given a predicate function pd and m message it will give back a parser that will get next element from ParseStream, check if it satisfies the predicate. If it does it we’ll return it, otherwise it will return ParserError with message m. Note how we would like to support two kinds of predicate functions, with 1 argument and with 2 arguments second being actual ParseStream itself. To distiguish those two we’ll use another private function called n-args.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Note how satisfy is both suitable for String streams as well as sequence streams. Before we’ll define some simple String stream oriented parsers let’s define one more crucial function that is necessary to define our parsers. What this function does is that it takes a parser p, saves ParseStream ps’s state into st variable, applies p to ps, checks if result is ParserError, and if it is, it restores ps’s state to value it had before p’s application. Now simply put this function allows us to convert any parser that might fail into a parser that doesn’t affects ParseStream in case of failure and restores it’s state to the way it was before parser was applied to it. We can also define restore parser transformer that will restore stream every time. This parser transformers are actually called a parser combinator. So what is parser combinator? It’s a function that takes as an input one or many parser functions and returns a parser that is result of combining those parsers. By combining we actually mean that parser combinator introduces an behaviour over already defined parser behaviour. This allows us to stack/combine parsers like a lego pieces and it’s actually a very very powerful concept. In essence this concept is actually the real corner stone of functional programming and it’s conceptually heart of power.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Note that we aren’t limited to parser(s) as input for a parser combinator as we can have any kind of value that drives resulting behaviour of our parser. In that essence the function satisfy that we’ve defined before is also an parser combinator. Okey zen reached, let’s move on already! Now that we know what parser combinators actually are let’s define some simple parser combinator called chr that can give us parser for any specific character. We also define satisfy-char parser combinator that is actually used to be sure that value we are trying to parse is of character type. It’s not strictly necessary for chr but we’ll define it anyway.
1 2 3 4 5 6 7 8 9 |
|
Now using those parser combinators we can define some commonly used parsers
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
Let’s try using parsers and parser combinators that we’ve defined so far. For this we’ll use parse function that is actually a helper to convert input data to ParseStream if it isn’t already a ParseStream and apply p to it
1 2 3 4 5 6 |
|
Now some simple tests
1 2 3 4 5 6 7 8 9 10 11 |
|
Sometimes we need our parser to return a value different from the one that has been parsed. Let’s define two parser combinators one called return which substitutes a return value with provided one and another one called null which just returns nil instead of parse result
1 2 3 4 5 6 7 8 9 |
|
What if we want to match beginning or end of ParseStream. We can do that since our ParseStream protocol has two methods called start-state? and end-state?.
1 2 3 4 5 6 7 |
|
But what if we want to just generally parse any arbitary data, for example while parsing a SequenceParseStream. We can do that too. We can even parse data with some particular elem using = and return it as parse result or some optionally specified failure message with name of the element we needed our parser to be equal to.
1 2 3 4 5 6 7 8 |
|
We can even parse beginning and end of line.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
Wait, this all starts remind us about regular expressions, right?! Well ofcourse we can even make a parser that will be able to match some regular expression. But we have two limitations here, in order for our regular expression parser to be able to work with standard Java regular expression methods it needs to be String or any char element based ParseStream and it needs to implement CharSequence interface. So let’s define ParseStreamCharSequenceFactory protocol and implement StringParseStreamCharSequence deftype to extend StringParseStream to be able to convert it to CharSequence at any time when needed. This functionality will be used for our match parser combinator that will have regular expression as it’s input and will return parser that can match that regular expression. Lots of code, I know. This is probably the most complicated parser combinator in entire ccp library. I won’t start explaining every bit of it and leave that to you guys as a self study.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
|
Let’s define another parser combinator. This one we’ll call collect and it’ll just collect exactly n number of elements from parse stream and will return vector containing those elements. Note how we are using transient vector as accumulator in our loop/recur. Using transient vector is more effective performance-wise as we don’t need full power of persistent data structure and instead just want to accumulate result.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
We can do the same but instead of collecting result just consume it and return nil.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
We can even parse exactly some string
1 2 3 4 5 6 7 8 9 10 11 |
|
What if we want to parse one or more parsers and return value only of the last one? Well we can define a parser combinator that will allow us to do that.
1 2 3 4 5 6 7 8 |
|
And exactly same but return first parser result instead.
1 2 3 4 5 6 7 8 9 10 |
|
Now let’s define some basic yet very powerful monadic bind >>= parser combinator. Note how we apply function to result of the parse result of our parser. We can even have two kinds of functions, one with parse stream as second result and one without it. We could have defined our << and >> parser combinators using >>= parser combinator but our previously define loop/recur version will work faster and can grow infinitely without consuming the stack.
1 2 3 4 5 6 7 8 |
|
This can be used to define a letp macro which is actually a parser combinator too in some sense, well not exactly but close.
1 2 3 4 5 6 7 8 9 |
|
This can be used to manipulate parser results in any way and return result any way we want to.
1 2 3 4 5 6 7 8 |
|
How about some sequential parser combinator that collects result into a vector. This one is kinda similar to collect but applies parsers instead of counting number of elements.
1 2 3 4 5 6 7 8 9 10 11 12 |
|
We can use <*> to define <str*> and <keyword*> parser combinators. Those just apply str and keyword functions to parser results.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
|
Now let’s define yet another crucial alternatives <|> parser combinator. This takes any number of parsers as input, and when parsing, tries each one of them and stops when either all of them fail or one of them consumes input while failing. The code is quite complicated as we need to deal with generating error message as well as checking state of parse stream each time we are trying to parse some data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
Speaking of parser failures what if we want to get a failure message out of parser without actually applying parser to a parse stream. Well we have some edge cases but still we can do that with a really simple trick by making a parser combinator that can make any parser fail. We’ll just define some empty string stream and will use it to apply parser to it and return it’s error message as if it was applied to parse stream itself. We can then have fail-message function that will return the failure message itself.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
|
Now that we have fail-message function we can define no parser combinator that will make any parser fail in case it succeeds and will succeed in case of failure.
1 2 3 4 5 |
|
Sometimes we want some optional value so why note define an option parser combinator too.
1 2 3 4 5 6 7 |
|
What if we want to repetitively parse some value one or more times and return vector of results. We can do that with many and many1 parser combinators.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
Our next parser combinator is quite complex so I’ll try to explain it in more detail. Sometimes when parsing programming languages or any symbolic expressions we need our operators to have different precedence based on some predefined rules that we have. For example in math expression like 1+2*3 will be interpreted as (+ 1 (* 2 3)) since * operator has higher precedence compared to + operator. In order parse this kind of expressions we’ll define two special parser combinators called op and op-expr that need to be used with each other. op parser combinator converts any parser that can parse some operator into a parser that returns parsed operator and it’s precedence value. This value is then used by op-expr in order return expression in exact precedence order as define by op parsers. The code is kinda complex and I’ll live it as exercise for you to study and understand.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
|
I can go on with defining this parsers but this will make me only tired and this post only longer and hard to read so instead I’ll just stop and show you guys something really awesome. This is exp parser combinator that can generate parsers from a simple Parser Expression Language inspired by Parsing Expression Grammar and Regular Expressions. I won’t describe it as it’s still work-in-progress and might change in the future. Instead I’ll show you some examples of using it. Those interested in source code should look in git repo exp.clj file also note that what you might see there is quite ugly and complex but I hope to improve it in the future.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 |
|
That’s it folks, we’ve defined a clean, simple and yet beautiful parser combinator library that is easy to grasp. Happy parsing!