TL;DR: Rewriting the micro compiler in OCaml using ocamllex and ocamlyacc
In my previous post I talked about writing a micro compiler in OCaml in under 300 lines of source code. There are a number of ways to make our work easier and the number of source code lines significantly smaller.
Let’s rewrite our micro compiler using tools called lexer and parser generators. We’ll be using ocamllex and ocamlyacc, which are distributed with the OCaml compiler and are modeled after the famous lex and yacc tools for Unix operating systems. Those tools have better modern analogues called flex and bison, which are described in detail in the book Flex & Bison: Text Processing Tools. Nowadays, if you are writing a professional compiler in OCaml, I strongly suggest you consider using sedlex and menhir instead, as ocamllex and ocamlyacc are quite dated and lack some significant features that their modern analogues have, such as Unicode support for lexing, parameterized parser generation and a built-in grammar interpreter.

So what are lexer and parser generators? To put it simply, ocamllex and ocamlyacc take special .mll and .mly definition files, in which lexer and parser semantics are mixed with OCaml source code, and generate .ml source files that do the actual token generation and parsing for you. Pretty neat indeed, and it’s actually easier to use than it sounds, so let’s rewrite our micro compiler using those tools. We’ll be using the original source code of the micro compiler as a reference only, as the entire code base needs to be changed. You can see the end result of our rewrite in the micro git repository branch called simple. For the actual description of the micro language, see my previous post. So let’s get started!
First, let’s create the files for our lexer and parser, called lexer.mll and parser.mly. Now let’s define our tokens in parser.mly. As you can see, they are defined using a special syntax that is described in the ocamlyacc manual.
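The original listing is not reproduced here, so here is a rough sketch of what the token declarations could look like; the exact token names and the `%left` precedence declaration are assumptions of this sketch:

```ocaml
/* parser.mly: token declarations (names are assumed in this sketch) */
%token <int> INT_LITERAL           /* integer literals, e.g. 42 */
%token <string> IDENTIFIER         /* variable names */
%token BEGIN END READ WRITE        /* micro keywords */
%token LPAREN RPAREN SEMICOLON COMMA
%token ASSIGN PLUS MINUS           /* :=, + and - */
%token EOF                         /* end of input */
%left PLUS MINUS                   /* + and - are left associative */
```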
Now let’s define the lexical semantics for those tokens in the lexer.mll file. The tokens we defined in parser.mly will be generated into a parser.mli interface file, so first of all let’s include them from the Parser module by adding a header to the lexer.mll file. The part between { and } is written in OCaml syntax and will be copied into the generated lexer.ml file without any changes. For a description of the ocamllex definition file format, consult its manual.
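A minimal header might be nothing more than opening the generated Parser module:

```ocaml
(* lexer.mll header: copied verbatim into the generated lexer.ml *)
{
  open Parser   (* brings the token type declared in parser.mly into scope *)
}
```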
Next, let’s define some lexical semantics for blank characters, digits and alphanumeric characters. As you can see, those are defined using regular expression syntax and can reference each other. This way we can define the identifier lexical semantics by just referencing the alpha and digit definitions.
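The named regular expressions could look roughly like this (the exact character classes are an assumption):

```ocaml
(* named regular expressions; later definitions can reference earlier ones *)
let blank = [' ' '\t' '\r']
let digit = ['0'-'9']
let alpha = ['a'-'z' 'A'-'Z']
let identifier = alpha (alpha | digit | '_')*
```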
Now let’s define a rule that will give us the actual tokens. As you can see, the part between { and } is in OCaml syntax and just gives back the tokens we’ve defined.
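A sketch of the rule for operators, punctuation and integer literals (keywords and identifiers are added a bit further down); the rule name micro is an assumption:

```ocaml
(* turn raw characters into the tokens declared in parser.mly *)
rule micro = parse
  | '+'  { PLUS }
  | '-'  { MINUS }
  | '('  { LPAREN }
  | ')'  { RPAREN }
  | ';'  { SEMICOLON }
  | ','  { COMMA }
  | ":=" { ASSIGN }
  | digit+ as num { INT_LITERAL (int_of_string num) }
```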
Now let’s add line counting and syntax error reporting to our lexer.mll header.
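Roughly, the extended header keeps a mutable line counter and a helper that raises a custom exception (the exception name and the message format are assumptions of this sketch):

```ocaml
{
  open Parser
  exception Syntax_error of string
  let line_num = ref 1                (* current line, used in error messages *)
  let syntax_error msg =
    raise (Syntax_error (Printf.sprintf "%s on line %d" msg !line_num))
}
```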
Next we’ll add rules for counting new lines and reporting a syntax error if the lexer encounters an unknown token. We also need to generate the EOF token when the lexer encounters the end of the file. To skip blank characters, as those aren’t needed in our micro compiler, we’ll just recursively call the lexer’s micro rule, providing it with lexbuf, which is described in the manual.
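These cases are appended to the micro rule from above; a sketch:

```ocaml
  (* appended to the micro rule defined earlier *)
  | '\n'   { incr line_num; micro lexbuf }   (* count the line, keep lexing *)
  | blank+ { micro lexbuf }                  (* skip whitespace by recursing *)
  | eof    { EOF }                           (* tell the parser we're done *)
  | _      { syntax_error "couldn't identify the token" }
```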
Woah woaw wow, much forget, wait!!! Aren’t we missing something?! Ah yes, identifiers, indeed! First of all let’s define a table of known keywords in the header as we need to handle them differently.
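Inside the header, something like this would do (the keyword spellings follow the micro language from the previous post):

```ocaml
  (* added to the lexer.mll header: keyword spelling -> token *)
  let keywords = Hashtbl.create 4
  let () =
    List.iter (fun (kwd, tok) -> Hashtbl.add keywords kwd tok)
      [ ("begin", BEGIN); ("end", END); ("read", READ); ("write", WRITE) ]
```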
And now let’s handle the actual identifier tokens.
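One more case appended to the micro rule; keywords and plain identifiers match the same regular expression, so the keyword table is consulted first:

```ocaml
  | identifier as ident {
      (* a keyword gets its own token, anything else is a plain identifier *)
      try Hashtbl.find keywords ident
      with Not_found -> IDENTIFIER ident
    }
```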
Congratulations, we have a lexer ready; now it’s parsing time! Let’s define a parser for our micro code. The parser definition syntax is pretty straightforward and is reminiscent of a BNF definition. If you don’t understand it just by looking at it, or don’t know what BNF is, I suggest you take a look at the description of the definition syntax in the ocamlyacc manual.

As you can see, our parser starts with a program definition, which consists of a begin statement, followed by statements, followed by an end statement, and ends with the EOF token. The part between { and } is OCaml source code that is executed after the parsing of the definition is done. When the parser finishes its job it raises the End_of_file exception, which we’ll handle as the end of parsing. So basically, when the parser sees the BEGIN token it just executes the generate_begin function, which we’ll define shortly. The same happens with the END token, only now we execute the generate_end function instead. The statements definition is recursive and just references a statement followed by a semicolon followed by more statements, or nothing at all.
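Here is a sketch of the grammar described above. Since ocamlyacc only runs an action at the end of a production, the begin and end statements get their own small rules so that generate_begin and generate_end fire as soon as those tokens are reduced; the rule names are assumptions, and the Codegen module is created in the next step:

```ocaml
%start program
%type <unit> program
%%

program:
  | begin_statement statements end_statement EOF { raise End_of_file }
  ;

begin_statement:
  | BEGIN { Codegen.generate_begin () }
  ;

end_statement:
  | END { Codegen.generate_end () }
  ;

statements:
  | statement SEMICOLON statements { () }
  | /* empty */                    { () }
  ;
```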
Now let’s write some code to generate the assembly for the begin and end statements. We’ll put all our code generation functions into a separate module called Codegen, so let’s create a new file called codegen.ml and add some code generation functions to it.
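A sketch of codegen.ml; the exact assembly is an assumption. Here it targets 32-bit NASM, collects the output in a buffer, declares the libc functions we’ll call, and reserves 1024 bytes of stack for micro variables:

```ocaml
(* codegen.ml: assembly is accumulated in a buffer and dumped by the driver *)
let code = Buffer.create 1024
let emit line = Buffer.add_string code (line ^ "\n")

(* program prologue: libc externs, format strings, main and a stack frame *)
let generate_begin () =
  emit "extern printf";
  emit "extern scanf";
  emit "global main";
  emit "section .data";
  emit "  read_format db \"%d\", 0";
  emit "  write_format db \"%d\", 10, 0";
  emit "section .text";
  emit "main:";
  emit "  push ebp";
  emit "  mov ebp, esp";
  emit "  sub esp, 1024"          (* room for named variables and temporaries *)

(* program epilogue: tear down the frame and return 0 from main *)
let generate_end () =
  emit "  mov esp, ebp";
  emit "  pop ebp";
  emit "  mov eax, 0";
  emit "  ret"
```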
Now, for parsing expressions we need some way to count the depth of the expression, to be used for placing temporary variables: we’ll just increment it as we go deeper and then reset it to 0 after the statement is over. This approach is not an optimized one, as it means some extra stack usage for AST nodes at the same depth, but for the sake of simplicity we’ll leave it as it is. So now our parser.mly header should look like this.
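A sketch of the header; in this sketch the counter simply numbers the temporaries used by the current statement, which is the equivalent variant mentioned a little further down:

```ocaml
%{
  open Codegen                       (* code generation helpers from codegen.ml *)

  (* temporary-variable counter for the statement being parsed *)
  let depth = ref 0
  let depth_reset () = depth := 0
  (* a fresh temporary name such as __temp1, __temp2, ... *)
  let temp () = incr depth; "__temp" ^ string_of_int !depth
%}
```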
So let’s define the description of a statement in the parser.
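A sketch of the statement rule; the helper rules (identifiers, expressions, expression) and the codegen functions it calls are introduced in the rest of the post:

```ocaml
statement:
  | READ LPAREN identifiers RPAREN  { generate_read $3 }
  | WRITE LPAREN expressions RPAREN { generate_write $3; depth_reset () }
  | IDENTIFIER ASSIGN expression    { generate_assign $1 $3; depth_reset () }
  ;
```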
Let’s start with the simpler read statement. As you can see, the read statement uses identifiers only, so we don’t need to do anything with the expression depth and we aren’t calling the depth_reset function. In order to get the identifier list, we’ll just parse identifiers and collect them into a list.
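Collecting a comma separated list of identifiers into an OCaml list could look like this:

```ocaml
identifiers:
  | IDENTIFIER COMMA identifiers { $1 :: $3 }
  | IDENTIFIER                   { [$1] }
  ;
```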
Now let’s generate some assembly for reading data into identifiers. First, let’s define a couple of operation helper functions. The read statement itself is pretty straightforward: it just generates a scanf function call for every identifier.
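A sketch of the helpers and of generate_read; it relies on var_addr, which is shown in the next snippet (and has to sit above these functions in the actual codegen.ml):

```ocaml
(* helpers for common instruction patterns *)
let push x = emit ("  push " ^ x)
let call f nargs =
  emit ("  call " ^ f);
  emit (Printf.sprintf "  add esp, %d" (nargs * 4))  (* caller pops the C arguments *)

(* read: one scanf("%d", &var) call per identifier *)
let generate_read ids =
  List.iter (fun id ->
    emit ("  lea eax, " ^ var_addr id);   (* scanf needs the variable's address *)
    push "eax";
    push "read_format";
    call "scanf" 2) ids
```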
As you can see, we are using the var_addr function, which we haven’t defined yet. This function will give us the actual offset address on the stack based on the name of the identifier. We also handle __temp identifiers separately, as those are just temporary variables which are discarded every time a statement is over; their offset is specified in their name after the __temp part, so for example __temp1 is a temporary variable with offset 1. Let’s write some code that handles all of this. All the named variables that are defined are saved into the vars Hashtbl, which maps each variable name to its stack offset.
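A sketch of the variable table and var_addr. To keep the sketch simple, temporaries are parked in the upper half of the 1024-byte frame reserved in generate_begin instead of being placed right after the last named variable, which is a small deviation from the description above:

```ocaml
(* named variables: name -> slot number 1, 2, 3, ... below ebp *)
let vars = Hashtbl.create 16

let var_addr name =
  if String.length name > 6 && String.sub name 0 6 = "__temp" then
    (* a temporary: its slot is encoded in the name, e.g. __temp1 *)
    let t = int_of_string (String.sub name 6 (String.length name - 6)) in
    Printf.sprintf "[ebp-%d]" (512 + t * 4)
  else begin
    (* a named variable: allocate a slot on first use *)
    if not (Hashtbl.mem vars name) then
      Hashtbl.add vars name (Hashtbl.length vars + 1);
    Printf.sprintf "[ebp-%d]" (4 * Hashtbl.find vars name)
  end
```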
Now, for the write statement we can have expressions instead of just identifiers, so let’s define those.
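The expressions rule mirrors the identifiers one:

```ocaml
expressions:
  | expression COMMA expressions { $1 :: $3 }
  | expression                   { [$1] }
  ;
```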
And the code that generates write statements is almost the same as for read, but instead of calling scanf we’ll be calling the printf function.
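A sketch of generate_write; the operand helper used here (literal versus stack slot) is shown a couple of snippets below:

```ocaml
(* write: one printf("%d\n", value) call per expression value *)
let generate_write values =
  List.iter (fun v ->
    push (operand v);                (* an immediate or a dword [ebp-N] slot *)
    push "write_format";
    call "printf" 2) values
```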
Let’s now declare an expression.
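In this sketch an expression’s semantic value is simply a string naming where its result lives: an identifier, an integer literal kept as text, or a temporary. The basic cases could look like this:

```ocaml
expression:
  | IDENTIFIER               { $1 }                (* the value lives in a variable *)
  | INT_LITERAL              { string_of_int $1 }  (* literals are carried as text *)
  | LPAREN expression RPAREN { $2 }
  ;
```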
For simple cases an expression can be just a variable or a literal, so let’s write some code that generates assembly for those.
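One way to handle both cases is a small helper in codegen.ml that turns such a value into an assembly operand:

```ocaml
(* true when a value is an integer literal rather than a variable name *)
let is_literal v = int_of_string_opt v <> None

(* literals become immediates, variables and temporaries live on the stack *)
let operand v = if is_literal v then v else "dword " ^ var_addr v
```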
Let’s now handle addition and subtraction expressions.
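A sketch of the two extra cases of the expression rule; literal-plus-literal is folded right in the action, and everything else goes through a codegen function and a fresh temporary:

```ocaml
  | expression PLUS expression  {
      if is_literal $1 && is_literal $3
      then string_of_int (int_of_string $1 + int_of_string $3)   (* fold constants *)
      else generate_add (temp ()) $1 $3
    }
  | expression MINUS expression {
      if is_literal $1 && is_literal $3
      then string_of_int (int_of_string $1 - int_of_string $3)
      else generate_sub (temp ()) $1 $3
    }
```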
As you can see, simple cases, such as when we have a literal on both sides, are handled by the parser itself. Now let’s write some more code generation functions for expressions. We’ll be using temporary variables, so we need to identify the bottom variable on the stack, which sits at the highest offset; from there we’ll use the depth of the current expression to get a unique position for the temporary variable. This part could be written slightly differently: for example, instead of counting depth we could count the temporary variables and reset the counter each time a statement is over. Both approaches are almost the same, so we’ll just leave it like that.
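A sketch of the code generation for additions and subtractions; the result lands in the temporary slot named by dest, and that name is handed back to the parser as the value of the expression:

```ocaml
(* emit "dest := a <op> b" through eax and return the temporary's name so
   that enclosing expressions can refer to it *)
let generate_binop instr dest a b =
  emit ("  mov eax, " ^ operand a);
  emit (Printf.sprintf "  %s eax, %s" instr (operand b));
  emit ("  mov dword " ^ var_addr dest ^ ", eax");
  dest

let generate_add dest a b = generate_binop "add" dest a b
let generate_sub dest a b = generate_binop "sub" dest a b
```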
Now the only thing left to do is to generate some code for the assignment, and our parser is done.
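And a sketch of generate_assign, which just moves the right-hand side value into the variable’s slot:

```ocaml
(* assignment: load the right-hand side value and store it into the variable *)
let generate_assign id value =
  emit ("  mov eax, " ^ operand value);
  emit ("  mov dword " ^ var_addr id ^ ", eax")
```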
And now for the final part: we need to write the actual compile function. It will parse the micro source file specified as an argument and compile it using nasm and gcc. Let’s create a micro.ml file and write the final bit of code.
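A sketch of the driver; the output file names and the nasm/gcc flags (32-bit, non-PIE) are assumptions and may need tweaking on your system:

```ocaml
(* micro.ml: parse the source file, dump the assembly, assemble and link it *)
let compile filename =
  let input = open_in filename in
  let lexbuf = Lexing.from_channel input in
  (try Parser.program Lexer.micro lexbuf with
   | End_of_file -> ()                         (* raised by the parser when done *)
   | Lexer.Syntax_error msg ->
     prerr_endline msg; exit 1
   | Parsing.Parse_error ->
     prerr_endline ("parse error on line " ^ string_of_int !Lexer.line_num); exit 1);
  close_in input;
  (* write the assembly collected by the Codegen buffer *)
  let out = open_out "out.asm" in
  output_string out (Buffer.contents Codegen.code);
  close_out out;
  (* assemble and link against libc *)
  ignore (Sys.command "nasm -f elf32 out.asm -o out.o");
  ignore (Sys.command "gcc -m32 -no-pie out.o -o a.out")

let () =
  if Array.length Sys.argv < 2 then
    prerr_endline "usage: micro <source file>"
  else
    compile Sys.argv.(1)
```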
Phew, writing posts is exhausting, but our micro compiler is finally ready! Let’s test it. You need to have nasm and gcc installed on your Linux system in order to run it.
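Something along these lines, assuming the project is built with ocamlbuild (which knows how to drive ocamllex and ocamlyacc) and that a small test.micro program is at hand:

```
$ ocamlbuild micro.native && ./micro.native test.micro
$ ./a.out
```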
And it works! Congratulations, we are done! So what did we learn today? Building a compiler is actually easier than it seems! Happy compiling everyone!