At one point or another every single software developer in the world comes to a realization in his career when the time is ripe and it’s time
to write your own super cool programming language.
However the subject of creating your own programming language with an compiler is quite a complex one and can’t be tackled without some pre-research.
That’s how I’ve started reading Crafting Compiler in C, an aged but
really comprehensive book about developing your own compiler for an Ada-like programming language.
Second chapter describes writing a really simple micro language targeting pseudo assembly-like output in order to explain the core concepts of developing your
own compiler and writing an LL(1) parser.
Let’s try rewriting this micro compiler in OCaml, a language better suited for writing compilers that is becoming quite popular due to it’s clean syntax and strict evaluation semantics combined
with functional and object-oriented programming styles. If you are not familiar with OCaml try reading Real World OCaml first.
Instead of outputting pseudo assembly our micro compiler will output a real nasm source code which will be automatically compiled into a binary executable file.
So let’s start by describing simple micro language with an example source code
12345678
begin
a := 1;
b := a + 1;
b := b + 1;
write (a,b);
read(a,b);
write (a+10, b+10);
end
As you can see from example source code the program starts with begin keyword and ends with an end keyword. It has only integer variables which must be predefined by assignment operation before using in expressions, and it also has two simple functions read and write.
read takes a list of variable names separated by comma and reads user input from stdin into those variables
write takes a list of expressions and outputs them into stdout
Now in order to create an executable from this source code first we need to parse it. Since LL(1) type parser is enough to parse this kind of language, we’ll need only one character lookahead.
Unfortunately OCaml doesn’t have an unread operation like libc’s ungetc so we’ll need to define a simple stream reader which will have a mutable char and we will also count lines of source code read.
We’ll also define two utility functions which we will use later which will check if a character is alphanumeric and if it’s a digit.
1234567891011121314151617181920
(* stream *)typestream={mutablechr:charoption;mutableline_num:int;chan:in_channel}letopen_streamfile={chr=None;line_num=1;chan=open_infile}letclose_streamstm=close_instm.chanletread_charstm=matchstm.chrwithNone->letc=input_charstm.chaninifc='\n'thenlet_=stm.line_num<-stm.line_num+1incelsec|Somec->stm.chr<-None;cletunread_charstmc=stm.chr<-Somec(* character *)letis_digitc=letcode=Char.codecincode>=Char.code('0')&&code<=Char.code('9')letis_alphac=letcode=Char.codecin(code>=Char.code('A')&&code<=Char.code('Z'))||(code>=Char.code('a')&&code<=Char.code('z'))
Now for our parser we will be parsing source code one token at a time and we’ll be recursively calling parsing methods and matching tokens as we go.
We’ll define some additional utility functions which will be used during parsing including an Syntax_error exception which will be thrown if invalid
token is scanned.
scan will scan the stream for next token, that’s where our token recognition logic is in.
skip_blank_chars function will skip any number of blank characters including new line characters
next_token will return last scanned token or will scan stream for next token
match_next will match last scanned token or will scan stream for next token, match it and return it
match_token will match the last scanned token against the specified token in parameter or will scan stream for next token and match against it
(* token *)typetoken=Begin|End|Identifierofstring|Read|Write|Literalofint|Assign|LeftParen|RightParen|AddOp|SubOp|Comma|Semicolontypescanner={mutablelast_token:tokenoption;stm:stream}exceptionSyntax_errorofstringletsyntax_errorsmsg=raise(Syntax_error(msg^" on line "^(string_of_ints.stm.line_num)))(* skip all blank and new line characters *)letrecskip_blank_charsstm=letc=read_charstminifc=' '||c='\t'||c='\r'||c='\n'thenskip_blank_charsstmelseunread_charstmc;()(* scan a stream and return next token *)letscans=letstm=s.stminletc=read_charstminletrecscan_idenacc=letnc=read_charstminifis_alphanc||is_digitnc||nc='_'thenscan_iden(acc^(Char.escapednc))elselet_=unread_charstmncinletlc=String.lowercaseacciniflc="begin"thenBeginelseiflc="end"thenEndelseiflc="read"thenReadelseiflc="write"thenWriteelseIdentifieraccinletrecscan_litacc=letnc=read_charstminifis_digitncthenscan_lit(acc^(Char.escapednc))elselet_=unread_charstmncinLiteral(int_of_stringacc)inifis_alphacthenscan_iden(Char.escapedc)elseifis_digitcthenscan_lit(Char.escapedc)elseifc='+'thenAddOpelseifc='-'thenSubOpelseifc=','thenCommaelseifc=';'thenSemicolonelseifc='('thenLeftParenelseifc=')'thenRightParenelseifc=':'&&read_charstm='='thenAssignelsesyntax_errors"couldn't identify the token"letnew_scannerstm={last_token=None;stm=stm}letmatch_nexts=matchs.last_tokenwithNone->let_=skip_blank_charss.stminscans|Sometn->s.last_token<-None;tnletmatch_tokenst=match_nexts=tletnext_tokens=matchs.last_tokenwithNone->(skip_blank_charss.stm;lett=scansins.last_token<-Somet;t)|Somet->t
In order to generate an asm output we’ll define an generator type which will contain the output channel
and variable locations Hashtbl. Each variable’s location will be defined as an integer offset from esp
and since our micro language is simple we won’t have to handle advanced aspects of variable scope and stack
handling.
We’ll also need to distinguish between ordinary variables and an temporary ones.
Our temporary variables will be defined automatically and will start from __temp following
variable location offset. This way we won’t be storing temporary variables in Hashtbl and will be
computing their offset from their name. We’ll also define some additional generator helper functions to output asm code.
Now in order to compile a source code we need to create a new generator, open an stream on a source file, parse it,
compile the asm source code using nasm and link the generated .o file into an elf executable.
Our parsing method will be combined with semantics checking and will output asm code using generator functions which we will define later.
Program begins from begin keyword, ends with end keyword and has 0 or more statements in between.
123456789101112131415
letparsestmg=lets=(new_scannerstm)intryprogramsgwithEnd_of_file->syntax_errors"program reached end of file before end keyword"letprogramsg=ifmatch_tokensBeginthenlet_=generate_beginsginlet_=statementssginifmatch_tokensEndthenlet_=generate_endsgin()elsesyntax_errors"program should end with end keyword"elsesyntax_errors"program should start with begin keyword"letrecstatementssg=ifstatementsgthenstatementssgelse()
Each statement is either an read, write or an assignment to a variable.
123456789101112
letrecstatementssg=ifstatementsgthenstatementssgelse()letstatementsg=lett=next_tokensinifmatchtwithRead->readsg|Write->writesg|Identifieri->assignmentsg|_->falsethenifmatch_tokensSemicolonthentrueelsesyntax_errors"statement must end with semicolon"elsefalse
Each assignment statement has an identifier token on it’s left side followed by an assignment token and expression
on the right hand side.
1234567891011
letassignmentsg=letid=match_nextsinmatchidwithIdentifieri->(ifmatch_tokensAssignthenletnew_var=ifis_alloc_varsgithen0else1inletid2=expressionsg(1+new_var)inmatchid2withLiterall2->let_=generate_assignsgidid2intrue|Identifieri2->let_=generate_assignsgidid2intrue|_->syntax_errors"literal or identifier expected"elsesyntax_errors"assign symbol expected")|_->syntax_errors"identifier expected"
Each expression statement is primary optionally followed by an operation token and another primary.
Primary might also be an expression in curly brackets.
12345678910111213141516171819
letrecexpressionsgd=letprimarys=match(next_tokens)withLeftParen->(let_=match_tokensLeftPareninlete=expressionsg(d+1)inifmatch_tokensRightParenthenSomeeelsesyntax_errors"right paren expected in expression")|Identifieri->let_=match_tokens(Identifieri)inSome(Identifieri)|Literall->let_=match_tokens(Literall)inSome(Literall)|_->Noneinletlp=primarysinmatchlpwithSomel->(match(next_tokens)withAddOp->let_=match_tokensAddOpinaddopsgdl(expressionsg(d+1))|SubOp->let_=match_tokensSubOpinsubopsgdl(expressionsg(d+1))|_->l)|None->syntax_errors"literal or identifier expected"
Our micro language supports only two operations on integers, addition and subtraction, but
it can be easily extended to support more.
12345678910
letaddopsgdlr=match(l,r)with(Literall1,Literall2)->Literal(l1+l2)|(Identifieri1,Literall2)->generate_addsgdlr|(Literall1,Identifieri2)->generate_addsgdrl|_->syntax_errors"expected literal or identifier for add operation"letsubopsgdlr=match(l,r)with(Literall1,Literall2)->Literal(l1-l2)|(Identifieri1,Literall2)->generate_subsgdlr|(Literall1,Identifieri2)->generate_subsgdlr|_->syntax_errors"expected literal or identifier for sub operation"
write statement is just comma separated list of expressions.
12345678910111213141516171819
letwritesg=letrecexpressionsc=lete=(expressionsg1)inifmatchewithIdentifier_->let_=generate_writesgeintrue|Literal_->let_=generate_writesgeintrue|_->falsethenif(next_tokens)=Commathenlet_=match_tokensCommainexpressions(c+1)else(c+1)elsecinifmatch_tokensWritethenifmatch_tokensLeftParenthenifexpressions0>0thenifmatch_tokensRightParenthentrueelsesyntax_errors"right paren expected in write statement"elsesyntax_errors"write statement expected atleast one expression"elsesyntax_errors"left paren expected in write statement"elsesyntax_errors"write statement expected"
read statement is a comma separated list of identifiers
1234567891011121314151617
letreadsg=ifmatch_tokensReadthenifmatch_tokensLeftParenthenletids=identifierssinifids=[]thensyntax_errors"read statement expects comma seperated identifier(s)"elseifmatch_tokensRightParenthenlet_=generate_readssg(List.revids)intrueelsesyntax_errors"right paren expected in read statement"elsesyntax_errors"left paren expected in read statement"elsesyntax_errors"read statement expected"letidentifierss=letrecidensids=match(next_tokens)withIdentifieri->let_=match_nextsinletn=next_tokensinifn=Commathenlet_=match_tokensCommainidens(Identifieri::ids)elseidens(Identifieri::ids)|_->idsinidens[]
Now it’s finally time to generate some asm code, let’s start from begin and end of our program.
Our program will preallocate 1000 bytes on stack for variables, since all of our variables are static.
We’ll also need to define external libc functions scanf and printf.
1234567891011121314151617181920
letgenerate_begin_g=geng"extern printfextern scanfsection .data inf: db '%d', 0 ouf: db '%d', 10, 0section .text global mainmain: sub esp, 1000"letgenerate_end_g=geng" add esp, 1000exit: mov eax, 1 ; sys_exit mov ebx, 0 int 80h"
read and write statements will use libc scanf and printf functions to read integer variables
from stdin and output them on stdout.
12345678910111213141516
letgenerate_readsgid=matchidwithIdentifieri->(op2g"lea""eax"(var_addrsgi);pushg"eax";pushg"inf";opg"call""scanf";op2g"add ""esp""8")|_->syntax_errors"generate read called with invalid argument"letrecgenerate_readssg=List.iter(generate_readsg)letgenerate_writesgid=matchidwithIdentifieri->(pushg(varsgi);pushg"ouf";opg"call""printf";op2g"add ""esp""8")|_->syntax_errors"generate write called with invalid argument"
Assignment statement is just an allocation of a variable followed by a copying variable from one location
into another. Our copy function understands literal variables and translates their values directly into asm code.
1234567891011
letgenerate_assignsgab=matchawithIdentifieri->let_=alloc_varsgiingenerate_copysgab|_->syntax_errors"generate assign called with invalid argument"letgenerate_copysgab=matchawithIdentifieri->(matchbwithIdentifieri2->(op2g"mov ""eax"(varsgi2);op2g"mov "(varsgi)"eax")|Literall->op2g"mov "(varsgi)(string_of_intl)|_->syntax_errors"generate copy called with invalid argument")|_->syntax_errors"generate copy called with invalid argument"
Addition and subtraction operations use temporary variables to add and subtract
values from two variables and return result
12345678910111213141516171819202122232425
letgenerate_addsgdid1id2=match(id1,id2)with(Identifieri1,Identifieri2)->(letv=temp_varsgdinletvi=token_varsgvinlet_=generate_copysgvid1inlet_=op2g"add "vi(varsgi2)inv)|(Identifieri1,Literall2)->(letv=temp_varsgdinletvi=token_varsgvinlet_=generate_copysgvid1inlet_=op2g"add "vi(string_of_intl2)inv)|_->syntax_errors"generate exp called with invalid argument"letgenerate_subsgdid1id2=match(id1,id2)with(Identifieri1,Identifieri2)->(letv=temp_varsgdinletvi=token_varsgvinlet_=generate_copysgvid1inlet_=op2g"sub "vi(varsgi2)inv)|(Identifieri1,Literall2)->(letv=temp_varsgdinletvi=token_varsgvinlet_=generate_copysgvid1inlet_=op2g"sub "vi(string_of_intl2)inv)|(Literall1,Identifieri2)->(letv=temp_varsgdinletvi=token_varsgvinlet_=generate_copysgvid1inlet_=op2g"sub "vi(varsgi2)inv)|_->syntax_errors"generate exp called with invalid argument"
And that’s all of it! We now have a trivial micro compiler that generates binary executable file.
Source code can be found in my github micro repo. Writing an compiler
is quite complex and entertaining task but it’s definitely worth the time spend on!