Romain Semler | Compilation

Compiler

A compiler written in C

A compiler for the L language

During the compilation course at university, we were asked to write, in C, a compiler that reads and translates a language invented for the project: the L language.

The compiler itself

A compiler is a computer program that reads a program written in a source language (here, the L language) and translates it into a target language so that the resulting program can be executed. It also reports any errors found in the source program.

In computer programming, a language consists above all of specific keywords and syntactic rules that must be followed. The compiler therefore needs to know the complete lexicon of the L language and the syntax in which its code is written.

The analyzers

That is why our compiler has two analyzers: a lexical analyzer and a syntactic analyzer. The lexical analyzer reads the source as sequences of characters called lexemes. For each lexeme it determines a category (such as "number", "variable identifier" or "function identifier") and keeps the characters that were read, that is, the word itself. The lexical analyzer does not check whether the code is correctly written; it only requires that each word read belongs to the lexicon of the language. The syntactic analyzer is more complex: it examines the order of the elements of the source program to check that the syntax of the language is respected (for example, if the word read is an opening parenthesis, a matching closing parenthesis must appear further on).
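To make the lexical step concrete, here is a minimal sketch of a hand-written lexical analyzer. It is not the project's code: the token names, the `next_token` function and the tiny lexicon (two keywords, numbers, identifiers, parentheses) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token categories; the real compiler's codes are not shown here. */
enum token { TOK_END, TOK_NUMBER, TOK_IDENT, TOK_TANTQUE, TOK_FAIRE,
             TOK_LPAREN, TOK_RPAREN, TOK_UNKNOWN };

/* Reads one lexeme from *src, copies it into lexeme (capacity cap),
   advances *src past it, and returns the category of the lexeme. */
enum token next_token(const char **src, char *lexeme, size_t cap)
{
    assert(src && *src && lexeme && cap > 1);
    const char *p = *src;
    size_t n = 0;
    enum token kind;

    while (isspace((unsigned char)*p))      /* skip blanks between lexemes */
        p++;

    if (*p == '\0') {
        kind = TOK_END;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < cap) lexeme[n++] = *p;
            p++;
        }
        kind = TOK_NUMBER;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) {
            if (n + 1 < cap) lexeme[n++] = *p;
            p++;
        }
        lexeme[n] = '\0';
        /* A word is a keyword if it is in the lexicon, an identifier otherwise. */
        if (strcmp(lexeme, "tantque") == 0)    kind = TOK_TANTQUE;
        else if (strcmp(lexeme, "faire") == 0) kind = TOK_FAIRE;
        else                                   kind = TOK_IDENT;
    } else {
        kind = (*p == '(') ? TOK_LPAREN
             : (*p == ')') ? TOK_RPAREN
             : TOK_UNKNOWN;                 /* character outside the lexicon */
        lexeme[n++] = *p++;
    }
    lexeme[n] = '\0';
    *src = p;
    return kind;
}
```

Calling `next_token` repeatedly on a string such as "tantque (x1) faire 42 )" yields one category per word. The stray closing parenthesis at the end is accepted without complaint, since checking the order of the words is the job of the syntactic analyzer.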

The grammar

To let the syntactic analyzer know the syntax of the language, and therefore use it, a grammar specific to the L language is used. In computing, a grammar is a set of syntactic constraints. These constraints are expressed as rewriting rules (or production rules) of the following form: A -> BC | DE (meaning "A can be rewritten as B followed by C, or as D followed by E"). The L grammar has more than 60 rewriting rules, and the syntactic analyzer must know all of them to work correctly. The rules therefore have to be programmed into the analyzer itself (without any mistakes, of course), and each rule is coded following the same recurring structure.
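One common way to program such rules is to write one function per left-hand symbol and let the current input symbol select the alternative. The sketch below does this for a toy rule, A -> b c | d e, where b, c, d and e are single characters. The rule and the names `expect`, `parse_A` and `matches_A` are illustrative assumptions, not part of the L grammar or of the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy grammar (an assumption for illustration, not a rule of L):
     A -> 'b' 'c' | 'd' 'e'
   Each terminal is one character of the input string. */
static const char *input;   /* remaining input; the current symbol is *input */

/* Consumes the current symbol if it is the expected terminal. */
static bool expect(char terminal)
{
    if (*input != terminal)
        return false;        /* syntax error: unexpected symbol */
    input++;
    return true;
}

/* One function per rule: the current symbol selects the alternative. */
static bool parse_A(void)
{
    switch (*input) {
    case 'b': return expect('b') && expect('c');   /* A -> b c */
    case 'd': return expect('d') && expect('e');   /* A -> d e */
    default:  return false;                        /* no alternative applies */
    }
}

/* Returns true when text is exactly one A, with nothing left over. */
bool matches_A(const char *text)
{
    assert(text != NULL);
    input = text;
    return parse_A() && *input == '\0';
}
```

With this sketch, `matches_A("bc")` and `matches_A("de")` succeed, while `matches_A("be")` fails because the second symbol does not fit the alternative chosen by the first.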
Here is a code sample from our compiler that checks the rule for the "WHILE" loop structure:

void instructionTantque(void) {
    affiche_balise_ouvrante("instructionTantque", 1);
    /* The statement must start with the TANTQUE (while) keyword. */
    if (uniteCourante == TANTQUE) {
        afficher_element(uniteCourante);
        uniteCourante = yylex();   /* read the next lexical unit */
        expression();              /* loop condition */
        /* The condition must be followed by the FAIRE (do) keyword. */
        if (uniteCourante == FAIRE) {
            afficher_element(uniteCourante);
            uniteCourante = yylex();
            instructionBloc();     /* loop body */
        } else {
            erreur("erreur de syntaxe -> 'FAIRE' attendu.");
        }
    } else {
        erreur("erreur de syntaxe -> 'TANTQUE' attendu.");
    }
    affiche_balise_fermante("instructionTantque", 1);
}

The variable "uniteCourante" ("current unit") holds the last lexical unit read, that is, the category of the last word returned by the lexical analyzer through yylex(). That is why several conditions are tested against it. If it is equal to the keyword "TANTQUE" (while), the next word is read, then the condition is analyzed, and so on. The function thus follows the order of keywords imposed by the syntax of the language. If that order is respected, the statement is syntactically correct (and another part of the compiler handles what follows). Otherwise, one of the "erreur" branches is reached and a syntax error is reported.
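To try this rule in isolation, the sketch below wraps the same control flow in a small self-contained harness. Everything around `instructionTantque` is a stand-in and an assumption: the token codes, a `yylex` that reads from an array instead of a source file, an `erreur` that counts errors rather than printing them, a one-token `expression`, a block reduced to an empty pair of braces, and the `analyser_tantque` entry point. The display calls are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in token codes; the project's real values are not shown in the text. */
enum { FIN, TANTQUE, FAIRE, NOMBRE, ACCOLADE_OUVRANTE, ACCOLADE_FERMANTE };

static const int *flux;     /* stubbed token stream, terminated by FIN */
static int uniteCourante;   /* last lexical unit read */
static int nb_erreurs;      /* number of syntax errors seen */

/* Stub lexer: returns the next token of the array, then FIN forever. */
static int yylex(void) { return *flux == FIN ? FIN : *flux++; }

/* Stub error handler: counts errors instead of reporting them. */
static void erreur(const char *message) { (void)message; nb_erreurs++; }

/* Simplified condition: a single number. */
static void expression(void)
{
    if (uniteCourante == NOMBRE) uniteCourante = yylex();
    else erreur("expression attendue");
}

/* Simplified block: an empty pair of braces. */
static void instructionBloc(void)
{
    if (uniteCourante != ACCOLADE_OUVRANTE) { erreur("'{' attendu"); return; }
    uniteCourante = yylex();
    if (uniteCourante == ACCOLADE_FERMANTE) uniteCourante = yylex();
    else erreur("'}' attendu");
}

/* Same structure as the rule shown above, without the display calls. */
static void instructionTantque(void)
{
    if (uniteCourante == TANTQUE) {
        uniteCourante = yylex();
        expression();
        if (uniteCourante == FAIRE) {
            uniteCourante = yylex();
            instructionBloc();
        } else {
            erreur("'FAIRE' attendu");
        }
    } else {
        erreur("'TANTQUE' attendu");
    }
}

/* Analyzes one while statement from a FIN-terminated array;
   returns the number of syntax errors found. */
int analyser_tantque(const int *unites)
{
    assert(unites != NULL);
    flux = unites;
    nb_erreurs = 0;
    uniteCourante = yylex();
    instructionTantque();
    return nb_erreurs;
}
```

Feeding it the sequence TANTQUE, NOMBRE, FAIRE, ACCOLADE_OUVRANTE, ACCOLADE_FERMANTE, FIN gives zero errors. Removing FAIRE from the sequence gives exactly one, raised by the inner "erreur" branch.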

The explanation stops here to keep things simple. The source code of our compiler is available on GitHub (link below).

Information about the project

Title	Compilator
Description	Compiler for the L language.
Language used	C
Year of work	2016
Access to code	Link