Introduction to Compilers
A language translator is a program that translates a program written in a source
language into an equivalent program in an object language.
Keywords and phrases: source-language, object-language, syntax-directed,
compiler, assembler, linker, loader, parser, scanner, top-down, bottom-up,
context-free grammar, regular expressions
Introduction
A computer constructed from actual physical devices is termed an actual
computer or hardware computer. From the programming point of view, it is
the instruction set of the hardware that defines a machine.
An operating system is built on top of a machine to manage access to the
machine and to provide additional services. The services provided by the
operating system constitute another machine, a virtual machine.
A programming language provides a set of operations. Thus, for example,
it is possible to speak of a Java computer or a Haskell computer. For the
programmer, the programming language is the computer; the programming language
defines a virtual computer. The virtual machine for Simple consists of
a data area which contains the association between variables and values
and the program which manipulates the data area.
Figure M.N: Simple's Virtual Machine and Runtime Environment

    Memory
        Code segment
        Data segment
Figure M.N: C's Virtual Machine

    CPU
        Program counter
        Activation record
        Stack top
        Heap information
    Memory
        Code segment
            Subroutine 0
            ...
            Subroutine n
        Data segment
            Global data
            Stack (local data)
            Heap
Figure M.N: Nonrecursive language with subroutines

    Subroutine 0
        Code
        Data
    ...
    Subroutine n
        Code
        Data

(Each subroutine has its own statically allocated code and data areas.)
Between the programmer's view of the program and the virtual machine
provided by the operating system is another virtual machine. It consists
of the data structures and algorithms necessary to support the execution
of the program. This virtual machine is the run time system of the language.
Its complexity may range from virtually nothing, as in the case
of FORTRAN, to an extremely sophisticated system supporting memory management
and interprocess communication, as in the case of a concurrent programming
language like SR. The run time system for Simple includes the processing
unit capable of executing the code and a data area in which the values
assigned to variables are accessed through an offset into the data area.
User programs constitute another class of virtual machines.
A language translator is a program that translates
a program written in a source language into an equivalent program in an object language.
The source language is usually a high-level programming language and the
object language is usually the machine language of an actual computer.
From the pragmatic point of view, the translator defines the semantics
of the programming language: it transforms operations specified by the
syntax into operations of the computational model---in this case, some
virtual machine. Context-free grammars are used in the construction of
language translators. Since the translation is based on the syntax of the
source language, the translation is said to be syntax-directed.
A compiler is a translator whose source language is a high-level
language and whose object language is close to the machine language of
an actual computer. The typical compiler consists of an analysis phase
and a synthesis phase.
In contrast with compilers, an interpreter is a program which
simulates the execution of programs written in a source language. Interpreters
may be used either at the source program level, or an interpreter may be
used to interpret object code for an idealized machine. The latter is the
case when a compiler generates code for an idealized machine whose architecture
more closely resembles that of the source language.
There are several other types of translators that are often used in
conjunction with a compiler to facilitate the execution of programs. An
assembler is a translator whose source language (an assembly language) represents
a one-to-one transliteration of the object machine code. Some compilers
generate assembly code which is then assembled into machine code by an
assembler. A loader is a translator whose source and object languages
are machine language. The source language programs contain tables of data
specifying points in the program which must be modified if the program
is to be executed. A link editor takes collections of executable
programs and links them together for actual execution. A preprocessor
is a translator whose source language is an extended form of some high-level
language and whose object language is the standard form of the high-level
language.
The typical compiler consists of several phases, each of which passes
its output to the next phase:
- The lexical phase (scanner) groups characters into lexical units
or tokens. The input to the lexical phase is a character stream. The output
is a stream of tokens. Regular expressions are used to define the tokens
recognized by a scanner (or lexical analyzer). The scanner is implemented
as a finite state machine.
- The parser groups tokens into syntactical units. The output of the
parser is a parse tree representation of the program. Context-free grammars
are used to define the program structure recognized by a parser. The parser
is implemented as a pushdown automaton.
- The contextual analysis phase analyzes the parse tree for context-sensitive
information often called the static semantics. The output of the
contextual analysis phase is an annotated parse tree. Attribute grammars
are used to describe the static semantics of a program.
- The optimizer applies semantics-preserving transformations to the
annotated parse tree to simplify the structure of the tree and to facilitate
the generation of more efficient code.
- The code generator transforms the simplified annotated parse tree
into object code using rules which denote the semantics of the source language.
- The peephole optimizer examines the object code, a few instructions
at a time, and attempts to do machine-dependent code improvements.
Figure N.1: Traditional Compiler Structure

    Source code (in source language)
            |
            v
    Analysis (front end):
        Scanner -> Parser -> Context checker -> Intermediate code generator
            |
            v
    Synthesis (back end):
        Optimizer -> Code generator -> Peephole optimizer
            |
            v
    Target code (in target language)
The Scanner
The scanner groups the input stream (of characters) into a stream
of tokens (lexemes) and constructs a symbol table which is used later
for contextual analysis. The lexemes include
- keywords,
- identifiers,
- operators,
- constants: numeric, character, special, and
- comments.
The lexical phase (scanner) groups characters into lexical units
or tokens. The input to the lexical phase is a character stream. The output
is a stream of tokens. Regular expressions are used to define the tokens
recognized by a scanner (or lexical analyzer). The scanner is implemented
as a finite state machine.
Lex and Flex are tools for generating scanners in C. Flex is a faster
version of Lex.
The Parser
The parser groups tokens into syntactical units. The output of the
parser is a parse tree representation of the program. Context-free grammars
are used to define the program structure recognized by a parser. The parser
is implemented as a pushdown automaton.
Yacc and Bison are tools for generating bottom-up parsers in C. Bison
is the GNU version of Yacc. Jack is a tool for generating scanners and
top-down parsers in Java.
Symbol Tables and Error Handlers
In addition to a data stream passing through the phases of the compiler,
additional information acquired during a phase may be needed by a later
phase. The symbol table is used to store the names encountered in
the source program and their relevant attributes. The information in the symbol
table is used by the semantic checker when applying the context-sensitive
rules and by the code generator. The error handler is used to report
and recover from errors encountered in the source.
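Concretely, a symbol table can be as small as a map from names to attribute records. The following is a minimal sketch in Python; the class and method names are illustrative, not part of Simple's specification:

    # A minimal symbol table: each declared name maps to an attribute
    # record holding its type and its offset in the data segment.
    class SymbolTable:
        def __init__(self):
            self.entries = {}        # name -> {"type": ..., "offset": ...}
            self.next_offset = 0     # next free location in the data segment

        def declare(self, name, typ):
            if name in self.entries:
                raise KeyError(name + " is already declared")
            self.entries[name] = {"type": typ, "offset": self.next_offset}
            self.next_offset += 1

        def lookup(self, name):
            if name not in self.entries:
                raise KeyError(name + " referenced before declaration")
            return self.entries[name]

The context checker calls lookup when a name is referenced; the code generator uses the recorded offset to address the variable's location.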
Contextual Checkers
Contextual checkers analyze the parse tree for context-sensitive information
often called the static semantics. The output of the semantic analysis
phase is an annotated parse tree. Attribute grammars are used to describe
the static semantics of a program.
This phase is often combined with the parser. During the parse, information
concerning variables and other objects is stored in a symbol table.
The information is utilized to perform the context-sensitive checking.
Intermediate Code Generator
The data structure passed between the analysis and synthesis phases is
called the intermediate representation (IR) of the program.
A well designed intermediate representation facilitates the independence
of the analysis and synthesis (front-end and back-end) phases. Intermediate
representations may
- resemble assembly language, or
- be an abstract syntax tree.
Code Optimizer
Restructuring the parse tree to reduce its size, or to present an equivalent
tree from which the code generator can produce more efficient code, is called
optimization. Some optimizations that can be applied to the parse tree are
illustrated below using source code rather than the parse tree.
- Constant folding
    I := 4 + J - 5;  -->  I := J - 1;
  or
    I := 3; J := I + 2;  -->  I := 3; J := 5;
- Loop-constant code motion
  From:
    while (COUNT < LIMIT) do
        INPUT SALES;
        VALUE := SALES * ( MARK_UP + TAX );
        OUTPUT := VALUE;
        COUNT := COUNT + 1;
    end;
  to:
    TEMP := MARK_UP + TAX;
    while (COUNT < LIMIT) do
        INPUT SALES;
        VALUE := SALES * TEMP;
        OUTPUT := VALUE;
        COUNT := COUNT + 1;
    end;
- Induction variable elimination. Most program time is spent in the
body of loops, so loop optimization can result in significant performance
improvement. Often the induction variable of a for loop is used only within
the loop. In this case, the induction variable may be stored in a register
rather than in memory. When the induction variable of a for loop is
referenced only as an array subscript, it may instead be initialized to
the address of the first element of the array, incremented by the size of
an element, and used directly in the address calculation.
  From:
    for I := 1 to 10 do
        A[I] := A[I] + E
  to:
    for I := address of first element in A
          to address of last element in A
          increment by size of an element of A do
        A[I] := A[I] + E
- Common subexpression elimination
  From:
    A := 6 * (B+C);
    D := 3 + 7 * (B+C);
    E := A * (B+C);
  to:
    TEMP := B + C;
    A := 6 * TEMP;
    D := 3 + 7 * TEMP;
    E := A * TEMP;
- Strength reduction
    2*x  -->  x + x
    2*x  -->  shift left x
- Mathematical identities
    a*b + a*c  -->  a*(b+c)
    a - b  -->  a + (-b)
We do not illustrate an optimizer in the parser for Simple.
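To make the flavor of such a transformation concrete, here is a minimal constant-folding pass over a small expression tree, sketched in Python; the tuple representation and the name fold are illustrative assumptions, not part of Simple's implementation:

    # Constant folding over an expression tree. A node is an int (constant),
    # a str (variable), or a tuple (op, left, right).
    def fold(node):
        if not isinstance(node, tuple):
            return node                          # leaf: nothing to fold
        op, left, right = node
        left, right = fold(left), fold(right)    # fold subtrees first
        if isinstance(left, int) and isinstance(right, int):
            return {"+": left + right,
                    "-": left - right,
                    "*": left * right}[op]       # evaluate at compile time
        return (op, left, right)

    print(fold(("+", 4, 5)))              # 9
    print(fold(("-", ("+", 4, "J"), 5)))  # shape unchanged

Note that a pass this simple cannot rewrite 4 + J - 5 as J - 1; that requires reassociating operands, which a production optimizer would also handle.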
Code Generator
The code generator transforms the intermediate representation into
object code using rules which denote the semantics of the source language.
These rules define a translation semantics.
The code generator's task is to translate the intermediate representation
to the native code of the target machine. The native code may be
an actual executable binary, assembly code, or another high-level language.
Producing low-level code requires familiarity with such machine-level issues as
- data handling,
- machine instruction syntax,
- variable allocation,
- program layout,
- registers, and
- the instruction set.
The code generator may be integrated with the parser.
As the source program is processed, it is converted to an internal form.
The internal representation in the example is that of an implicit parse
tree. Other internal forms may be used which resemble assembly code. The
internal form is translated by the code generator into object code. Typically,
the object code is a program for a virtual machine. The virtual machine
chosen for Simple consists of three segments: a data segment, a code segment,
and an expression stack.
The data segment contains the values associated with the variables.
Each variable is assigned a location which holds the associated value.
Thus, part of the activity of code generation is to associate an address
with each variable. The code segment consists of a sequence of operations.
Program constants are incorporated in the code segment since their values
do not change. The expression stack is used to hold intermediate
values in the evaluation of expressions. The presence of the expression
stack indicates that the virtual machine for Simple is a "stack machine".
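The following Python fragment sketches such a stack machine, using the instruction names that appear in the translations below; the exact instruction set and its encoding as (opcode, operand) pairs are assumptions for illustration:

    # A minimal stack machine for Simple's virtual machine.
    def run(code, data_size):
        data = [0] * data_size            # data segment
        stack = []                        # expression stack
        pc = 0                            # index into the code segment
        while pc < len(code):
            op, arg = code[pc]
            pc += 1
            if op == "LD_INT":            # push a program constant
                stack.append(arg)
            elif op == "LD":              # push the value of a variable
                stack.append(data[arg])
            elif op == "STORE":           # pop a value into a variable
                data[arg] = stack.pop()
            elif op == "ADD":             # operators pop two, push one
                b = stack.pop()
                stack.append(stack.pop() + b)
            elif op == "BR":              # unconditional branch
                pc = arg
            elif op == "BR_FALSE":        # branch if the popped value is false
                if not stack.pop():
                    pc = arg
            elif op == "IN_INT":          # read an integer into a variable
                data[arg] = int(input())
            elif op == "OUT_INT":         # pop and print a value
                print(stack.pop())

For example, run([("LD_INT", 2), ("LD_INT", 3), ("ADD", None), ("OUT_INT", None)], 0) prints 5.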
Declaration translation
Declarations define an environment. To reserve space for the data values,
the DATA instruction is used. For example:

    integer x,y,z.        DATA 2

(Here the operand names the highest location reserved, so locations 0
through 2 hold x, y and z.)
Statement translation
The assignment, if, while, read and write statements are translated as
follows:
Assignment
    x := expr
            code for expr
            STORE x

Conditional
    if C then S1 else S2 end
            code for C
            BR_FALSE L1
            code for S1
            BR L2
        L1: code for S2
        L2:

While-do
    while C do S
        L1: code for C
            BR_FALSE L2
            code for S
            BR L1
        L2:

Input
    read X
            IN_INT X

Output
    write expr
            code for expr
            OUT_INT
If the code is placed in an array, then the label addresses must be
back-patched
into the code when they become available.
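A sketch of this technique, assuming code is generated into a Python list and that emit is a hypothetical helper returning the index of the instruction it appends:

    code = []

    def emit(op, arg=None):
        code.append([op, arg])        # mutable so the operand can be patched
        return len(code) - 1          # remember where this instruction is

    # Translating: if C then S1 else S2 end
    # ... emit code for C ...
    hole1 = emit("BR_FALSE", None)    # target (start of S2) not yet known
    # ... emit code for S1 ...
    hole2 = emit("BR", None)          # target (end of conditional) not yet known
    code[hole1][1] = len(code)        # back-patch: S2 starts here
    # ... emit code for S2 ...
    code[hole2][1] = len(code)        # back-patch: conditional ends here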
Expression translation
Expressions are evaluated on an expression stack. Expressions are translated
as follows:

    constant      LD_INT constant
    variable      LD variable
    e1 op e2      code for e1
                  code for e2
                  code for op
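In other words, code generation for expressions is a post-order walk of the expression tree. A sketch, reusing the tuple representation from the constant-folding example; the operator-to-opcode table is an assumption:

    # Generate stack-machine code for an expression by a post-order walk.
    OPCODES = {"+": "ADD", "-": "SUB", "*": "MULT", "/": "DIV"}

    def gen_expr(node, offsets, code):
        if isinstance(node, int):              # constant
            code.append(("LD_INT", node))
        elif isinstance(node, str):            # variable: offsets maps
            code.append(("LD", offsets[node])) # names to data locations
        else:
            op, e1, e2 = node
            gen_expr(e1, offsets, code)        # code for e1
            gen_expr(e2, offsets, code)        # code for e2
            code.append((OPCODES[op], None))   # code for op

    code = []
    gen_expr(("+", "x", 1), {"x": 0}, code)
    print(code)   # [('LD', 0), ('LD_INT', 1), ('ADD', None)]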
Peephole Optimizer
Peephole optimizers scan small segments of the target code
for standard replacement patterns of inefficient instruction sequences.
The peephole optimizer produces machine dependent code improvements.
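A sketch of the idea: slide a two-instruction window over the code and rewrite wasteful pairs. The pattern below assumes the target machine has a DUP instruction that duplicates the top of the stack; both the pattern and the instruction are illustrative:

    # Peephole optimization: a store immediately followed by a load of the
    # same location reloads a value that could have stayed on the stack.
    def peephole(code):
        out, i = [], 0
        while i < len(code):
            if (i + 1 < len(code)
                    and code[i][0] == "STORE"
                    and code[i + 1][0] == "LD"
                    and code[i][1] == code[i + 1][1]):
                out += [("DUP", None), ("STORE", code[i][1])]
                i += 2                          # both instructions consumed
            else:
                out.append(code[i])
                i += 1
        return out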
Figure N.2 contains a context-free grammar for a simple imperative programming
language. It will be used to illustrate the concepts in this chapter.
Figure N.2: Context-free grammar for Simple

    program ::= LET definitions IN command_sequence END
    definitions ::= e | INTEGER id_seq IDENTIFIER .
    id_seq ::= e | id_seq IDENTIFIER ,
    command_sequence ::= e | command_sequence command ;
    command ::= SKIP
        | READ IDENTIFIER
        | WRITE exp
        | IDENTIFIER := exp
        | IF bool_exp THEN command_sequence ELSE command_sequence FI
        | WHILE bool_exp DO command_sequence END
    exp ::= exp + term | exp - term | term
    term ::= term * factor | term / factor | factor
    factor ::= factor ^ primary | primary
    primary ::= NUMBER | IDENTIFIER | ( exp )
    bool_exp ::= exp = exp | exp < exp | exp > exp

(e denotes the empty string.)
Systematic development of a recursive descent parser
A parser groups sequences of tokens into larger meaningful units described
by a context-free grammar. The parser takes as input a stream of tokens
where each token contains both the class and spelling of
a token. The stream of tokens is processed sequentially and currentToken
contains the token of immediate interest. The output of the parser is a
syntax tree. The tree may or may not be built explicitly.
There are four steps in the systematic construction of a recursive descent
parser.
- Transform the grammar into proper form.
- Determine the sets First[E] and Follow[N] for each right-hand side E and
non-terminal N of the grammar.
- Construct parsing procedures from the grammar.
- Construct the parser.
Figure N.3 summarizes the grammar transformation rules.
Figure N.3: Grammar Transformation Rules
- Convert the grammar to EBNF.
- Remove left recursion: replace N ::= E | N F with N ::= E (F)*.
- Left-factor the grammar: replace N ::= E F G | E F' G with N ::= E (F | F') G.
- If N ::= E is not recursive, remove it and replace all occurrences of N
in the grammar with E.
First the grammar is converted to EBNF.
The resulting grammar must have a single production rule for each non-terminal
symbol. Next, rules containing left recursion are transformed to rules
which do not contain left recursion. Left recursion occurs when the same
non-terminal appears both at the head of the rule and as a left-most symbol
on the right-hand side. The parser can enter an infinite loop if this transformation
is not done. Mutual recursion must also be eliminated but it is more difficult.
Next, the grammar is simplified by replacing non-terminals with their defining
body. This should be done bottom up, stopping when recursion is encountered.
Finally, simplify the grammar by factoring the right-hand sides. This makes
it easier for the parser to select the correct grammar rule.
The first and follow sets are used by the parser to select the applicable
grammar rule. Figure N.4 summarizes the rules for computing the First and
Follow sets.
Figure N.4: First[E] and Follow[N]

    First[lambda]  =  {}                         (lambda is the empty string)
    First[t]       =  {t}                        t is a terminal
    First[N]       =  First[E]                   where N ::= E
    First[E F]     =  First[E] union First[F]    if E generates lambda
                   =  First[E]                   otherwise
    First[E|F]     =  First[E] union First[F]
    First[E*]      =  First[E]
    Follow[N]      =  {t}                        in context N t, t a terminal
                   =  First[F]                   in context N F, F a non-terminal
First[E] is the set of terminal symbols that can start a string
generated by E. Follow[N] is the set of terminal symbols that can appear
in strings that follow those strings generated by N. The importance of
the first and follow sets becomes apparent when the grammar rules are converted
to parsing procedures.
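For example, rewriting the expression grammar of Figure N.2 in EBNF as exp ::= term (('+' | '-') term)* and term ::= factor (('*' | '/') factor)*, the rules above give First[primary] = {NUMBER, IDENTIFIER, '('}; since neither factor nor term generates the empty string, First[exp] = First[term] = First[factor] = First[primary]. For the iteration inside exp, First[('+' | '-') term] = {'+', '-'}, which is exactly the test the parsing procedure for exp will use to decide whether to continue its loop.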
Figure N.5 summarizes the rules for converting the EBNF grammar to a
collection of parsing procedures.
Figure N.5: EBNF to Parsing Procedures
- For each grammar rule N ::= E, construct a parsing procedure
    parseN { parse E }
- Refine parse E:

    If parse E is:     then refine to:
    parse lambda       skip
    parse t            accept(t), where t is a terminal
    parse N            parseN, where N is a non-terminal
    parse E F          parse E; parse F
    parse E|F          if currentToken.class in First[E] then
                           parse E
                       else if currentToken.class in First[F] then
                           parse F
                       else report a syntactic error
    parse E*           while currentToken.class in First[E] do
                           parse E
If parse E is parse lambda (recall lambda
is the empty string), then parse E is the skip command. If parse
E is parse t (where t is a terminal symbol), then parse E
is accept(t); if the current token is already known to be t, then
acceptIt suffices.
If parse E is parse N (where N is a non-terminal),
then parse E is the call parseN. If parse E is parse E F,
then parse E is {parse E; parse F}. If parse E is parse E|F,
then parse E is

    if currentToken.class in First[E] then
        parse E
    else if currentToken.class in First[F] then
        parse F
    else
        report a syntactic error

where First[E] and First[F] are disjoint. If parse E is parse E*,
then parse E is

    while currentToken.class in First[E] do
        parse E

where First[E] is disjoint from Follow[E*].
The parser consists of the parsing procedures constructed from the grammar
rules, together with the object currentToken and the auxiliary procedures
accept(t), which checks that the current token is t and then fetches the
next token, and acceptIt, which fetches the next token unconditionally.
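A sketch of such a parser in Python for the additive part of the expression grammar, with the left recursion removed as described above; the token interface and the simplified parseTerm are illustrative:

    # Recursive descent parsing of: exp ::= term (('+' | '-') term)*
    tokens = [("NUMBER", "1"), ("+", "+"), ("NUMBER", "2"), ("EOF", "")]
    pos = 0                                  # index of currentToken

    def accept(t):
        global pos
        if tokens[pos][0] != t:
            raise SyntaxError("expected " + t)
        pos += 1                             # fetch the next token

    def parseExp():                          # exp ::= term (('+'|'-') term)*
        parseTerm()
        while tokens[pos][0] in ("+", "-"):  # First[('+'|'-') term]
            accept(tokens[pos][0])           # acceptIt
            parseTerm()

    def parseTerm():                         # simplified: term ::= NUMBER
        accept("NUMBER")

    parseExp()
    accept("EOF")                            # 1 + 2 parses successfully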
Systematic development of a table-driven parser
Given a grammar which satisfies the restrictions specified in the recursive
descent parser construction, a table-driven parser may be constructed using
the top-down parsing algorithm.
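A sketch of the table-driven approach for a tiny illustrative grammar S ::= '(' S ')' | 'x': the parse table, indexed by the non-terminal on top of an explicit stack and the current token, selects the right-hand side to expand. All names here are hypothetical:

    # Table-driven top-down parsing: an explicit stack replaces recursion.
    TABLE = {("S", "("): ["(", "S", ")"],   # selected when lookahead is '('
             ("S", "x"): ["x"]}             # selected when lookahead is 'x'

    def parse(tokens):
        stack = ["S"]                        # start symbol
        pos = 0
        while stack:
            top = stack.pop()
            if top in ("(", ")", "x"):       # terminal: must match input
                if tokens[pos] != top:
                    raise SyntaxError("expected " + top)
                pos += 1
            else:                            # non-terminal: consult the table
                rhs = TABLE.get((top, tokens[pos]))
                if rhs is None:
                    raise SyntaxError("no rule for " + top)
                stack.extend(reversed(rhs))  # push rhs, leftmost on top
        return pos == len(tokens)

    print(parse(list("((x))")))              # True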
Systematic development of a scanner
A scanner groups sequences of characters into tokens described by a regular
grammar. The scanner takes as input a stream of characters. The stream
of characters is processed sequentially and currentChar contains
the character of immediate interest. The characters defining a token are
collected into a string and the class of the token is identified. The output
of the scanner is a stream of tokens. Each token contains information concerning
its class and spelling.
There are three steps in the systematic construction of a scanner.
- Transform the regular expressions into an EBNF grammar.
- Transcribe each EBNF production rule N ::= E to a scanning procedure
scanN, whose body is determined by E.
- Construct the scanner.
Figure N.6 summarizes the rules for transforming the regular expressions
to an EBNF grammar.
Figure N.6: RE to EBNF
- Each regular expression REi defining a token class Ti
is put into the EBNF form Ti ::= REi.
- A regular expression Sep is constructed defining the symbols which separate
tokens.
- The EBNF production S ::= Sep* (T0 | ... | Tn) is added
to the grammar.
That is, for each regular expression RE defining a token T, the EBNF rule
T ::= RE is constructed; a regular expression Sep defining the strings that
separate tokens is constructed; and the EBNF production
S ::= Sep* (T0 | ... | Tn) is defined.
Figure N.7: EBNF to Scanning Procedures
- For each grammar rule Ti ::= Ei, construct a scanning
procedure scanTi { scan Ei }.
- Refine scan Ei:

    If scan Ei is:    then refine to:
    scan lambda       skip
    scan ch           takeIt, where ch is a character
    scan N            scanN, where N is a non-terminal
    scan E F          scan E; scan F
    scan E|F          if currentChar in First[E] then
                          scan E
                      else if currentChar in First[F] then
                          scan F
                      else report a lexical error
    scan E*           while currentChar in First[E] do
                          scan E
The scanner is developed from an EBNF grammar (which must be non-self-embedding)
as follows:
- Objects:
  - currentChar contains the current character.
  - currentToken contains the current token, its spelling and its class.
- Convert the grammar to EBNF with a single production rule for each non-terminal
symbol.
- The scanner consists of:
  - the procedures developed in step (2), enhanced to record the token's
class and spelling;
  - a procedure scanToken that scans Separator* Token, sets
currentToken.spelling to the character string scanned, and sets
currentToken.class to the class of the token; and
  - the auxiliary procedures:
    - start, which sets currentToken.spelling to the empty string;
    - getChar, which appends currentChar to currentToken.spelling
and fetches the next character into currentChar;
    - finish, which sets currentToken.class to the identified class
(used for simple disjoint classes); and
    - screen, which sets currentToken.class to the identified class
(used for complex classes that require additional analysis to determine
the class).
If currentChar is part of the token under construction, the procedure
takeIt appends currentChar to currentToken and fetches the next character;
if currentChar is not part of the token under construction, the procedure
leaveIt fetches the next character without adding currentChar to
currentToken.
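A sketch of these conventions in Python for two token classes, identifiers and numbers, separated by blanks; the structure mirrors the outline above, though the details are illustrative:

    # Scanner sketch: appending a character plays the role of takeIt;
    # skipping a separator plays the role of leaveIt.
    import string

    source = "count 42"
    pos = 0
    currentChar = source[0]

    def nextChar():
        global pos, currentChar
        pos += 1
        currentChar = source[pos] if pos < len(source) else "\0"

    def scanToken():
        spelling = []
        while currentChar == " ":                          # Separator*: leaveIt
            nextChar()
        if currentChar in string.ascii_letters:
            while currentChar in string.ascii_letters:
                spelling.append(currentChar); nextChar()   # takeIt
            return ("IDENTIFIER", "".join(spelling))
        if currentChar in string.digits:
            while currentChar in string.digits:
                spelling.append(currentChar); nextChar()   # takeIt
            return ("NUMBER", "".join(spelling))
        return ("EOF", "")

    print(scanToken())   # ('IDENTIFIER', 'count')
    print(scanToken())   # ('NUMBER', '42')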
Attribute Grammars and Contextual Constraints
Context-free grammars are not able to completely specify the structure
of programming languages. For example, declaration of names before reference,
number and type of parameters in procedures and functions, the correspondence
between formal and actual parameters, name or structural equivalence, scope
rules, and the distinction between identifiers and reserved words are all
structural aspects of programming languages which cannot be specified using
context-free grammars. These context-sensitive aspects of the grammar
are often called the static semantics of the language. The term
dynamic semantics refers to semantics proper, that is, the relationship
between the syntax and the computational model. Even in a simple language
like Simple, context-free grammars are unable to specify that variables appearing
in expressions must have an assigned value. Context-free descriptions of
syntax are supplemented with natural language descriptions of the static
semantics or are extended to become attribute grammars.
Attribute grammars are an extension of context-free grammars which permit
the specification of context-sensitive properties of programming languages.
Attribute grammars are actually much more powerful and are fully capable
of specifying the semantics of programming languages as well.
As an example, the following partial syntax of an imperative programming
language requires the declaration of variables before reference to the
variables.

    P ::= D B
    D ::= V ...
    B ::= C ...
    C ::= V := E | ...

However, this context-free syntax does not indicate this restriction. The
declarations define an environment in which the body of the program executes.
Attribute grammars permit the explicit description of the environment and
its interaction with the body of the program.
Since there is no generally accepted notation for attribute grammars,
attribute grammars will be represented as context-free grammars which permit
the parameterization of non-terminals and the addition of where statements
which provide further restrictions on the parameters. Figure N.8
is an attribute grammar for declarations.
Figure N.8: An attribute grammar for declarations

    P ::= D(SymbolTable) B(SymbolTable)
    D(SymbolTable) ::= ... V(insert(V in SymbolTable)) ...
    B(SymbolTable) ::= C(SymbolTable) ...
    C(SymbolTable) ::= V := E(SymbolTable), Error(if V not in SymbolTable)
        | ...
Parameters which are passed down the parse tree, such as the symbol table
inherited by the program body B, are called inherited attributes, while
parameters which are passed up the parse tree, such as the symbol table
constructed by the declarations D, are called synthesized attributes.
Attribute grammars have considerable
expressive power beyond their use to specify context-sensitive portions
of the syntax and may be used to specify:
- context-sensitive rules,
- evaluation of expressions, and
- translation.
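In implementation terms, an inherited attribute is simply a parameter threaded through a tree walk. A sketch of declaration-before-use checking over an abstract syntax tree; the program shape used here is an illustrative assumption:

    # Declaration-before-use checking: the symbol table is synthesized from
    # the declarations D and inherited by (passed down into) the body B.
    # Program shape (illustrative): (declared_names, list_of_commands),
    # where a command is ("assign", var, expr) and an expr is a name or int.
    def check_program(program):
        decls, body = program
        symtab = set(decls)               # synthesized from D
        for kind, var, expr in body:      # inherited by B
            if var not in symtab:
                raise NameError(var + " is not declared")
            if isinstance(expr, str) and expr not in symtab:
                raise NameError(expr + " is not declared")

    check_program((["x", "y"], [("assign", "x", 1), ("assign", "y", "x")]))
    # check_program((["x"], [("assign", "z", 1)]))  # would raise NameError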
Historical Perspectives and Further Reading
For information on compiler construction using Lex and Yacc see [SchFre85].
Pratt [Pratt84] emphasizes virtual machines. Other compiler construction
tools include Eli, PCCTS, Flex/Bison, Lex/Yacc, the Amsterdam Compiler Kit,
and Jack.
Exercises
- (translation) Construct a translation semantics for
  - Simple,
  - HTML to TeX/LaTeX, and
  - TeX/LaTeX to HTML.
- Construct a scanner and a parser for expressions (use a grammar from chapter 2).
- Construct an attribute grammar for expressions.
- Construct a calculator using the attribute grammar for expressions.
- Construct a scanner for Simple.
- Construct a parser for Simple.
- Construct a code generator for Simple.
- Construct an interpreter for Simple.
- Construct an interpreter for BASIC.
by Anthony A. Aaby. Send
comments to aabyan@wwc.edu