Keywords and phrases: Regular expression, regular grammar, context-free grammar, parse tree, ambiguity, BNF, context sensitivity, attribute grammar, inherited and synthesized attributes, scanner, lexical analysis, parser, static semantics.
Syntax is concerned with the structure of programs; layout, with their appearance. The syntactic elements of a programming language are determined by the computation model and by pragmatic concerns. There are well-developed tools (regular, context-free and attribute grammars) for the description of the syntax of programming languages. Grammars are sets of rewriting rules and may be used both to recognize and to generate programs. Grammars are independent of computational models and are useful for the description of the structure of languages in general.
Context-free grammars are used to describe the bulk of the language's structure; regular expressions are used to describe the lexical units (tokens); attribute grammars are used to describe the context sensitive portions of the language. Attribute grammars are described in a later chapter.
Definition 2.1: Alphabet and Language
- Sigma: An alphabet Sigma is a nonempty, finite set of symbols.
- L: A language L over an alphabet Sigma is a collection of strings of elements of Sigma. The empty string lambda is the string with no symbols at all.
- Sigma*: The set of all possible finite strings of elements of Sigma is denoted by Sigma*. Lambda is an element of Sigma*.
A string is a finite sequence of symbols from an alphabet Sigma. The concatenation of two strings v and w is the string vw obtained by appending the string w to the right of the string v.
Programming languages require two levels of description; the lowest level is that of a token. The tokens of a programming language are the keywords, identifiers, constants and other symbols appearing in the language. In the program

void main() { printf("Hello World\n"); }

the tokens are

void, main, (, ), {, printf, (, "Hello World\n", ), ;, }

The alphabet for the language of the lexical tokens is the character set, while the alphabet for a programming language is the set of lexical tokens.
A string in a language L is called a sentence and is an element of Sigma*. Thus a language L is a subset of Sigma*. Sigma+ is the set of all possible nonempty strings of Sigma, so Sigma+ = Sigma* - { lambda }. A token is a sentence in the language for tokens and a program is a sentence in the language of programs.
If L0 and L1 are languages, then L0L1 denotes the language {xy | x is in L0, and y is in L1 }. That is L0L1 consists of all possible concatenations of a string from L0 followed by a string from L1.
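For finite languages the definition of concatenation can be checked directly. A minimal Python sketch (the function name concat is ours):

```python
# L0L1 = { xy | x in L0, y in L1 } for finite languages represented as sets.
def concat(L0, L1):
    return {x + y for x in L0 for y in L1}

L0 = {"a", "ab"}
L1 = {"b", ""}   # the empty string lambda is a legitimate member
print(sorted(concat(L0, L1)))  # ['a', 'ab', 'abb'] -- 'ab'+'b' and 'a'+'b' coincide with other pairs
```

Note that the result can have fewer than |L0| * |L1| elements, since different pairs may concatenate to the same string.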
A grammar consists of a finite collection of grammatical categories (e.g. noun phrase, verb phrase, article, noun, verb), individual words (elements of the alphabet), rules describing the order in which elements of the grammatical categories must appear, and a most general grammatical category. Figure 2.1 contains a context-free grammar for a fragment of English.
Figure 2.1: G0, a grammar for a fragment of English

The grammatical categories are: S, NP, VP, D, N, V.
The words are: a, the, cat, mouse, ball, boy, girl, ran, bounced, caught.
The most general category is S, a sentence.
The grammar rules are:

S  --> NP VP
NP --> N
NP --> D N
VP --> V
VP --> V NP
V  --> ran | bounced | caught
D  --> a | the
N  --> cat | mouse | ball | boy | girl
In a context-free grammar, the grammatical categories are called variables, the words (tokens) are called terminals, the grammar rules are rewriting rules called productions, and the most general grammatical category is called the start symbol. This terminology is restated in Definition 2.2.
Definition 2.2: Context-free grammar. A context-free grammar G is a quadruple
G = (V, T, P, S) where
V is a finite set of variable symbols,
T is a finite set of terminal symbols disjoint from V,
P is a finite set of rewriting rules (productions) of the form

A --> w  where A in V, w in (V union T)*

S is an element of V called the start symbol.
Grammars may be used to generate the sentences of a language. Given a string w of the form

w = uxv

the production x --> y is applicable to this string since x appears in the string. The production allows us to replace x with y, obtaining the string

z = uyv

and we say that w derives z. This is written as

w ==> z

If

w1 ==> w2 ==> ... ==> wn

we say that w1 derives wn and write

w1 ==>* wn
The set of sentences of a language is derived from the start symbol of the grammar. Definition 2.3 formalizes these ideas.

Definition 2.3: Let G be a grammar. Then the set

L(G) = {w in T* | S ==>* w}

is the language generated by G.
A language L is context-free iff there is a context-free grammar G such that L = L(G).
If w in L(G), then the sequence
S ==> w1 ==> w2 ==> ... ==> wn ==> w is a derivation of the sentence w and the wi are called sentential forms.
Using the grammar G0 the sentence the cat caught the mouse can be generated as follows:
This derivation is performed in a leftmost manner; that is, in each step the leftmost variable in the sentential form is replaced.

S ==> NP VP
  ==> D N VP
  ==> the N VP
  ==> the cat VP
  ==> the cat V NP
  ==> the cat caught NP
  ==> the cat caught D N
  ==> the cat caught the N
  ==> the cat caught the mouse
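The leftmost derivation above can be replayed mechanically. In this Python sketch (the dictionary encoding of G0 and the helper leftmost_step are our own), each step replaces the leftmost variable in the sentential form:

```python
# Grammar G0 as a dictionary from each variable to its right-hand sides.
G0 = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "V":  [["ran"], ["bounced"], ["caught"]],
    "D":  [["a"], ["the"]],
    "N":  [["cat"], ["mouse"], ["ball"], ["boy"], ["girl"]],
}

def leftmost_step(form, variable, rhs):
    """Replace the leftmost variable in `form` (which must be `variable`)
    with the chosen right-hand side `rhs`."""
    i = next(k for k, sym in enumerate(form) if sym in G0)
    assert form[i] == variable and rhs in G0[variable]
    return form[:i] + rhs + form[i + 1:]

# Replay the leftmost derivation of "the cat caught the mouse".
form = ["S"]
for var, rhs in [("S", ["NP", "VP"]), ("NP", ["D", "N"]), ("D", ["the"]),
                 ("N", ["cat"]), ("VP", ["V", "NP"]), ("V", ["caught"]),
                 ("NP", ["D", "N"]), ("D", ["the"]), ("N", ["mouse"])]:
    form = leftmost_step(form, var, rhs)
    print(" ".join(form))
```

The printed sentential forms match the derivation steps listed above.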
Sometimes a derivation is more readable if it is displayed in the form of a derivation tree.
The notion of a tree-based derivation is formalized in Definition 2.5.

            S
          /   \
        NP     VP
       /  \   /  \
      D    N  V   NP
      |    |  |  /  \
    the  cat  | D    N
              | |    |
         caught the mouse
Definition 2.5: Let G = (V, T, P, S) be a context-free grammar. A derivation tree has the following properties.
- The root is labeled S.
- Every interior vertex has a label from V.
- If a vertex has label A in V, and its children are labeled (from left to right) a1, ..., an, then P must contain a production of the form A --> a1...an.
- Every leaf has a label from T union {lambda}.
In the generation example we chose to rewrite the leftmost nonterminal first. When there are two or more leftmost derivations of a string in a given grammar or, equivalently, there are two distinct derivation trees for the same sentence, the grammar is said to be ambiguous. In some instances, ambiguity may be eliminated by selecting another grammar for the language or by adding rules which may not be context-free. Definition 2.6 defines ambiguity in terms of derivation trees.
Definition 2.6: A context-free grammar G is said to be ambiguous if there exists some w in L(G) which has two distinct derivation trees.
Figure 2.2: G1, an expression grammar

V = { E }
T = { c, id, +, *, (, ) }
P = { E --> c, E --> id, E --> (E), E --> E + E, E --> E * E }
S = E
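That G1 is ambiguous can be checked by counting derivation trees directly. A brute-force Python sketch (the function trees and its encoding of the productions are ours):

```python
from functools import lru_cache

# Count the distinct derivation trees by which E derives a token sequence
# under G1's productions: E --> c | id | (E) | E + E | E * E.
def trees(tokens):
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):  # trees deriving tokens[lo:hi] from E
        n = 0
        if hi - lo == 1 and tokens[lo] in ("c", "id"):
            n += 1                                 # E --> c | id
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            n += count(lo + 1, hi - 1)             # E --> ( E )
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):            # E --> E + E | E * E
                n += count(lo, k) * count(k + 1, hi)
        return n

    return count(0, len(tokens))

print(trees(["id", "+", "id", "*", "id"]))   # 2 -- so G1 is ambiguous
```

The two trees for id + id * id correspond to grouping the expression as (id + id) * id or as id + (id * id).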
We assume that c and id stand for any constants and identifiers respectively. Concrete syntax is concerned with the hierarchical relationships and the particular symbols used. The main point of abstract syntax is to omit the details of physical representation, specifying the pure structure of the language by specifying the logical relations between parts of the language. A grammar defining the abstract syntax of arithmetic expressions is grammar G2 in Figure 2.3.
Figure 2.3: G2, an abstract expression grammar

V = { E }
T = { c, id, add, mult }
P = { E --> c, E --> id, E --> add E E, E --> mult E E }
S = E
The terminal symbols are names for classes of objects.
The key difference in the use of concrete and abstract grammars is best illustrated by comparing the derivation tree and the abstract syntax tree for the expression id + (id * id). The derivation tree for the concrete grammar is just what we would expect
while the abstract syntax tree for the abstract grammar is quite different.

        E
      / | \
     E  +  E
     |    / | \
    id   (  E  )
          / | \
         E  *  E
         |     |
        id     id
In a derivation tree for an abstract grammar, the internal nodes are labeled with the operator, the operands are their children, and there are no concrete symbols in the tree. Abstract syntax trees are used by compilers as an intermediate representation of the program.

      add
     /   \
   id    mult
         /  \
       id    id
Concrete syntax defines the way programs are written while abstract syntax describes the pure structure of a program by specifying the logical relation between parts of the program. Abstract syntax is important when we are interested in understanding the meaning of a program (its semantics) and when translating a program to machine code.
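As a sketch of the distinction, abstract syntax trees can be represented as nested tuples following G2; the function to_concrete (our name) recovers a fully parenthesized concrete form:

```python
# Abstract syntax trees as nested tuples, following grammar G2:
# E --> c | id | add E E | mult E E
ast = ("add", "id", ("mult", "id", "id"))   # abstract tree for id + (id * id)

def to_concrete(e):
    """Render an abstract tree as a fully parenthesized concrete expression."""
    if isinstance(e, str):
        return e
    op, left, right = e
    symbol = {"add": "+", "mult": "*"}[op]
    return "(" + to_concrete(left) + " " + symbol + " " + to_concrete(right) + ")"

print(to_concrete(ast))   # (id + (id * id))
```

Going the other way, from concrete text to an abstract tree, is exactly the job of a parser.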
There are two approaches to parsing. The parser can begin with the start symbol of the grammar and attempt to generate the same sentence that it is attempting to recognize or it can try to match the input to the right-hand side of the productions building a derivation tree in reverse. The first approach is called top-down parsing and the second, bottom-up parsing.
Figure 2.4 illustrates top-down parsing by displaying both the parse tree and the remaining unrecognized input. The input is scanned from left to right, one token at a time.
Figure 2.4: Top-down Parse

PARSE TREE FRONTIER           UNRECOGNIZED INPUT
S                             the cat caught the mouse
NP VP                         the cat caught the mouse
D N VP                        the cat caught the mouse
the N VP                      cat caught the mouse
the cat VP                    caught the mouse
the cat V NP                  caught the mouse
the cat caught NP             the mouse
the cat caught D N            the mouse
the cat caught the N          mouse
the cat caught the mouse
Figure 2.5 illustrates bottom-up parsing by displaying both the parse tree and the remaining unrecognized input. Note that the parse tree is constructed upside down, i.e., the parse tree is built in reverse.
Figure 2.5: Bottom-up Parse

PARSE TREE FRONTIER           UNRECOGNIZED INPUT
                              the cat caught the mouse
the                           cat caught the mouse
D                             cat caught the mouse
D cat                         caught the mouse
D N                           caught the mouse
NP                            caught the mouse
NP caught                     the mouse
NP V                          the mouse
NP V the                      mouse
NP V D                        mouse
NP V D mouse
NP V D N
NP V NP
NP VP
S
Each line represents a step in the parsing sequence. Input tokens are shifted from the input to the parse tree when the parser is unable to reduce branches of the tree to a variable.
An alternate approach is to construct a top-down table-driven parser, which consists of a driver routine, a stack and the grammar (usually stored in tabular form). The driver repeatedly pops the top of the stack, pushing the right-hand side of a production when the top is a variable and consuming an input token when the top matches the current input. The following trace illustrates the method:
STACK       INPUT       RULE/ACTION
E]          id+id*id]   pop & push using E --> E+E
E+E]        id+id*id]   pop & push using E --> id
id+E]       id+id*id]   pop & consume
+E]         +id*id]     pop & consume
E]          id*id]      pop & push using E --> E*E
E*E]        id*id]      pop & push using E --> id
id*E]       id*id]      pop & consume
*E]         *id]        pop & consume
E]          id]         pop & push using E --> id
id]         id]         pop & consume
]           ]           accept
The trace shows the contents of the stack and the remaining input at each step of the parse.
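The pop-and-push discipline of the trace can be written as a small driver. In this Python sketch the sequence of production choices is supplied explicitly, standing in for the parse table (the oracle style and all names are ours):

```python
# Top-down stack driver for G1; `choices` supplies, in order, the
# right-hand side to push each time the variable E is on top.
def topdown(tokens, choices):
    stack = ["E"]                 # stack[0] is the top
    inp = list(tokens)
    choices = iter(choices)
    while stack:
        top = stack.pop(0)
        if top == "E":                      # variable: pop & push a chosen rhs
            stack = list(next(choices)) + stack
        elif inp and top == inp[0]:
            inp.pop(0)                      # terminal: pop & consume
        else:
            return False                    # mismatch
    return not inp                          # accept on empty stack and input

rules = [["E", "+", "E"], ["id"], ["E", "*", "E"], ["id"], ["id"]]
print(topdown(["id", "+", "id", "*", "id"], rules))  # True
```

The rule list here reproduces exactly the choices made in the trace above.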
A third alternative is to construct a bottom-up table-driven parser, which consists of a driver routine, a stack and a grammar stored in tabular form. The driver either shifts the next input token onto the stack or reduces a right-hand side on top of the stack to its left-hand-side variable. The following trace illustrates the method:
STACK       INPUT       RULE/ACTION
]           id+id*id]   Shift
id]         +id*id]     Reduce using E --> id
E]          +id*id]     Shift
+E]         id*id]      Shift
id+E]       *id]        Reduce using E --> id
E+E]        *id]        Shift
*E+E]       id]         Shift
id*E+E]     ]           Reduce using E --> id
E*E+E]      ]           Reduce using E --> E*E
E+E]        ]           Reduce using E --> E+E
E]          ]           Accept
The trace shows the contents of the stack and the remaining input at each step of the parse.
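The shift-reduce discipline can be sketched the same way; again the shift/reduce decisions are supplied explicitly where a real parser would consult a table (all names are ours):

```python
# Shift-reduce driver for G1 on id+id*id; each action is either
# ("shift", None) or ("reduce", rhs) with rhs on top of the stack.
def shift_reduce(tokens, actions):
    stack, inp = [], list(tokens)
    for act, rhs in actions:
        if act == "shift":
            stack.append(inp.pop(0))
        else:                        # reduce rhs to the variable E
            assert stack[-len(rhs):] == rhs
            stack[-len(rhs):] = ["E"]
    return stack == ["E"] and not inp

script = [("shift", None), ("reduce", ["id"]),
          ("shift", None), ("shift", None), ("reduce", ["id"]),
          ("shift", None), ("shift", None), ("reduce", ["id"]),
          ("reduce", ["E", "*", "E"]),
          ("reduce", ["E", "+", "E"])]
print(shift_reduce(["id", "+", "id", "*", "id"], script))  # True
```

The script reproduces the shift and reduce steps of the trace above, ending with the single variable E on the stack.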
In these examples, the choice of which production to use may appear to be magical. In the case of a top-down parser, grammar G1 should be rewritten to remove the ambiguity. For bottom-up parsers, there are techniques for analyzing the grammar to produce a set of unambiguous choices for productions. Such techniques are beyond the scope of this text.
This observation leads us to the notion of push-down automata. A push-down automaton has an input that it scans from left to right, a stack, and a finite control that governs the operations of reading the input and of pushing data onto and popping data off the stack. The following is a formal definition of a push-down automaton.
A push-down automaton M is a 7-tuple (Q, Sigma, Tau, delta, q0, Z0, F) where

Q is a finite set of states,
Sigma is a finite alphabet called the input alphabet,
Tau is a finite alphabet called the stack alphabet,
delta is a transition function from Q × (Sigma union {e}) × Tau to finite subsets of Q × Tau*,
q0 in Q is the initial state,
Z0 in Tau is the start symbol, and
F, a subset of Q, is the set of accepting states.
Example: palindromes

program (StartState, Input, [])

t(push, [], [])      = accept             // empty input
t(push, [x|I], S)    = (pop, I, S)        // center, odd-length palindrome
t(push, [x|I], S)    = (pop, I, [x|S])    // center, even-length palindrome
t(push, [x|I], S)    = (push, I, [x|S])   // left side
t(pop, [x|I], [x|S]) = (pop, I, S)        // right side
t(pop, [], [])       = accept
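The nondeterministic choice in the push state can be simulated by backtracking. A Python sketch of the palindrome recognizer above (function names are ours):

```python
# Simulate the PDA by trying, at each symbol in the push state, all three
# choices: keep pushing, or treat this symbol as the odd or even centre.
def accepts(s):
    def run(state, i, stack):
        if i == len(s):
            return not stack                      # accept: empty input & stack
        x = s[i]
        if state == "push":
            return (run("pop", i + 1, stack)          # centre, odd length
                    or run("pop", i + 1, stack + [x]) # centre, even length
                    or run("push", i + 1, stack + [x]))  # still on the left
        # pop state: the input must match the top of the stack
        return bool(stack) and stack[-1] == x and run("pop", i + 1, stack[:-1])
    return run("push", 0, [])

print([w for w in ["", "a", "aba", "abba", "abab"] if accepts(w)])
# ['', 'a', 'aba', 'abba']
```

The recursion explores every possible guess for the centre of the palindrome, accepting exactly when some guess succeeds.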
Applications of PDAs are explored in the exercises.
Regular Expression E   Language Denoted L(E)
ø                      ø; the empty set
lambda                 {lambda}; the language consisting of the empty string
a                      {a}; the language consisting of just a
(E · F)                {uv | u in L(E) and v in L(F)}; concatenation
(E | F)                {u | u in L(E) or u in L(F)}; alternation, the union of L(E) and L(F)
(E*)                   {u1 u2 ... un | ui in L(E), 1 <= i <= n, n >= 0}; any finite sequence from E
Integers and identifiers may be defined using regular expressions as follows (where D stands for any digit and A for any letter):

integer    = D D*
identifier = A (A | D)*

A scanner is a program which groups the characters of an input stream into a sequence of tokens. Scanners based on regular expressions are easy to write. Lex is a scanner generator often packaged with the UNIX environment. A user creates a file containing regular expressions and Lex creates a program to search for tokens defined by those regular expressions. Text editors use regular expressions to search for and replace text. The UNIX grep command uses regular expressions to search for text.
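A minimal scanner for the two token classes can be sketched with Python's re module, taking D to be any digit and A any letter (the token names and the whitespace rule are our additions):

```python
import re

# integer = D D*, identifier = A (A | D)*; whitespace is skipped.
TOKEN = re.compile(r"(?P<integer>\d\d*)|(?P<identifier>[A-Za-z][A-Za-z0-9]*)|\s+")

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("illegal character at position " + str(pos))
        if m.lastgroup:                      # whitespace matches have no group
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("x1 42 rate"))
# [('identifier', 'x1'), ('integer', '42'), ('identifier', 'rate')]
```

A scanner generator such as Lex produces essentially this kind of loop, but compiles the regular expressions into a finite state automaton rather than matching them interpretively.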
A finite state automaton or fsa is defined by the quintuple

M = (Q, Sigma, delta, q0, F), where

Q is a finite set of internal states,
Sigma is a finite set of symbols called the input alphabet,
delta: Q × Sigma --> 2^Q is a total function called the transition function,
q0 in Q is the initial state, and
F, a subset of Q, is the set of final states.
Example: identifiers

program (StartState, Input)

t : C --> C

Allowed transitions:
t(s, [])    -- accept, s in FinalStates
t(s, [x|I]) = (s', I)    -- consume input
t(s, I)     = (s', I)    -- epsilon move

For identifiers (where i stands for any letter and d for any digit):

t(start, [i|I]) = (ad, I)
t(ad, [i|I])    = (ad, I)
t(ad, [d|I])    = (ad, I)
t(ad, [])       -- accept

The transition function delta is defined on a state and an input symbol. It can be extended to a function delta* on strings of input symbols as follows:

delta*(q, lambda) = {q}
delta*(q, wa)     = the union of the sets delta(p, a) for p in delta*(q, w)
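The identifier automaton can be run by table lookup. In this Python sketch we take i to stand for any letter and d for any digit (that reading, and all names, are our assumptions):

```python
# Simulate the identifier automaton; kind() classifies a concrete
# character as the input symbol i (letter) or d (digit).
def kind(c):
    return "i" if c.isalpha() else "d" if c.isdigit() else c

delta = {("start", "i"): "ad", ("ad", "i"): "ad", ("ad", "d"): "ad"}
final_states = {"ad"}

def recognizes(s):
    state = "start"
    for c in s:
        state = delta.get((state, kind(c)))
        if state is None:              # no transition defined: reject
            return False
    return state in final_states

print([w for w in ["x", "x1", "1x", ""] if recognizes(w)])  # ['x', 'x1']
```

Strings beginning with a digit are rejected because no transition leaves the start state on a digit, and the empty string is rejected because start is not a final state.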
Theorem: For every nondeterministic finite state automaton there is a deterministic finite state automaton that accepts the same language.

Proof (sketch): Let M = (S, Sigma, t, s0, F) be a nondeterministic fsa. Construct the deterministic fsa M' = (S', Sigma, t', [s0], F') where

S' is the set of all subsets of S; an element of S' is denoted by [q1, ..., qm];
t' is defined by t'([q1, ..., qm], a) = [p1, ..., pn], where {p1, ..., pn} is the union of the sets t(qi, a); and
F' is the set of all states of S' that contain an accepting state of M.

The proof is completed by induction on the length of the input string.
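The subset construction in the proof can be sketched directly; here t is a dictionary from (state, symbol) to a set of states, and the example NFA (our own) accepts strings over {a, b} ending in ab:

```python
from itertools import chain

# Subset construction: the DFA's states are frozensets of NFA states.
def nfa_to_dfa(t, start, accepting, alphabet):
    start_set = frozenset([start])
    states, trans, work = {start_set}, {}, [start_set]
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(chain.from_iterable(t.get((q, a), ()) for q in S))
            trans[(S, a)] = T
            if T not in states:          # a newly discovered DFA state
                states.add(T)
                work.append(T)
    finals = {S for S in states if S & set(accepting)}
    return states, trans, start_set, finals

# Hypothetical example NFA: accepts strings over {a, b} ending in "ab".
t = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
states, trans, s0, finals = nfa_to_dfa(t, "q0", {"q2"}, "ab")
print(len(states), len(finals))  # 3 1
```

Only the subsets reachable from [q0] are constructed, so the DFA typically has far fewer than 2^|S| states.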
Theorem: If r is a regular expression then there exists an NFA that
accepts L(r).
Proof (Sketch) The formal proof is by induction on the number
of operators in the expression. For the base cases (empty set, empty string
and singleton set) the corresponding fsa is simple to define. The more
complex cases are handled by merging states of simpler fsa.
Theorem: If L is accepted by a DFA then L is denoted by a regular expression.
The proof is beyond the scope of this text.
Figure 2.7: Tabular representation of a transition function.

state/input   I0    ...   In
S0            S00   ...   S0n
...           ...   ...   ...
Sm            Sm0   ...   Smn
In a procedural representation, each state is a procedure. Transitions to the next state occur when the procedure representing the next state is ``called'':

procedure StateS(I : input)
   case I of
   ...
   Ci : get input I;
        StateT(I)
   ...
   end

Alternatively, the states and transitions may be encoded in nested case statements within a single loop:

State := Start;
repeat
   get input I
   case State of
   ...
   Si : case I of
        ...
        Ci : State := Sj;
        ...
        end
   ...
   end
until empty input and accepting state

In the table-driven implementation, the transition function is encoded in a two-dimensional array. One index is the current state, the other is the current input; the array elements are states:

state := start;
while state != final do
   get input I;
   state := table[state, I]

The implementations are incomplete since they do not contain code to deal with the end of input.
Many languages are designed to make compilation easy. The goal is to provide a syntax so that the compiler need make only one pass over the program. This requirement means that, with minor exceptions, each constant, type, variable, procedure and function must be defined before it is referenced. The trade-off is between slightly increased syntactic complexity of the language, with some increase in the burden on the programmer, and a simpler compiler.
Some specific syntactical issues include:
In a statement sequence such as

while e do ;
S

the first semicolon terminates the empty statement following the do, and hence the while statement; S is not in the body of the while statement.
The choice of = for assignment in FORTRAN and C/C++/Java is unfortunate, since assignment is different from equality.

Assignment:
:=     Pascal and Ada
=      FORTRAN, C/C++/Java
<--    APL

Equality:
.EQ.   FORTRAN
==     C/C++/Java
<Expression> ::= <Identifier> | <Number> | <Expression> <Op> <Expression> | ( <Expression> )
<Op>         ::= + | - | * | /
<Identifier> ::= <Letter>
<Identifier> ::= <Identifier> <Letter>
<Number>     ::= <Digit>
<Number>     ::= <Number> <Digit>
<Letter>     ::= A | ... | Z
<Digit>      ::= 0 | ... | 9
Some additional extensions include the use of braces, {E}, or ellipses, E..., to indicate zero or more repetitions of an item and brackets, [E], to indicate an optional item.
Figure 2.8 contains a context-free grammar for a simple imperative programming language.
Figure 2.8: Context-free grammar for Simple

program          ::= LET definitions IN command_sequence END
definitions      ::= e | INTEGER id_seq IDENTIFIER .
id_seq           ::= e | id_seq IDENTIFIER ,
command_sequence ::= e | command_sequence command ;
command          ::= SKIP
                   | READ IDENTIFIER
                   | WRITE exp
                   | IDENTIFIER := exp
                   | IF exp THEN command_sequence ELSE command_sequence FI
                   | WHILE bool_exp DO command_sequence END
exp              ::= exp + term | exp - term | term
term             ::= term * factor | term / factor | factor
factor           ::= factor ^ primary | primary
primary          ::= NUMBER | IDENT | ( exp )
bool_exp         ::= exp = exp | exp < exp | exp > exp
The same transition notation describes a machine that can move in both directions over its input, as in a Turing machine; the configuration records the symbols to the left, the current symbol, and the symbols to the right:

t : C --> C

Allowed transitions:
t(s, I)  -- accept, s in FinalStates
t(s, I) = (s', I)  -- epsilon move
t(s, (Left, x, [y|Right])) = (s', ([x'|Left], y, Right))  -- write x', move right
t(s, ([x|Left], y, Right)) = (s', (Left, x, [y'|Right]))  -- write y', move left
For top-down parsing, the left recursion in the expression grammar must be removed; the standard transformation yields:

exp     ::= term exp'
exp'    ::= + term exp' | - term exp' | epsilon
term    ::= factor term'
term'   ::= * factor term' | / factor term' | epsilon
factor  ::= primary factor'
factor' ::= ^ primary factor' | epsilon
primary ::= INT | IDENT | ( exp )
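A recursive-descent parser falls out of the transformed grammar: each nonterminal becomes a procedure, and the primed tail rules become loops. This Python sketch (tuple-based trees and all names are ours) builds left-associative trees, matching the associativity of the original left-recursive grammar:

```python
def parse(tokens):
    toks = list(tokens) + ["$"]        # "$" marks end of input
    pos = 0

    def peek():
        return toks[pos]

    def eat(t):
        nonlocal pos
        assert toks[pos] == t, "expected " + t
        pos += 1

    def exp():                          # exp ::= term exp'
        left = term()
        while peek() in ("+", "-"):     # exp' ::= + term exp' | - term exp' | epsilon
            op = peek(); eat(op)
            left = (op, left, term())
        return left

    def term():                         # term ::= factor term'
        left = factor()
        while peek() in ("*", "/"):
            op = peek(); eat(op)
            left = (op, left, factor())
        return left

    def factor():                       # factor ::= primary factor'
        left = primary()
        while peek() == "^":
            eat("^")
            left = ("^", left, primary())
        return left

    def primary():                      # primary ::= INT | IDENT | ( exp )
        if peek() == "(":
            eat("("); e = exp(); eat(")")
            return e
        t = peek(); eat(t)
        return t

    tree = exp()
    eat("$")
    return tree

print(parse(["id", "+", "id", "*", "id"]))  # ('+', 'id', ('*', 'id', 'id'))
```

Because term is parsed by exp before the + loop runs, multiplication binds more tightly than addition; the grammar's layering does the disambiguation that the ambiguous grammar G1 could not.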