Keywords and phrases: Regular expression, regular grammar, context-free grammar, parse tree, ambiguity, BNF, context sensitivity, attribute grammar, inherited and synthesized attributes, scanner, lexical analysis, parser, static semantics.
Syntax is concerned with the structure of programs; layout, with their appearance. The syntactic elements of a programming language are determined by the computation model and by pragmatic concerns. There are well-developed tools (regular, context-free and attribute grammars) for the description of the syntax of programming languages. Grammars are sets of rewriting rules and may be used both to recognize and to generate programs. Grammars are independent of computational models and are useful for the description of the structure of languages in general.
Context-free grammars are used to describe the bulk of the language's structure; regular expressions are used to describe the lexical units (tokens); attribute grammars are used to describe the context sensitive portions of the language. Attribute grammars are described in a later chapter.
Definition 2.1: Alphabet and Language
- Sigma: An alphabet Sigma is a nonempty, finite set of symbols.
- L: A language L over an alphabet Sigma is a collection of strings of elements of Sigma. The empty string lambda is the string with no symbols at all.
- Sigma*: The set of all possible finite strings of elements of Sigma is denoted by Sigma*. Lambda is an element of Sigma*.
A string is a finite sequence of symbols from an alphabet Sigma. The concatenation of two strings v and w is the string vw obtained by appending the string w to the right of the string v.
Programming languages require two levels of description; the lowest level is that of a token. The tokens of a programming language are the keywords, identifiers, constants and other symbols appearing in the language. In the program

void main() { printf("Hello World\n"); }

the tokens are

void, main, (, ), {, printf, (, "Hello World\n", ), ;, }

The alphabet for the language of the lexical tokens is the character set, while the alphabet for a programming language is the set of lexical tokens.
A string in a language L is called a sentence and is an element of Sigma*. Thus a language L is a subset of Sigma*. Sigma+ is the set of all possible nonempty strings of Sigma, so Sigma+ = Sigma* - { lambda }. A token is a sentence in the language for tokens and a program is a sentence in the language of programs.
If L0 and L1 are languages, then L0L1 denotes the language {xy | x is in L0, and y is in L1 }. That is L0L1 consists of all possible concatenations of a string from L0 followed by a string from L1.
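For finite languages the definition of concatenation can be checked directly. A minimal Python sketch (the function name concat is ours):

```python
# L0L1 = { xy | x in L0, y in L1 } for finite languages represented as sets.
def concat(L0, L1):
    return {x + y for x in L0 for y in L1}

L0 = {"a", "ab"}
L1 = {"b", ""}   # the empty string lambda is a legitimate member
print(sorted(concat(L0, L1)))  # ['a', 'ab', 'abb'] -- 'ab'+'b' and 'a'+'b' coincide with other pairs
```

Note that the result can have fewer than |L0| * |L1| elements, since different pairs may concatenate to the same string.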
A grammar consists of a finite collection of grammatical categories (e.g. noun phrase, verb phrase, article, noun, verb), individual words (elements of the alphabet), rules describing the order in which elements of the grammatical categories must appear, and a most general grammatical category. Figure 2.1 contains a context-free grammar for a fragment of English.
Figure 2.1: G0, a grammar for a fragment of English

The grammatical categories are: S, NP, VP, D, N, V.
The words are: a, the, cat, mouse, ball, boy, girl, ran, bounced, caught.
The most general category is S, a sentence.
The grammar rules are:

S  --> NP VP
NP --> N
NP --> D N
VP --> V
VP --> V NP
V  --> ran | bounced | caught
D  --> a | the
N  --> cat | mouse | ball | boy | girl
In a context-free grammar, the grammatical categories are called variables, the words (tokens) are called terminals, the grammar rules are rewriting rules called productions, and the most general grammatical category is called the start symbol. This terminology is restated in Definition 2.2.
Definition 2.2: Context-free grammar. A context-free grammar G is a quadruple
G = (V, T, P, S) where
V is a finite set of variable symbols,
T is a finite set of terminal symbols disjoint from V,
P is a finite set of rewriting rules (productions) of the form

A --> w  where A in V, w in (V union T)*

S is an element of V called the start symbol.
Grammars may be used to generate the sentences of a language. Given a string w of the form

w = uxv

the production x --> y is applicable to this string since x appears in the string. The production allows us to replace x with y, obtaining the string

z = uyv

and we say that w derives z. This is written as

w ==> z

If

w1 ==> w2 ==> ... ==> wn

we say that w1 derives wn and write

w1 ==>* wn
The set of sentences of a language is derived from the start symbol of the grammar. Definition 2.3 formalizes these ideas.

Definition 2.3: Let G be a grammar. Then the set

L(G) = {w in T* | S ==>* w}

is the language generated by G.
A language L is context-free iff there is a context-free grammar G such that L = L(G).
If w in L(G), then the sequence
S ==> w1 ==> w2 ==> ... ==> wn ==> w is a derivation of the sentence w and the wi are called sentential forms.
Using the grammar G0 the sentence the cat caught the mouse can be generated as follows:
This derivation is performed in a leftmost manner; that is, in each step the leftmost variable in the sentential form is replaced.

S ==> NP VP
  ==> D N VP
  ==> the N VP
  ==> the cat VP
  ==> the cat V NP
  ==> the cat caught NP
  ==> the cat caught D N
  ==> the cat caught the N
  ==> the cat caught the mouse
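The leftmost derivation above can be replayed mechanically. In this Python sketch (the dictionary encoding of G0 and the helper leftmost_step are our own), each step replaces the leftmost variable in the sentential form:

```python
# Grammar G0 as a dictionary from each variable to its right-hand sides.
G0 = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "V":  [["ran"], ["bounced"], ["caught"]],
    "D":  [["a"], ["the"]],
    "N":  [["cat"], ["mouse"], ["ball"], ["boy"], ["girl"]],
}

def leftmost_step(form, variable, rhs):
    """Replace the leftmost variable in `form` (which must be `variable`)
    with the chosen right-hand side `rhs`."""
    i = next(k for k, sym in enumerate(form) if sym in G0)
    assert form[i] == variable and rhs in G0[variable]
    return form[:i] + rhs + form[i + 1:]

# Replay the leftmost derivation of "the cat caught the mouse".
form = ["S"]
for var, rhs in [("S", ["NP", "VP"]), ("NP", ["D", "N"]), ("D", ["the"]),
                 ("N", ["cat"]), ("VP", ["V", "NP"]), ("V", ["caught"]),
                 ("NP", ["D", "N"]), ("D", ["the"]), ("N", ["mouse"])]:
    form = leftmost_step(form, var, rhs)
    print(" ".join(form))
```

The printed sentential forms match the derivation steps listed above.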
Sometimes a derivation is more readable if it is displayed in the form of a derivation tree.
The notion of a tree-based derivation is formalized in Definition 2.5.

            S
          /   \
        NP     VP
       /  \   /  \
      D    N  V   NP
      |    |  |  /  \
    the  cat  | D    N
              | |    |
         caught the mouse
Definition 2.5: Let G = (V, T, P, S) be a context-free grammar. A derivation tree has the following properties.
- The root is labeled S.
- Every interior vertex has a label from V.
- If a vertex has label A in V, and its children are labeled (from left to right) a1, ..., an, then P must contain a production of the form A --> a1...an.
- Every leaf has a label from T union {lambda}.
In the generation example we chose to rewrite the leftmost nonterminal first. When there are two or more leftmost derivations of a string in a given grammar or, equivalently, there are two distinct derivation trees for the same sentence, the grammar is said to be ambiguous. In some instances, ambiguity may be eliminated by selecting another grammar for the language or by adding rules which may not be context-free. Definition 2.6 defines ambiguity in terms of derivation trees.
Definition 2.6: A context-free grammar G is said to be ambiguous if there exists some w in L(G) which has two distinct derivation trees.
Figure 2.2: G1, an expression grammar

V = { E }
T = { c, id, +, *, (, ) }
P = { E --> c, E --> id, E --> (E), E --> E + E, E --> E * E }
S = E
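That G1 is ambiguous can be checked by counting derivation trees directly. A brute-force Python sketch (the function trees and its encoding of the productions are ours):

```python
from functools import lru_cache

# Count the distinct derivation trees by which E derives a token sequence
# under G1's productions: E --> c | id | (E) | E + E | E * E.
def trees(tokens):
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):  # trees deriving tokens[lo:hi] from E
        n = 0
        if hi - lo == 1 and tokens[lo] in ("c", "id"):
            n += 1                                 # E --> c | id
        if hi - lo >= 3 and tokens[lo] == "(" and tokens[hi - 1] == ")":
            n += count(lo + 1, hi - 1)             # E --> ( E )
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):            # E --> E + E | E * E
                n += count(lo, k) * count(k + 1, hi)
        return n

    return count(0, len(tokens))

print(trees(["id", "+", "id", "*", "id"]))   # 2 -- so G1 is ambiguous
```

The two trees for id + id * id correspond to grouping the expression as (id + id) * id or as id + (id * id).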
We assume that c and id stand for any constants and identifiers respectively. Concrete syntax is concerned with the hierarchical relationships and the particular symbols used. The main point of abstract syntax is to omit the details of physical representation, specifying the pure structure of the language by specifying the logical relations between parts of the language. A grammar defining the abstract syntax of arithmetic expressions is grammar G2 in Figure 2.3.
Figure 2.3: G2, an abstract expression grammar

V = { E }
T = { c, id, add, mult }
P = { E --> c, E --> id, E --> add E E, E --> mult E E }
S = E
The terminal symbols are names for classes of objects.
The key difference in the use of concrete and abstract grammars is best illustrated by comparing the derivation tree and the abstract syntax tree for the expression id + (id * id). The derivation tree for the concrete grammar is just what we would expect
while the abstract syntax tree for the abstract grammar is quite different.

        E
      / | \
     E  +  E
     |    / | \
    id   (  E  )
          / | \
         E  *  E
         |     |
        id     id
In a derivation tree for an abstract grammar, the internal nodes are labeled with the operator, the operands are their children, and there are no concrete symbols in the tree. Abstract syntax trees are used by compilers as an intermediate representation of the program.

      add
     /   \
   id    mult
         /  \
       id    id
Concrete syntax defines the way programs are written while abstract syntax describes the pure structure of a program by specifying the logical relation between parts of the program. Abstract syntax is important when we are interested in understanding the meaning of a program (its semantics) and when translating a program to machine code.
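As a sketch of the distinction, abstract syntax trees can be represented as nested tuples following G2; the function to_concrete (our name) recovers a fully parenthesized concrete form:

```python
# Abstract syntax trees as nested tuples, following grammar G2:
# E --> c | id | add E E | mult E E
ast = ("add", "id", ("mult", "id", "id"))   # abstract tree for id + (id * id)

def to_concrete(e):
    """Render an abstract tree as a fully parenthesized concrete expression."""
    if isinstance(e, str):
        return e
    op, left, right = e
    symbol = {"add": "+", "mult": "*"}[op]
    return "(" + to_concrete(left) + " " + symbol + " " + to_concrete(right) + ")"

print(to_concrete(ast))   # (id + (id * id))
```

Going the other way, from concrete text to an abstract tree, is exactly the job of a parser.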
There are two approaches to parsing. The parser can begin with the start symbol of the grammar and attempt to generate the same sentence that it is attempting to recognize or it can try to match the input to the right-hand side of the productions building a derivation tree in reverse. The first approach is called top-down parsing and the second, bottom-up parsing.
Figure 2.4 illustrates top-down parsing by displaying both the parse tree and the remaining unrecognized input. The input is scanned from left to right, one token at a time.
Figure 2.4: Top-down Parse

PARSE TREE FRONTIER           UNRECOGNIZED INPUT
S                             the cat caught the mouse
NP VP                         the cat caught the mouse
D N VP                        the cat caught the mouse
the N VP                      cat caught the mouse
the cat VP                    caught the mouse
the cat V NP                  caught the mouse
the cat caught NP             the mouse
the cat caught D N            the mouse
the cat caught the N          mouse
the cat caught the mouse
Figure 2.5 illustrates bottom-up parsing by displaying both the parse tree and the remaining unrecognized input. Note that the parse tree is constructed upside down, i.e., the parse tree is built in reverse.
Figure 2.5: Bottom-up Parse

PARSE TREE FRONTIER           UNRECOGNIZED INPUT
                              the cat caught the mouse
the                           cat caught the mouse
D                             cat caught the mouse
D cat                         caught the mouse
D N                           caught the mouse
NP                            caught the mouse
NP caught                     the mouse
NP V                          the mouse
NP V the                      mouse
NP V D                        mouse
NP V D mouse
NP V D N
NP V NP
NP VP
S
Each line represents a step in the parsing sequence. Input tokens are shifted from the input to the parse tree when the parser is unable to reduce branches of the tree to a variable.
An alternate approach is to construct a top-down table-driven parser, which consists of a driver routine, a stack and the grammar (usually stored in tabular form). The driver repeatedly pops the top of the stack, pushing the right-hand side of a production when the top is a variable and consuming an input token when the top matches the current input. The following trace illustrates the method:
STACK       INPUT       RULE/ACTION
E]          id+id*id]   pop & push using E --> E+E
E+E]        id+id*id]   pop & push using E --> id
id+E]       id+id*id]   pop & consume
+E]         +id*id]     pop & consume
E]          id*id]      pop & push using E --> E*E
E*E]        id*id]      pop & push using E --> id
id*E]       id*id]      pop & consume
*E]         *id]        pop & consume
E]          id]         pop & push using E --> id
id]         id]         pop & consume
]           ]           accept
The trace shows the contents of the stack and the remaining input at each step of the parse.
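The pop-and-push discipline of the trace can be written as a small driver. In this Python sketch the sequence of production choices is supplied explicitly, standing in for the parse table (the oracle style and all names are ours):

```python
# Top-down stack driver for G1; `choices` supplies, in order, the
# right-hand side to push each time the variable E is on top.
def topdown(tokens, choices):
    stack = ["E"]                 # stack[0] is the top
    inp = list(tokens)
    choices = iter(choices)
    while stack:
        top = stack.pop(0)
        if top == "E":                      # variable: pop & push a chosen rhs
            stack = list(next(choices)) + stack
        elif inp and top == inp[0]:
            inp.pop(0)                      # terminal: pop & consume
        else:
            return False                    # mismatch
    return not inp                          # accept on empty stack and input

rules = [["E", "+", "E"], ["id"], ["E", "*", "E"], ["id"], ["id"]]
print(topdown(["id", "+", "id", "*", "id"], rules))  # True
```

The rule list here reproduces exactly the choices made in the trace above.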
A third alternative is to construct a bottom-up table-driven parser, which consists of a driver routine, a stack and a grammar stored in tabular form. The driver either shifts the next input token onto the stack or reduces a right-hand side on top of the stack to its left-hand-side variable. The following trace illustrates the method:
STACK       INPUT       RULE/ACTION
]           id+id*id]   Shift
id]         +id*id]     Reduce using E --> id
E]          +id*id]     Shift
+E]         id*id]      Shift
id+E]       *id]        Reduce using E --> id
E+E]        *id]        Shift
*E+E]       id]         Shift
id*E+E]     ]           Reduce using E --> id
E*E+E]      ]           Reduce using E --> E*E
E+E]        ]           Reduce using E --> E+E
E]          ]           Accept
The trace shows the contents of the stack and the remaining input at each step of the parse.
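The shift-reduce discipline can be sketched the same way; again the shift/reduce decisions are supplied explicitly where a real parser would consult a table (all names are ours):

```python
# Shift-reduce driver for G1 on id+id*id; each action is either
# ("shift", None) or ("reduce", rhs) with rhs on top of the stack.
def shift_reduce(tokens, actions):
    stack, inp = [], list(tokens)
    for act, rhs in actions:
        if act == "shift":
            stack.append(inp.pop(0))
        else:                        # reduce rhs to the variable E
            assert stack[-len(rhs):] == rhs
            stack[-len(rhs):] = ["E"]
    return stack == ["E"] and not inp

script = [("shift", None), ("reduce", ["id"]),
          ("shift", None), ("shift", None), ("reduce", ["id"]),
          ("shift", None), ("shift", None), ("reduce", ["id"]),
          ("reduce", ["E", "*", "E"]),
          ("reduce", ["E", "+", "E"])]
print(shift_reduce(["id", "+", "id", "*", "id"], script))  # True
```

The script reproduces the shift and reduce steps of the trace above, ending with the single variable E on the stack.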
In these examples, the choice of which production to use may appear to be magical. In the case of a top-down parser, grammar G1 should be rewritten to remove the ambiguity. For bottom-up parsers, there are techniques for analyzing the grammar to produce a set of unambiguous choices for productions. Such techniques are beyond the scope of this text.
This observation leads us to the notion of push-down automata. A push-down automaton has an input that it scans from left to right, a stack, and a finite control that governs the operations of reading the input and of pushing data onto and popping data off the stack. The following is a formal definition of a push-down automaton.
A push-down automaton M is a 7-tuple (Q, Sigma, Tau, delta, q0, Z0, F) where

Q is a finite set of states,
Sigma is a finite alphabet called the input alphabet,
Tau is a finite alphabet called the stack alphabet,
delta is a transition function from Q × (Sigma union {e}) × Tau to finite subsets of Q × Tau*,
q0 in Q is the initial state,
Z0 in Tau is the start symbol, and
F, a subset of Q, is the set of accepting states.
Example: palindromes

program (StartState, Input, [])

t(push, [], [])      = accept             // empty input
t(push, [x|I], S)    = (pop, I, S)        // center, odd-length palindrome
t(push, [x|I], S)    = (pop, I, [x|S])    // center, even-length palindrome
t(push, [x|I], S)    = (push, I, [x|S])   // left side
t(pop, [x|I], [x|S]) = (pop, I, S)        // right side
t(pop, [], [])       = accept
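The nondeterministic choice in the push state can be simulated by backtracking. A Python sketch of the palindrome recognizer above (function names are ours):

```python
# Simulate the PDA by trying, at each symbol in the push state, all three
# choices: keep pushing, or treat this symbol as the odd or even centre.
def accepts(s):
    def run(state, i, stack):
        if i == len(s):
            return not stack                      # accept: empty input & stack
        x = s[i]
        if state == "push":
            return (run("pop", i + 1, stack)          # centre, odd length
                    or run("pop", i + 1, stack + [x]) # centre, even length
                    or run("push", i + 1, stack + [x]))  # still on the left
        # pop state: the input must match the top of the stack
        return bool(stack) and stack[-1] == x and run("pop", i + 1, stack[:-1])
    return run("push", 0, [])

print([w for w in ["", "a", "aba", "abba", "abab"] if accepts(w)])
# ['', 'a', 'aba', 'abba']
```

The recursion explores every possible guess for the centre of the palindrome, accepting exactly when some guess succeeds.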
Applications of PDAs are explored in the exercises.
Regular Expression E   Language Denoted L(E)
ø                      ø; the empty set
lambda                 {lambda}; the language consisting of the empty string
a                      {a}; the language consisting of just a
(E · F)                {uv | u in L(E) and v in L(F)}; concatenation
(E | F)                {u | u in L(E) or u in L(F)}; alternation, the union of L(E) and L(F)
(E*)                   {u1 u2 ... un | ui in L(E), 1 <= i <= n, n >= 0}; any finite sequence from E
Integers and identifiers may be defined using regular expressions as follows (where D stands for any digit and A for any letter):

integer    = D D*
identifier = A (A | D)*

A scanner is a program which groups the characters of an input stream into a sequence of tokens. Scanners based on regular expressions are easy to write. Lex is a scanner generator often packaged with the UNIX environment. A user creates a file containing regular expressions and Lex creates a program to search for tokens defined by those regular expressions. Text editors use regular expressions to search for and replace text. The UNIX grep command uses regular expressions to search for text.
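A minimal scanner for the two token classes can be sketched with Python's re module, taking D to be any digit and A any letter (the token names and the whitespace rule are our additions):

```python
import re

# integer = D D*, identifier = A (A | D)*; whitespace is skipped.
TOKEN = re.compile(r"(?P<integer>\d\d*)|(?P<identifier>[A-Za-z][A-Za-z0-9]*)|\s+")

def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("illegal character at position " + str(pos))
        if m.lastgroup:                      # whitespace matches have no group
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(scan("x1 42 rate"))
# [('identifier', 'x1'), ('integer', '42'), ('identifier', 'rate')]
```

A scanner generator such as Lex produces essentially this kind of loop, but compiles the regular expressions into a finite state automaton rather than matching them interpretively.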
A finite state automaton or fsa is defined by the quintuple

M = (Q, Sigma, delta, q0, F), where

Q is a finite set of internal states,
Sigma is a finite set of symbols called the input alphabet,
delta: Q × Sigma --> 2^Q is a total function called the transition function,
q0 in Q is the initial state, and
F, a subset of Q, is the set of final states.
Example: identifiers

program (StartState, Input)

t : C --> C

Allowed transitions:
t(s, [])    -- accept, s in FinalStates
t(s, [x|I]) = (s', I)    -- consume input
t(s, I)     = (s', I)    -- epsilon move

For identifiers (where i stands for any letter and d for any digit):

t(start, [i|I]) = (ad, I)
t(ad, [i|I])    = (ad, I)
t(ad, [d|I])    = (ad, I)
t(ad, [])       -- accept

The transition function delta is defined on a state and an input symbol. It can be extended to a function delta* on strings of input symbols as follows:

delta*(q, lambda) = {q}
delta*(q, wa)     = the union of the sets delta(p, a) for p in delta*(q, w)
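The identifier automaton can be run by table lookup. In this Python sketch we take i to stand for any letter and d for any digit (that reading, and all names, are our assumptions):

```python
# Simulate the identifier automaton; kind() classifies a concrete
# character as the input symbol i (letter) or d (digit).
def kind(c):
    return "i" if c.isalpha() else "d" if c.isdigit() else c

delta = {("start", "i"): "ad", ("ad", "i"): "ad", ("ad", "d"): "ad"}
final_states = {"ad"}

def recognizes(s):
    state = "start"
    for c in s:
        state = delta.get((state, kind(c)))
        if state is None:              # no transition defined: reject
            return False
    return state in final_states

print([w for w in ["x", "x1", "1x", ""] if recognizes(w)])  # ['x', 'x1']
```

Strings beginning with a digit are rejected because no transition leaves the start state on a digit, and the empty string is rejected because start is not a final state.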
Theorem: For every nondeterministic finite state automaton there is a deterministic finite state automaton that accepts the same language.

Proof (sketch): Let M = (S, Sigma, t, s0, F) be a nondeterministic fsa. Construct the deterministic fsa M' = (S', Sigma, t', [s0], F') where

S' is the set of all subsets of S; an element of S' is denoted by [q1, ..., qm];
t' is defined by t'([q1, ..., qm], a) = [p1, ..., pn], where {p1, ..., pn} is the union of the sets t(qi, a); and
F' is the set of all states of S' that contain an accepting state of M.

The proof is completed by induction on the length of the input string.
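The subset construction in the proof can be sketched directly; here t is a dictionary from (state, symbol) to a set of states, and the example NFA (our own) accepts strings over {a, b} ending in ab:

```python
from itertools import chain

# Subset construction: the DFA's states are frozensets of NFA states.
def nfa_to_dfa(t, start, accepting, alphabet):
    start_set = frozenset([start])
    states, trans, work = {start_set}, {}, [start_set]
    while work:
        S = work.pop()
        for a in alphabet:
            T = frozenset(chain.from_iterable(t.get((q, a), ()) for q in S))
            trans[(S, a)] = T
            if T not in states:          # a newly discovered DFA state
                states.add(T)
                work.append(T)
    finals = {S for S in states if S & set(accepting)}
    return states, trans, start_set, finals

# Hypothetical example NFA: accepts strings over {a, b} ending in "ab".
t = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
states, trans, s0, finals = nfa_to_dfa(t, "q0", {"q2"}, "ab")
print(len(states), len(finals))  # 3 1
```

Only the subsets reachable from [q0] are constructed, so the DFA typically has far fewer than 2^|S| states.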
Theorem: If r is a regular expression then there exists an NFA that
accepts L(r).
Proof (Sketch) The formal proof is by induction on the number
of operators in the expression. For the base cases (empty set, empty string
and singleton set) the corresponding fsa is simple to define. The more
complex cases are handled by merging states of simpler fsa.
Theorem: If L is accepted by a DFA then L is denoted by a regular expression.
The proof is beyond the scope of this text.
Figure 2.7: Tabular representation of a transition function.

state/input   I0    ...   In
S0            S00   ...   S0n
...           ...   ...   ...
Sm            Sm0   ...   Smn
In a procedural representation, each state is a procedure. Transitions to the next state occur when the procedure representing the next state is ``called'':

procedure StateS(I : input)
   case I of
   ...
   Ci : get input I;
        StateT(I)
   ...
   end

Alternatively, the states and transitions may be encoded in nested case statements within a single loop:

State := Start;
repeat
   get input I
   case State of
   ...
   Si : case I of
        ...
        Ci : State := Sj;
        ...
        end
   ...
   end
until empty input and accepting state

In the table-driven implementation, the transition function is encoded in a two-dimensional array. One index is the current state, the other is the current input; the array elements are states:

state := start;
while state != final do
   get input I;
   state := table[state, I]

The implementations are incomplete since they do not contain code to deal with the end of input.
Many languages are designed to make compilation easy. The goal is to provide a syntax so that the compiler need make only one pass over the program. This requirement means that, with minor exceptions, each constant, type, variable, procedure and function must be defined before it is referenced. The trade-off is between slightly increased syntactic complexity of the language, with some increase in the burden on the programmer, and a simpler compiler.
Some specific syntactical issues include:
In a statement sequence such as

while e do ;
S

the first semicolon terminates the empty statement following the do, and hence the while statement; S is not in the body of the while statement.
The choice of = for assignment in FORTRAN and C/C++/Java is unfortunate, since assignment is different from equality.

Assignment:
:=     Pascal and Ada
=      FORTRAN, C/C++/Java
<--    APL

Equality:
.EQ.   FORTRAN
==     C/C++/Java
<Expression> ::= <Identifier> | <Number> | <Expression> <Op> <Expression> | ( <Expression> )
<Op>         ::= + | - | * | /
<Identifier> ::= <Letter>
<Identifier> ::= <Identifier> <Letter>
<Number>     ::= <Digit>
<Number>     ::= <Number> <Digit>
<Letter>     ::= A | ... | Z
<Digit>      ::= 0 | ... | 9
Some additional extensions include the use of braces, {E}, or ellipses, E..., to indicate zero or more repetitions of an item and brackets, [E], to indicate an optional item.
Figure 2.8 contains a context-free grammar for a simple imperative programming language.
Figure 2.8: Context-free grammar for Simple

program          ::= LET definitions IN command_sequence END
definitions      ::= e | INTEGER id_seq IDENTIFIER .
id_seq           ::= e | id_seq IDENTIFIER ,
command_sequence ::= e | command_sequence command ;
command          ::= SKIP
                   | READ IDENTIFIER
                   | WRITE exp
                   | IDENTIFIER := exp
                   | IF exp THEN command_sequence ELSE command_sequence FI
                   | WHILE bool_exp DO command_sequence END
exp              ::= exp + term | exp - term | term
term             ::= term * factor | term / factor | factor
factor           ::= factor ^ primary | primary
primary          ::= NUMBER | IDENT | ( exp )
bool_exp         ::= exp = exp | exp < exp | exp > exp
The same transition notation describes a machine that can move in both directions over its input, as in a Turing machine; the configuration records the symbols to the left, the current symbol, and the symbols to the right:

t : C --> C

Allowed transitions:
t(s, I)  -- accept, s in FinalStates
t(s, I) = (s', I)  -- epsilon move
t(s, (Left, x, [y|Right])) = (s', ([x'|Left], y, Right))  -- write x', move right
t(s, ([x|Left], y, Right)) = (s', (Left, x, [y'|Right]))  -- write y', move left
For top-down parsing, the left recursion in the expression grammar must be removed; the standard transformation yields:

exp     ::= term exp'
exp'    ::= + term exp' | - term exp' | epsilon
term    ::= factor term'
term'   ::= * factor term' | / factor term' | epsilon
factor  ::= primary factor'
factor' ::= ^ primary factor' | epsilon
primary ::= INT | IDENT | ( exp )
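A recursive-descent parser falls out of the transformed grammar: each nonterminal becomes a procedure, and the primed tail rules become loops. This Python sketch (tuple-based trees and all names are ours) builds left-associative trees, matching the associativity of the original left-recursive grammar:

```python
def parse(tokens):
    toks = list(tokens) + ["$"]        # "$" marks end of input
    pos = 0

    def peek():
        return toks[pos]

    def eat(t):
        nonlocal pos
        assert toks[pos] == t, "expected " + t
        pos += 1

    def exp():                          # exp ::= term exp'
        left = term()
        while peek() in ("+", "-"):     # exp' ::= + term exp' | - term exp' | epsilon
            op = peek(); eat(op)
            left = (op, left, term())
        return left

    def term():                         # term ::= factor term'
        left = factor()
        while peek() in ("*", "/"):
            op = peek(); eat(op)
            left = (op, left, factor())
        return left

    def factor():                       # factor ::= primary factor'
        left = primary()
        while peek() == "^":
            eat("^")
            left = ("^", left, primary())
        return left

    def primary():                      # primary ::= INT | IDENT | ( exp )
        if peek() == "(":
            eat("("); e = exp(); eat(")")
            return e
        t = peek(); eat(t)
        return t

    tree = exp()
    eat("$")
    return tree

print(parse(["id", "+", "id", "*", "id"]))  # ('+', 'id', ('*', 'id', 'id'))
```

Because term is parsed by exp before the + loop runs, multiplication binds more tightly than addition; the grammar's layering does the disambiguation that the ambiguous grammar G1 could not.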