About Global Documents
Global Documents provides you with documents from around the globe on a variety of topics for your enjoyment.
Global Documents utilizes edocr for all its document needs due to edocr's wonderful content features. Thousands of professionals and businesses around the globe publish marketing, sales, operations, customer service and financial documents making it easier for prospects and customers to find content.
Sohail Aslam
Compiler Construction Notes
1
Lecture 1
Course Organization
The course is organized around theory and significant amount of practice. The practice
will be in the form of home works and a project. The project is the highlight of the
course: you will build a full compiler for subset of Java like language. The
implementation will in C++ and you will generate Intel x86 assembly language code. The
project will be done in six parts; each will be a programming assignment.
The grade distribution will be
The primary text for the course is Compilers – Principles, Techniques and Tools by Aho,
Sethi and Ullman. This is also called the Dragon Book; here is the image on the cover of
the book:
Theory Homeworks
10%
Exams
50%
Practice Project
40%
Sohail Aslam
Compiler Construction Notes
2
Why Take this Course
There are number of reason for why you should take this course. Let’s go through a few
Reason #1: understand compilers and languages
We all have used one or computer languages to write programs. We have used compilers
to “compile” our code and eventually turn it into an executable. While we worry about
data structures, algorithms and all the functionality that our application is supposed to
provide, we perhaps overlook the programming language, the structure of the code and
the language semantics. In this course, we will attempt to understand the code structure,
understand language semantics and understand relation between source code and
generated machine code. This will allow you to become a better programmer
Reason #2: nice balance of theory and practice
We have studied a lot of theory in various CS courses related to languages and grammar.
We have covered mathematical models: regular expressions, automata, grammars and
graph algorithms that use these models. We will now have an opportunity to put this
theory into practice by building a real compiler.
Reason #3: programming experience
Creating a compiler entails writing a large computer program which manipulates
complex data structures and implement sophisticated algorithm. In the process, we will
learn more about C++ and Intel x86 assembly language. The experience, however, will
be applicable if we desire to use another programming language, say Java, and generate
code for architecture other than Intel.
What are Compilers
Compilers translate information from one representation to another. Thus, a tool that
translates, say, Russian into English could be labeled as a compiler. In this course,
however, information = program in a computer language. In this context, we will talk of
compilers such as VC, VC++, GCC, JavaC FORTRAN, Pascal, VB. Application that
convert, for example, a Word file to PDF or PDF to Postscript will be called
“translators”. In this course we will study typical compilation: from programs written in
high level languages to lowlevel object code and machine code.
Sohail Aslam
Compiler Construction Notes
3
Typical Compilation
Consider the source code of C function
int expr( int n )
{
int d;
d = 4*n*n*(n+1)*(n+1);
return d;
}
Expressing an algorithm or a task in C is optimized for human readability and
comprehension. This presentation matches human notions of grammar of a programming
language. The function uses named constructs such as variables and procedures which aid
human readability. Now consider the assembly code that the C compiler gcc generates
for the Intel platform:
.globl _expr
_expr:
pushl %ebp
movl %esp,%ebp
subl $24,%esp
movl 8(%ebp),%eax
movl %eax,%edx
leal 0(,%edx,4),%eax
movl %eax,%edx
imull 8(%ebp),%edx
movl 8(%ebp),%eax
incl %eax
imull %eax,%edx
movl 8(%ebp),%eax
incl %eax
imull %eax,%edx
movl %edx,4(%ebp)
movl 4(%ebp),%edx
movl %edx,%eax
jmp L2
.align 4
L2:
leave
ret
The assembly code is optimized for hardware it is to run on. The code consists of
machine instructions, uses registers and unnamed memory locations. This version is
much harder to understand by humans
Sohail Aslam
Compiler Construction Notes
4
Issues in Compilation
The translation of code from some human readable form to machine code must be
“correct”, i.e., the generated machine code must execute precisely the same computation
as the source code. In general, there is no unique translation from source language to a
destination language. No algorithm exists for an “ideal translation”.
Translation is a complex process. The source language and generated code are very
different. To manage this complex process, the translation is carried out in multiple
passes.
Sohail Aslam
Compiler Construction Notes
1
Lecture 2
Twopass Compiler
The figure above shows the structure of a twopass compiler. The front end maps legal
source code into an intermediate representation (IR). The back end maps IR into target
machine code. An immediate advantage of this scheme is that it admits multiple front
ends and multiple passes.
The algorithms employed in the front end have polynomial time complexity while
majority of those in the backend are NPcomplete. This makes compiler writing a
challenging task.
Let us look at the details of the front and back ends.
Front End
scanner
parser
source
code
tokens
IR
errors
The front end recognizes legal and illegal programs presented to it. When it encounters
errors, it attempts to report errors in a useful way. For legal programs, front end produces
IR and preliminary storage map for the various data structures declared in the program.
The front end consists of two modules:
1. Scanner
2. Parser
Sohail Aslam
Compiler Construction Notes
2
Scanner
The scanner takes a program as input and maps the character stream into “words” that are
the basic unit of syntax. It produces pairs – a word and its part of speech. For example,
the input
x = x + y
becomes
We call the pair “
, new, while, if.
Parser
The parser takes in the stream of tokens, recognizes context free syntax and reports
errors. It guides contextsensitive (“semantic”) analysis for tasks like type checking. The
parser builds IR for source program.
The syntax of most programming languages is specified using ContextFree Grammars
(CFG). Context free syntax is specified with a grammar G=(S,N,T,P) where
• S is the start symbol
• N is a set of nonterminal symbols
• T is set of terminal symbols or words
• P is a set of productions or rewrite rules
For example, the ContextFree Grammar for arithmetic expressions is
1.
goal
?
expr
2.
expr ?
expr op term
3.

term
4.
term
?
number
5.

id
6.
op
?
+
7.

–
For this CFG,
S = goal
Sohail Aslam
Compiler Construction Notes
3
T = { number, id, +, }
N = { goal, expr, term, op}
P = { 1, 2, 3, 4, 5, 6, 7}
Given a CFG, we can derive sentences by repeated substitution. Consider the sentence
x + 2 – y
Production
Result
goal
1: goal ? expr
expr
2: expr ? expr op term
expr op term
5: term ? id
expr op y
7: op ? –
expr – y
2: expr ? expr op term
expr op term – y
4: term ? number
expr op 2 – y
6: op ? +
expr + 2 – y
3: expr ? term
term + 2 – y
5: term ? id
x + 2 – y
To recognize a valid sentence in some CFG, we reverse this process and build up a parse,
thus the name “parser”.
Sohail Aslam
Compiler Construction Notes
1
Lecture 3
A parse can be represented by a tree: parse tree or syntax tree. For example, here is the
parse tree for the expression x+2y
The parse tree captures all rewrite during the derivation. The derivation can be extracted
by starting at the root of the tree and working towards the leaf nodes.
Abstract Syntax Trees
The parse tree contains a lot of unneeded information. Compilers often use an abstract
syntax tree (AST). For example, the AST for the above parse tree is
This is much more concise; AST summarizes grammatical structure without the details of
derivation. ASTs are one kind of intermediate representation (IR).
The Back End
–
+
Sohail Aslam
Compiler Construction Notes
2
The back end of the compiler translates IR into target machine code. It chooses machine
(assembly) instructions to implement each IR operation. The back end ensure
conformance with system interfaces. It decides which values to keep in registers in order
to avoid memory access; memory access is far slower than register access.
The back end is responsible for instruction selection so as to produce fast and compact
code. Modern processors have a rich instruction set. The back end takes advantage of
target features such as addressing modes. Usually, instruction selection is viewed as a
pattern matching problem that can be solved by dynamic programming based algorithms.
Instruction selection in compilers was spurred by the advent of the VAX11 which had a
CISC (Complex Instruction Set Computer) architecture. The VAX11 had a large
instruction set that the compiler could work with.
Sohail Aslam
Compiler Construction Notes
1
Lecture 4
CISC architecture provided a rich set of instructions and addressing modes but it made
the job of the compiler harder when it came to generate efficient machine code. The
RISC architecture simplified this problem.
Register Allocation
Registers in a CPU play an important role for providing high speed access to operands.
Memory access is an order of magnitude slower than register access. The back end
attempts to have each operand value in a register when it is used. However, the back end
has to manage a limited set of resources when it comes to the register file. The number of
registers is small and some registers are preallocated for specialize used, e.g., program
counter, and thus are not available for use to the back end. Optimal register allocation is
NPComplete.
Instruction Scheduling
Modern processors have multiple functional units. The back end needs to schedule
instructions to avoid hardware stalls and interlocks. The generated code should use all
functional units productively. Optimal scheduling is NPComplete in nearly all cases.
Threepass Compiler
There is yet another opportunity to produce efficient translation: most modern compilers
contain three stages. An intermediate stage is used for code improvement or optimization.
The topology of a threepass compiler is shown in the following figure:
Front
End
machine
code
errors
Middle
End
Back
End
IR
IR
The middle end analyzes IR and rewrites (or transforms) IR. Its primary goal is to reduce
running time of the compiled code. This may also improve space usage, power
consumption, etc. The middle end is generally termed the “Optimizer”. Modern
optimizers are structured as a series of passes:
Sohail Aslam
Compiler Construction Notes
2
Opt
1
IR
IR
errors
Opt
2
Opt
n
IR
IR Opt
3
IR
Typical transformations performed by the optimizer are:
• Discover & propagate some constant value
• Move a computation to a less frequently executed place
• Specialize some computation based on context
• Discover a redundant computation & remove it
• Remove useless or unreachable code
• Encode an idiom in some particularly efficient form
Role of Runtime System
The executable code typically runs as a process in an Operating System Environment.
The application will need a number of resources from the OS. For example, dynamic
memory allocation and input output. The process may spawn more processes of threads.
If the underline architecture has multiple processors, the application may want to use
them. Processes communicate with other and may share resources. Compilers need to
have an intimate knowledge of the runtime system to make effective use of the runtime
environment and machine resources. The issues in this context are:
• Memory management
• Allocate and deallocate memory
• Garbage collection
• Runtime type checking
• Error/exception processing
•
Interface to OS – I/O
• Support for parallelism
• Parallel threads
• Communication and synchronization
Sohail Aslam
Compiler Construction Notes
1
Lecture 5
Lexical Analysis
The scanner is the first component of the frontend of a compiler; parser is the second
scanner
parser
source
code
tokens
IR
errors
The task of the scanner is to take a program written in some programming language as a
stream of characters and break it into a stream of tokens. This activity is called lexical
analysis. A token, however, contains more than just the words extracted from the input.
The lexical analyzer partition input string into substrings, called words, and classifies
them according to their role.
Tokens
A token is a syntactic category in a sentence of a language. Consider the sentence:
He wrote the program
of the natural language English. The words in the sentence are: “He”, “wrote”, “the” and
“program”. The blanks between words have been ignored. These words are classified as
subject, verb, object etc. These are the roles. Similarly, the sentence in a programming
language like C:
if(b == 0) a = b
the words are “if”, “(”, “b”, “==”, “0”, “)”, “a”, “=” and “b”. The roles are keyword,
variable, boolean operator, assignment operator. The pair
token. Here are some familiar tokens in programming languages:
•
Identifiers: x y11 maxsize
• Keywords: if else while for
•
Integers: 2 1000 44 5L
• Floats: 2.0 0.0034 1e5
• Symbols: ( ) + * / { } < > ==
• Strings: “enter x” “error”
Sohail Aslam
Compiler Construction Notes
2
Adhoc Lexer
The task of writing a scanner is fairly straight forward. We can handwrite code to
generate tokens. We do this by partitioning the input string by reading lefttoright,
recognizing one token at a time. We will need to lookahead in order to decide where one
token ends and the next token begins. The following C++ code present is template for a
Lexer class. An object of this class can produce the desired tokens from the input
stream.
class Lexer
{
Inputstream s;
char next; //look ahead
Lexer(Inputstream _s)
{
s = _s;
next = s.read();
}
Token nextToken() {
if( idChar(next) )return readId();
if( number(next) )return readNumber();
if( next == ‘”’ ) return readString();
...
...
}
Token readId() {
string id = “”;
while(true){
char c = input.read();
if(idChar(c) == false)
return new Token(TID,id);
id = id + string(c);
}
}
boolean idChar(char c)
{
if( isAlpha(c) ) return true;
if( isDigit(c) ) return true;
if( c == ‘_’ ) return true;
return false;
}
Token readNumber(){
string num = “”;
while(true){
Sohail Aslam
Compiler Construction Notes
3
next = input.read();
if( !isNumber(next))
return new Token(TNUM,num);
num = num+string(next);
}
}
This works ok, however, there are some problem that we need to tackle.
• We do not know what kind of token we are going to read from seeing first
character.
•
If token begins with “i”, is it an identifier “i” or keyword “if”?
•
If token begins with “=”, is it “=” or “==”?
We can extend the Lexer class but there are a number of other issues that can make the
task of handwriting a lexer tedious. We need a more principled approach. The most
frequently used approach is to use a lexer generator that generates efficient tokenizer
automatically.
Sohail Aslam
Compiler Construction Notes
1
Lecture 6
How to Describe Tokens?
Regular Languages are the most popular for specifying tokens because
• These are based on simple and useful theory,
• Are easy to understand and
• Efficient implementations exist for generating lexical analysers based on such
languages.
Languages
Let Σ ?be a set of characters. Σ is called the alphabet. A language over Σ is set of strings
of characters drawn from Σ.? Here are some examples of languages:
• Alphabet = English characters
Language = English sentences
• Alphabet = ASCII
Language = C++, Java, C# programs
Languages are sets of strings (finite sequence of characters). We need some notation for
specifying which sets we want. For lexical analysis we care about regular languages.
Regular languages can be described using regular expressions. Each regular expression is
a notation for a regular language (a set of words). If A is a regular expression, we write
L(A) to refer to language denoted by A.
Regular Expression
A regular expression (RE) is defined inductively
a
ordinary character from Σ
ε
the empty string
RS
either R or S
RS
R followed by S (concatenation)
R*
concatenation of R zero or more times (R* = ε RRRRRR...)
Regular expression extensions are used as convenient notation of complex RE:
R?
ε  R (zero or one R)
R+
RR* (one or more R)
(R)
R (grouping)
[abc] abc (any of listed)
[az]
ab....z (range)
[^ab] cd... (anything but ‘a’‘b’)
Sohail Aslam
Compiler Construction Notes
2
Here are some Regular Expressions and the strings of the language denoted by the RE.
RE
Strings in L(R)
a
“a”
ab
“ab”
ab
“a” “b”
(ab)*
“” “ab” “abab” ...
(aε)b
“ab” “b”
Here are examples of common tokens found in programming languages.
digit
‘0’’1’’2’’3’’4’’5’’6’’7’’8’’9’
integer
digit digit*
identifier
[azAZ_][azAZ09_]*
Finite Automaton
We need mechanism to determine if an input string w belongs to L(R), the language
denoted by regular expression R. Such a mechanism is called an acceptor.
input
string
language
w
L
acceptor
yes, if w ε L
no, if w ε L
The acceptor is based on Finite Automata (FA). A Finite Automaton consists of
• An input alphabet Σ
• A set of states
• A start (initial) state
• A set of transitions
• A set of accepting (final) states
A finite automaton accepts a string if we can follow transitions labeled with characters in
the string from start state to some accepting state. Here are some examples of FA.
• A FA that accepts only “1”
1
• A FA that accepts any number of 1’s followed by a single 0
Sohail Aslam
Compiler Construction Notes
3
0
1
• A FA that accepts ab*a (Σ : {a,b})
a
b
a
Sohail Aslam
Compiler Construction Notes
1
Lecture 7
Table Encoding of FA
A FA can be encoded as a table. This is called a transition table. The following example
shows a FA encoded as a table.
a
b
a
0
1
2
a
b
0
1
err
1
2
1
2
err
err
The rows correspond to states. The characters of the alphabet set Σ appear in columns.
The cells of the table contain the next state. This encoding makes the implementation of
the FA simple and efficient. It is equally simple to simulate or run the FA given an
alphabet and a string of the language and its associated alphabet set Σ. The C++ code
shows such a FA simulator.
int trans_table[NSTATES][NCHARS];
int accept_states[NSTATES];
int state = INITIAL;
while(state != err){
c = input.read();
if(c == EOF ) break;
state=trans_table[state][c];
}
return accept_states[state];
RE ? Finite Automata
We now have a strategy for building lexical analyzer. The tokens we want to recognize
are encoded using regular expressions. If we can build a FA for regular expressions, we
have our lexical analyzer. The question is can we build a finite automaton for every
regular expression? The answer, fortunately, is yes – build FA inductively based on the
definition of Regular Expression (RE).
Sohail Aslam
Compiler Construction Notes
2
The actual algorithm actually builds Nondeterministic Finite Automaton (NFA) from RE.
Each RE is converted into an NFA. NFA are joined together with εmoves. The eventual
NFA is then converted into a Deterministic Finite Automaton (DFA) which can be
encoded as a transition table. Let us discuss how this happens.
Nondeterministic Finite Automaton (NFA)
An NFA can have multiple transitions for one input in a given state. In the following
NFA, and input of 1 can cause the automaton to go to state B or C.
1
1
0
A
B
C
It can also have e–moves; the automaton machine can move from state A to state B
without consuming input.
ε
A
B
The operation of the automaton is not completely defined by input. A NFA can choose
whether to make εmoves and which of multiple transitions to take for a single input. The
acceptance of NFA for a given string is achieved if it can get in a final state.
Deterministic Finite Automaton (NFA)
In Deterministic Finite Automata (DFA), on the other hand, there is only one transition
per input per state. There are no εmoves. Upon execution of the automaton, a DFA can
take only one path through the state graph and is therefore completely determined by
input.
NFAs and DFAs recognize the same set of languages (regular languages). DFAs are
easier to implement – table driven. For a given language, the NFA can be simpler than
the DFA. DFA can be exponentially larger than NFA. NFAs are the key to automating
RE ? DFA construction.
RE ? NFA Construction
The algorithm for RE to DFA conversion is called Thompson’s Construction. The
algorithm appeared in CACM 1968. The algorithm builds an NFA for each RE term. The
NFAs are then combined using εmoves. The NFA is converted into a DFA using the
subset construction procedure. The number of states in the resulting DFA are minimized
using the Hopcroft’s algorithm.
Given a RE, we first create NFA pattern for each symbol and each operator. We then join
them with εmoves in precedence order. Here are examples of such constructions:
Sohail Aslam
Compiler Construction Notes
3
1. NFA for RE ab.
The following figures show the construction steps. NFAs for RE a and RE b are made.
These two are combined using an εmove; the NFA for RE a appears on the left and is
the source of the εtransition.
ε
s0
a
s1
NFA for a
s0
a s1
NFA for ab
s3
b s4
s3
b
s4
NFA for b
ε
s0
a s1
NFA for ab
s3
b s4
2. NFA for RE ab
ε
s0
s5
s1
a s2
NFA for a  b
s3
b s4
ε
ε
ε
3. NFA for RE a*
ε
s0
s4
s1
a s2
NFA for a*
ε
ε
ε
Sohail Aslam
Compiler Construction Notes
4
3. NFA for RE a ( bc )*
ε
s3
s9
s4
s5
s6
s7
s8
s0
s1 s2
a
ε
ε
ε
ε
ε
ε
ε
b
c
ε
Sohail Aslam
Compiler Construction Notes
1
Lecture 8
NFA ? DFA Construction
The algorithm is called subset construction. In the transition table of an NFA, each entry
is a set of states. In DFA, each entry is a single state. The general idea behind NFAto
DFA construction is that each DFA state corresponds to a set of NFA states. The DFA
uses its state to keep track of all possible states the NFA can be in after reading each
input symbol.
We will use the following operations.
• εclosure(T):
set of NFA states reachable from some NFA state s in T on ? transitions alone.
• move(T,a):
set of NFA states to which there is a transition on input a from some NFA state s
in set of states T.
Before it sees the first input symbol, NFA can be in any of the state in the set
εclosure(s0), where s0 is the start state of the NFA. Suppose that exactly the states in set
T are reachable from s0 on a given sequence of input symbols. Let a be the next input
symbol. On seeing a, the NFA can move to any of the states in the set move(T,a). Let a
be the next input symbol. On seeing a, the NFA can move to any of the states in the set
move(T,a). When we allow for εtransitions, NFA can be in any of the states in
εclosure(move(T,a)) after seeing a.
Subset Construction
Algorithm:
Input: NFA N with state set S, alphabet Σ, start state s0, final states F
Output: DFA D with state set S’, alphabet Σ, start states,
s0’ = εclosure(s0), final states F’ and transition table: S’ x Σ ? S’
// initially, eclosure(s0) is the only state in D states S’ and it is unmarked
s0’ = εclosure(s0)
S’ = {s0’ } (unmarked)
while (there is some unmarked state T in S’)
mark state T
for all a in Σ do
U = ? εclosure( move(T,a) );
if U not already in S’
add U as an unmarked state to S’
Dtran(T,a) = U;
end for
end while
Sohail Aslam
Compiler Construction Notes
2
F’:
for each DFA state S
if S contains an NFA final state
mark S as DFA final state
end algorithm
Example
Let us apply the algorithm to the NFA for (a  b )*abb. Σ is {a,b}.
0
ε
ε
ε
ε
ε
a
b
ε
ε
1
2
3
4
5
6
8
9
ε
7 a
b
b
10
The start state of equivalent DFA is εclosure(0), which is A = {0,1,2,4,7}; these are
exactly the states reachable from state 0 via εtransition. The algorithm tells us to mark A
and then compute εclosure(move(A,a))
move(A,a)), is the set of states of NFA that have transition on ‘a’ from members of A.
Only 2 and 7 have such transition, to 3 and 8. So, εclosure(move(A,a)) =
εclosure({3,8}) = {1,2,3,4,6,7,8}. Let B = {1,2,3,4,6,7,8}; thus Dtran[A,a] = B
For input b, among states in A, only 4 has transition on b to 5. Let C = εclosure({5}) =
{1,2,4,5,6,7}. Thus, Dtran[A,b] = C
We continue this process with the unmarked sets B and C, i.e., εclosure(move(B,a)),
εclosure(move(B,b)), εclosure(move(C,a)) and εclosure(move(C,b)) until all sets and
states of DFA are marked. This is certain since there are only 211 (!) different subsets of a
set of 11 states. A set, once marked, is marked forever. Eventually, the 5 sets are:
1. A={0,1,2,4,7}
2. B={1,2,3,4,6,7,8}
3. C={1,2,4,5,6,7}
4. D={1,2,4,5,6,7,9}
5. E={1,2,4,5,6,7,10}
A is start state because it contains state 0 and E is the accepting state because it contains
state 10.
Sohail Aslam
Compiler Construction Notes
3
The subset construction finally yields the following DFA
A
a
b
a
b
b
E
B
D
C
a
b
b
a
a
The Resulting DFA can be encoded as the following transition table
C
B
E
E
B
D
C
B
C
D
B
B
C
B
A
b
a
Input symbol
State
Sohail Aslam
Compiler Construction Notes
1
Lecture 9
DFA Minimization
The generated DFA may have a large number of states. The Hopcroft’s algorithm can be
used to minimize DFA states. The behind the algorithm is to find groups of equivalent
states. All transitions from states in one group G1 go to states in the same group G2.
Construct the minimized DFA such that there is one state for each group of states from
the initial DFA. Here is the minimized version of the DFA created earlier; states A and C
have been merged.
A,C
a
b
b
b
E
B
D
a
b
a
a
We can construct an optimized acceptor with the following structure:
input
string
RE
w
R
yes, if w ε L(R)
no, if w ε L(R)
RE=>NFA
NFA=>DFA
Min. DFA
Simulate
DFA
Lexical Analyzers
Lexical analyzers (scanners) use the same mechanism but they have multiple RE
descriptions for multiple tokens and have a character stream at the input. The lexical
analyzer returns a sequence of matching tokens at the output (or an error) and it always
return the longest matching token.
Sohail Aslam
Compiler Construction Notes
2
Lexical Analyzer Generators
The process of constructing a lexical analyzer can automated. We only need to specify
Regular expressions for tokens and rules for assigning priorities for multiple longest
match cases, e.g, “==” and “=”, “==” is longer.
Two popular lexical analyzer generators are
• Flex : generates lexical analyzer in C or C++. It is more modern version of the
original Lex tool that was part of the AT&T Bell Labs version of Unix.
• Jlex: written in Java. Generates lexical analyzer in Java
Using Flex
We will use for the projects in this course. To use Flex, one has to provide a specification
file as input to Flex. Flex reads this file and produces an output file contains the lexical
analyzer source in C or C++.
The input specification file consists of three sections:
C or C++ and flex definitions
%%
token definitions and actions
%%
user code
The symbols “%%” mark each section. A detailed guide to Flex is included in
supplementary reading material for this course. We will go through a simple example.
The following is the Flex specification file for recognizing tokens found in a C++
function. The file is named “lex.l”; it is customary to use the “.l” extension for Flex
input files.
%{
#include “tokdefs.h”
%}
D
[09]
L
[azAZ_]
id {L}({L}{D})*
%%
"void" {return(TOK_VOID);}
"int" {return(TOK_INT);}
"if" {return(TOK_IF);}
Specification File lex.l
"else" {return(TOK_ELSE);}
"while"{return(TOK_WHILE)};
"<=" {return(TOK_LE);}
Sohail Aslam
Compiler Construction Notes
3
">=" {return(TOK_GE);}
"==" {return(TOK_EQ);}
"!=" {return(TOK_NE);}
{D}+ {return(TOK_INT);}
{id} {return(TOK_ID);}
[\n][\t][ ] ;
%%
The file lex.l includes another file named “tokdefs.h”. The content of
tokdefs.h are
#define TOK_VOID 1
#define TOK_INT 2
#define TOK_IF 3
#define TOK_ELSE 4
#define TOK_WHILE 5
#define TOK_LE 6
#define TOK_GE 7
#define TOK_EQ 8
#define TOK_NE 9
#define TOK_INT 10
#define TOK_ID 111
Flex creates C++ classes that implement the lexical analyzer. The code for these classes
is placed in the Flex’s output file. Here, for example, is the code needed to invoke the
scanner; this is placed in main.cpp:
void main()
{
FlexLexer lex;
int tc = lex.yylex();
while(tc != 0) {
cout << tc << “,” <
}
}
The following commands can be used to generate a scanner executable file in windows.
flex lex.l
g++ –c lex.cpp
g++ –c main.cpp
g++ –o lex.exe lex.o main.o
Sohail Aslam
Compiler Construction Notes
1
Lecture 10
Running the Scanner
Here is the output of the scanner when executed and given the file main.cpp as input,
i.e., the scanner is being asked to provide tokens found in the file main.cpp:
lex
258,main
283,(
284,)
285,{
258,FlexLexer
258,lex
290,;
260,int
258,tc
266,=
258,lex
291,.
258,yylex
283,(
284,)
290,;
263,while
283,(
258,tc
276,!=
257,0
284,)
258,cout
279,<<
258,tc
279,<<
292,","
279,<<
258,lex
291,.
258,YYText
283,(
284,)
279,<<
258,endl
290,;
258,tc
266,=
258,lex
291,.
258,yylex
283,(
284,)
290,;
286,}
Sohail Aslam
Compiler Construction Notes
2
Flex input for C++
As an illustration of the power of Flex, here is the input for generating scanner for C++
compiler.
/*
* ISO C++ lexical analyzer.
* Based on the ISO C++ draft standard of December '96.
*/
%{
#include
#include
#include “tokdefs.h"
int lineno;
static int yywrap(void);
static void skip_until_eol(void);
static void skip_comment(void);
static int check_identifier(const char *);
%}
intsuffix
([uU][lL]?)([lL][uU]?)
fracconst
([09]*\.[09]+)([09]+\.)
exppart
[eE][+]?[09]+
floatsuffix [fFlL]
chartext
([^'])(\\.)
stringtext
([^"])(\\.)
%%
%%
"\n" { ++lineno; }
[\t\f\v\r ]+ { /* Ignore whitespace. */ }
"/*" { skip_comment(); }
"//" { skip_until_eol(); }
"{" { return '{'; }
"<%" { return '{'; }
"}" { return '}'; }
"%>" { return '}'; }
"[" { return '['; }
"<:" { return '['; }
"]" { return ']'; }
":>" { return ']'; }
"(" { return '('; }
")" { return ')'; }
";" { return ';'; }
":" { return ':'; }
"..." { return ELLIPSIS; }
"?" { return '?'; }
"::" { return COLONCOLON; }
"." { return '.'; }
".*" { return DOTSTAR; }
"+" { return '+'; }
"" { return ''; }
"*" { return '*'; }
"/" { return '/'; }
"%" { return '%'; }
"^" { return '^'; }
"xor" { return '^'; }
"&" { return '&'; }
Sohail Aslam
Compiler Construction Notes
3
"bitand" { return '&'; }
"" { return ''; }
"bitor" { return ''; }
"~" { return '~'; }
"compl" { return '~'; }
"!" { return '!'; }
"not" { return '!'; }
"=" { return '='; }
"<" { return '<'; }
">" { return '>'; }
"+=" { return ADDEQ; }
"=" { return SUBEQ; }
"*=" { return MULEQ; }
"/=" { return DIVEQ; }
"%=" { return MODEQ; }
"^=" { return XOREQ; }
"xor_eq" { return XOREQ; }
"&=" { return ANDEQ; }
"and_eq" { return ANDEQ; }
"=" { return OREQ; }
"or_eq" { return OREQ; }
"<<" { return SL; }
">>" { return SR; }
"<<=" { return SLEQ; }
">>=" { return SREQ; }
"==" { return EQ; }
"!=" { return NOTEQ; }
"not_eq" { return NOTEQ; }
"<=" { return LTEQ; }
">=" { return GTEQ; }
"&&" { return ANDAND; }
"and" { return ANDAND; }
"" { return OROR; }
"or" { return OROR; }
"++" { return PLUSPLUS; }
"" { return MINUSMINUS; }
"," { return ','; }
">*" { return ARROWSTAR; }
">" { return ARROW; }
"asm" { return ASM; }
"auto" { return AUTO; }
"bool" { return BOOL; }
"break" { return BREAK; }
"case" { return CASE; }
"catch" { return CATCH; }
"char" { return CHAR; }
"class" { return CLASS; }
"const" { return CONST; }
"const_cast" { return CONST_CAST; }
"continue" { return CONTINUE; }
"default" { return DEFAULT; }
"delete" { return DELETE; }
"do" { return DO; }
"double" { return DOUBLE; }
"dynamic_cast" { return DYNAMIC_CAST; }
"else" { return ELSE; }
"enum" { return ENUM; }
"explicit" { return EXPLICIT; }
"export" { return EXPORT; }
"extern" { return EXTERN; }
"false" { return FALSE; }
"float" { return FLOAT; }
"for" { return FOR; }
Sohail Aslam
Compiler Construction Notes
4
"friend" { return FRIEND; }
"goto" { return GOTO; }
"if" { return IF; }
"inline" { return INLINE; }
"int" { return INT; }
"long" { return LONG; }
"mutable" { return MUTABLE; }
"namespace" { return NAMESPACE; }
"new" { return NEW; }
"operator" { return OPERATOR; }
"private" { return PRIVATE; }
"protected" { return PROTECTED; }
"public" { return PUBLIC; }
"register" { return REGISTER; }
"reinterpret_cast" { return REINTERPRET_CAST; }
"return" { return RETURN; }
"short" { return SHORT; }
"signed" { return SIGNED; }
"sizeof" { return SIZEOF; }
"static" { return STATIC; }
"static_cast" { return STATIC_CAST; }
"struct" { return STRUCT; }
"switch" { return SWITCH; }
"template" { return TEMPLATE; }
"this" { return THIS; }
"throw" { return THROW; }
"true" { return TRUE; }
"try" { return TRY; }
"typedef" { return TYPEDEF; }
"typeid" { return TYPEID; }
"typename" { return TYPENAME; }
"union" { return UNION; }
"unsigned" { return UNSIGNED; }
"using" { return USING; }
"virtual" { return VIRTUAL; }
"void" { return VOID; }
"volatile" { return VOLATILE; }
"wchar_t" { return WCHAR_T; }
"while" { return WHILE; }
[azAZ_][azAZ_09]*
{ return check_identifier(yytext); }
"0"[xX][09afAF]+{intsuffix}? { return INTEGER; }
"0"[07]+{intsuffix}?
{ return INTEGER; }
[09]+{intsuffix}? { return INTEGER; }
{fracconst}{exppart}?{floatsuffix}? { return FLOATING; }
[09]+{exppart}{floatsuffix}?
{ return FLOATING; }
"'"{chartext}*"'" { return CHARACTER; }
"L'"{chartext}*"'" { return CHARACTER; }
"\""{stringtext}*"\""
{ return STRING; }
"L\""{stringtext}*"\""
{ return STRING; }
.
{ fprintf(stderr,
"%d: unexpected character `%c'\n", lineno,
yytext[0]); }
%%
static int
yywrap(void)
Sohail Aslam
Compiler Construction Notes
5
{
return 1;
}
static void
skip_comment(void)
{
int c1, c2;
c1 = input();
c2 = input();
while(c2 != EOF && !(c1 == '*' && c2 == '/'))
{
if (c1 == '\n')
++lineno;
c1 = c2;
c2 = input();
}
}
static void
skip_until_eol(void)
{
int c;
while ((c = input()) != EOF && c != '\n')
;
++lineno;
}
static int
check_identifier(const char *s)
{
/*
* This function should check if `s' is a
* typedef name or a class
* name, or a enum name, ... etc. or
* an identifier.
*/
switch (s[0]) {
case 'D': return TYPEDEF_NAME;
case 'N': return NAMESPACE_NAME;
case 'C': return CLASS_NAME;
case 'E': return ENUM_NAME;
case 'T': return TEMPLATE_NAME;
}
return IDENTIFIER;
}
Parsing
We now move the second module of the frontend: the parser. Recall the frontend
components:
scanner
parser
source
code
tokens
IR
errors
Sohail Aslam
Compiler Construction Notes
6
The parser checks the stream of words (tokens) and their parts of speech for grammatical
correctness. It determines if the input is syntactically well formed. It guides context
sensitive (“semantic”) analysis (type checking). Finally, it builds IR for source program.
Syntactic Analysis
Consider the sentence “He wrote the program”. The structure of the sentence can be
described using grammar syntax of English language.
He
wrote
the program
noun
verb
article noun
subject predicate
object
sentence
The analogy can be carried over to syntax of sentences in a programming language. For
example, an ifstatement has the syntax
if ( b <= 0 ) a = b
bool expr assignment
ifstatement
The parser ensures that sentences of a programming language that make up a program
abide by the syntax of the language. If there are errors, the parser will detect them and
reports them accordingly. Consider the following code segment that contains a number of
syntax errors:
int* foo(int i, int j))
{
for(k=0; i j; )
fi( i > j )
return j;
}
It is clear that a scanner based upon regular expressions will not be able to detect syntax
error.
Lecture 11
Syntactic Analysis
Consider the following C++ function. There are a number of syntax errors present.
1.
int* foo(int i, int j))
2.
{
3.
for(k=0; i j; )
4.
fi( i > j )
5.
return j;
6.
}
Line 1 has extra parenthesis at the end. The boolean expression in the for loop in line 3 is
incorrect. Line 4 has a missing semicolon at the end. All such errors are due to the fact
the function does not abide by the syntax of the C++ language grammar.
Semantic Analysis
Consider the English language sentence “He wrote the computer”. The sentence is
syntactically correct but semantically wrong. The meaning of the sentence is incorrect;
one does not “write” a computer. Issues related to meaning fall under the heading of
semantic analysis. The following C++ function has semantic errors. The type of the local
variable sum has not been declared. The returned value does not match the return value
type of the function (int*). The function is syntactically correct.
int* foo(int i, int j)
{
for(k=0; i < j; j++ )
if( i < j2 )
sum = sum+i
return sum;
}
Role of the Parser
Not all sequences of tokens are program. Parser must distinguish between valid and
invalid sequences of tokens. What we need is an expressive way to describe the syntax of
programs and an acceptor mechanism that determines if input token stream satisfies the
syntax of the programming language. The acceptor mechanism determines if input token
stream satisfies the syntax of a programming language.
Parsing is the process of discovering a derivation for some sentence of a language. The
mathematical model of syntax is represented by a grammar G. The language generated
by the grammar is indicated by L(G). Syntax of most programming languages can be
represented by Context Free Grammars (CFG).
A CFG is a four tuple G=(S,N,T,P)
1. S is the start symbol
2. N is a set of nonterminals
3. T is a set of terminals
4. P is a set of productions
Why aren’t Regular Expressions used to represent syntax? The reasons is that regular
languages do not have enough power to express syntax of programming languages.
Moreover, finite automaton can’t remember number of times it has visited a particular
state.
Consider the following example of CFG
SheepNoise ? SheepNoise baa
 baa
This CFG defines the set of noises sheep make. We can use the SheepNoise grammar to
create sentences of the language. We use the productions as rewriting rules
Rule
Sentential Form

SheepNoise
1
SheepNoise baa
1
SheepNoise baa baa
2
baa baa baa
While it is cute, this example quickly runs out intellectual steam. To explore uses of
CFGs, we need a more complex grammar. Consider the grammar for arithmetic
expressions:
1 expr
?
expr op expr
2

num
3

id
4 op
?
+
5

–
6

*
7

/
Grammar rules in a similar form were first used in the description of the Algol60
programming language. The syntax of C, C++ and Java is derived heavily from Algol60.
The notation was developed by John Backus and adapted by Peter Naur for the Algol60
language report; thus the term BackusNaur Form (BNF)
Let us use the expression grammar to derive the sentence
x – 2 * y
Rule
Sentential Form

expr
1
expr op expr
2
5
1
2
6
3
Such a process of rewrites is called a derivation and the process or discovering a
derivation is called parsing. At each step, we choose a nonterminal to replace. Different
choices can lead to different derivations.
Two derivations are of interest
1. Leftmost: replace leftmost nonterminal (NT) at each step
2. Rightmost: replace rightmost NT at each step
The example on the preceding slides was leftmost derivation. There is also a rightmost
derivation. In both cases we have
expr ? * id – num ∗ id
The two derivations produce different parse trees. The parse trees imply different
evaluation orders!
Lecture 12
Parse Trees
The derivations can be represented in a treelike fashion. The interior nodes contain the
nonterminals used during the derivation
Precedence
These two derivations point out a problem with the grammar. It has no notion of
precedence, or implied order of evaluation. The normal arithmetic rules say that
multiplication has higher precedence than subtraction. To add precedence, create a non
G
E
E
op
E
E
op
E
x
–
2
*
y
Leftmost
derivation
evaluation order
x – ( 2 * y )
G
E
op
E
x
–
E
E
op
E
2
*
y
Rightmost
derivation
evaluation order
(x – 2 ) * y
terminal for each level of precedence. Isolate corresponding part of grammar to force
parser to recognize high precedence subexpressions first. Here is the revised grammar:
This grammar is larger and requires more rewriting to reach some of the terminal
symbols. But it encodes expected precedence. Let’s see how it parses
x – 2 * y
Rule Sentential Form
 Goal
1 Expr
3
expr – term
5
expr – term ∗ factor
9
expr – term ∗
7
expr – factor ∗
8
expr –
4
term –
7
factor –
9
This produces same parse tree under leftmost and rightmost derivations
Id

9
number
?
factor
8
factor

7
term / factor

6
term ∗ factor
?
term
5
term

4
expr – term

3
expr + term
?
expr
2
expr
?
Goal
1
level
two
level
one
Both leftmost and rightmost derivations give the same expression because the grammar
directly encodes the desired precedence.
Ambiguous Grammars
If a grammar has more than one leftmost derivation for a single sentential form, the
grammar is ambiguous. The leftmost and rightmost derivations for a sentential form may
differ, even in an unambiguous grammar. Let’s consider the classic ifthenelse example
Stmt ? if Expr then Stmt
 if Expr then Stmt else Stmt
 … other stmts ….
The following sentential form has two derivations:
if E1 then if E2 then S else S2
G
E
F
T
T
F

–
*
T
E
T
evaluation order
x – ( 2 * y )
The convention in most programming languages is to match the else with the most recent
if.
Production 1, then
Production 2:
if E1 then
if E2 then S1
else S2
E1
if
then
if
then
else
S1
S2
E2
E1
if
then
if
then
else
S1
S2
E2
Production 2, then
Production 1:
if E1 then
if E2 then S1
else S2
We can rewrite grammar to avoid generating the problem and match each else to
innermost unmatched if:
1. Stmt
?
If E then Stmt
2.

If E then WithElse else Stmt
3.
 Assignment
4. WithElse ? If E then WithElse else WithElse
5.
 Assignment
Let derive the following using the rewritten grammar:
if E1 then if E2 then A1 else A2
ContextFree Grammars
We have been using the term context free without explaining why such rules are in fact
“free of context”. The simple reason is that nonterminals appear by themselves to the left
of the arrow in contextfree rules:
A ? α
The rule A ? α says that A may be replaced by α anywhere, regardless of where A
occurs. On the other hand, we could define a context as pair of strings β , γ , such that a
rule would apply only if β occurs before and γ occurs after the nonterminal A. We would
write this as
β A γ ? β α γ
This binds the else controlling A2 to inner if
E1
Stmt
then
else
A1
A2
E2
if Expr
Stmt
Stmt
then
if Expr
Withelse
Such a rule in which α ? ε is called a contextsensitive grammar rule. We would write
this as
β A γ ? β α γ
Such a rule in which α ? ε is called a contextsensitive grammar rule
Parsing Techniques
There are two primary parsing techniques: topdown and bottomup.
Topdown parsers
A topdown parsers starts at the root of the parse tree and grows towards leaves. At each
node, the parser picks a production and tries to match the input. However, the parser may
pick the wrong production in which case it will need to backtrack. Some grammars are
backtrackfree.
Bottomup parsers
A bottomup parser starts at the leaves and grows toward root of the parse tree. As input
is consumed, the parser encodes possibilities in an internal state. The bottomup parser
starts in a state valid for legal first tokens. Bottomup parsers handle a large class of
grammars
Lecture 13
TopDown Parser
A topdown parser starts with the root of the parse tree. The root node is labeled with the
goal (start) symbol of the grammar. The topdown parsing algorithm proceeds as follows:
1. Construct the root node of the parse tree
2. Repeat until the fringe of the parse tree matches input string
a. At a node labeled A, select a production with A on its lhs
b. for each symbol on its rhs, construct the appropriate child
c. When a terminal symbol is added to the fringe and it does not match the
fringe, backtrack
The key is picking right production in step a. That choice should be guided by the input
string. Let’s try parsing using this algorithm using the expression grammar.
x – 2 * y
P Sentential Form
input
 Goal
↑x – 2 * y
1 expr
↑x – 2 * y
2 expr + term
↑x – 2 * y
4
term + term
↑x – 2 * y
7
factor + term
↑x – 2 * y
9
↑x – 2 * y
9
x ↑– 2 * y
This worked well except that “–” does not match “+”. The parser made the wrong choice
of production to use at step 2. The parser must backtrack and use a different production.
This time the “–” and “–” matched. We can advance past “–” to look at “2”. Now, we
need to expand “term”
P Sentential Form
input
 Goal
↑x – 2 * y
1 expr
↑x – 2 * y
3 expr – term
↑x – 2 * y
4
term – term
↑x – 2 * y
7
factor – term
↑x – 2 * y
9
↑x – 2 * y
9
x ↑– 2 * y
The 2’s match but the expansion terminated too soon because there is still unconsumed
input and there are no nonterminals to expand in the sentential form ⇒ Need to
backtrack.
P Sentential Form
input

x – ↑2 * y
5
x – ↑2 * y
7
x – ↑2 * y
8


9

This time the parser met with success. All of the input matched.
Left Recursion
Consider another possible parse:
Parser is using productions but no input is being consumed.
x – 2 ↑* y

x – ↑2 * y
9
x – ↑2 * y
7
x – ↑2 * y

Sentential Form
input
P
↑x – 2 * y
expr +term +term +term +....
2
↑x – 2 * y
expr +term +term +term
2
↑x – 2 * y
expr +term +term
2
expr +term
expr
Goal
Sentential Form
↑x – 2 * y
2
↑x – 2 * y
1
↑x – 2 * y

input
P
Topdown parsers cannot handle leftrecursive grammars. Formally, a grammar is left
recursive if ∃ A ∈ NT such that ∃ a derivation A ⇒* A α , for some string
α ∈ (NT ∪ T)*.
Our expression grammar is left recursive. This can lead to nontermination in a topdown
parser. Nontermination is bad in any part of a compiler! For a topdown parser, any
recursion must be a right recursion. We would like to convert left recursion to right
To remove left recursion, we can transform the grammar. Consider a grammar fragment:
A ? A α
 β
where neither α nor β starts with A . We can rewrite this as:
A ? β A'
A' ? α A'
 ε
where A' is a new nonterminal. This grammar accepts the same language but uses only
right recursion. The expression grammar we have been using contains two cases of left
recursion. Applying the transformation yields
expr
?
term expr'
expr'
?
+ term expr'

– term expr'
term
?
factor term'
term'
?
* factor term'

/ factor term'

ε
These fragments use only right recursion. They retain the original left associativity. A
topdown parser will terminate using them.
Predictive Parsing
If a top down parser picks the wrong production, it may need to backtrack.
Alternative is to lookahead in input and use context to pick the production to use
correctly. How much lookahead is needed? In general, an arbitrarily large amount of
lookahead symbols are required.. Fortunately, large classes of CFGs can be parsed with
limited lookahead. Most programming languages constructs fall in those subclasses
The basic idea in predictive parsing is: given A ? α  β , the parser should be able to
choose between α and β . To accomplish this, the parser needs FIRST and FOLLOW sets.
Definition: FIRST sets: for some rhs α ∈ G, define FIRST(α ) as the set of tokens that
appear as the first symbol in some string that derives from α . That is, x ∈ FIRST(α ) iff
α ⇒∗ x γ, for some γ .
Lecture 14
The LL(1) Property
If A ? α and A ? β both appear in the grammar, we would like
FIRST(α ) ∩ FIRST(β) = ∅
Predictive parsers accept LL(k) grammars. The two LL stand for Lefttoright scan of
input, leftmost derivation. The k stands for number of lookahead tokens of input. The
LL(1) Property allows the parser to make a correct choice with a lookahead of exactly
one symbol! What about εproductions? They complicate the definition of LL(1).
If A ? α and A ? β and ε ∈ FIRST(α) , then we need to ensure that FIRST(β ) is
disjoint from FOLLOW(α ), too.
Definition: FOLLOW(α) is the set of all words in the grammar that can legally appear
after an α.
For a nonterminal X, FOLLOW(X ) is the set of symbols that might follow the
derivation of X. Define FIRST+(α) as FIRST(α) ∪ FOLLOW(α), if ε ∈ FIRST(α),
FIRST(α), otherwise. Then a grammar is LL(1) iff A ? α and A ? β implies
FIRST+(α ) ∩ FIRST+(β) = ∅
Given a grammar that has the is LL(1) property, we can write a simple routine to
recognize each lhs. The code is simple and fast. Consider
A ? β1  β2  β3 ,
which satisfies the LL(1) property FIRST+(α )∩FIRST+(β) = ∅
/* find an A */
if(token ∈ FIRST(β1))
find a β1 and return true
else if(token ∈ FIRST(β 2))
find a β2 and return true
if(token ∈ FIRST(β3))
find a β3 and return true
else error and return false
Grammar with the LL(1) property are called predictive grammars because the parser can
“predict” the correct expansion at each point in the parse. Parsers that capitalize on the
LL(1) property are called predictive parsers. One kind of predictive parser is the
recursive descent parser.
Recursive Descent Parsing
Consider the rightrecursive expression grammar
1 Goal
? expr
2 expr
?
term expr'
3 expr'
?
+ term expr'
4

 term expr'
5

ε
6 term
?
factor term'
7 term'
?
* factor term'
8

/ factor term'
9

ε
10 factor
?
number
11

id
12

( expr )
This leads to a parser with six mutually recursive routines: goal, expr, eprime,
term, tprime and factor. Each recognizes one nonterminal (NT) or terminal (T).
The term descent refers to the direction in which the parse tree is built. Here are some of
these routines written as functions:
Goal() {
token = next_token();
if(Expr() == true && token == EOF)
next compilation step
else {
report syntax error;
return false;
}
}
Expr()
{
if(Term() == false)
return false;
else
return Eprime();
}
Eprime() {
token_type op = next_token();
if( op == PLUS  op == MINUS ) {
if(Term() == false)
return false;
else
return Eprime();
}
}
Functions for other nonterminals Term, Factor and Tprime follow the same pattern.
Recursive Descent in C++
This form of the routines is too procedural. Moreover, there is no convenient way to build
the parse tree. We can use C++ to code the recursive descent parser in an object oriented
manner. We associate a C++ class with each nonterminal symbol. An instantiated object
of a nonterminal class contains pointer to the parse tree. Here are the C++ code for the
nonterminal classes:
class NonTerminal {
public:
NonTerminal(Scanner* sc)
{
s = sc; tree = NULL;
}
virtual ~NonTerminal(){}
virtual bool isPresent()=0;
TreeNode* AST(){
return tree;
}
protected:
Scanner* s;
TreeNode* tree;
}
class Expr:public NonTerminal
{
public:
Expr(Scanner* sc): NonTerminal(sc){ }
virtual bool isPresent();
}
class Eprime:public NonTerminal {
public:
Eprime(Scanner* sc, TreeNode* t): NonTerminal(sc)
{
exprSofar = t;
}
virtual bool isPresent();
protected:
TreeNode* exprSofar;
}
class Term:public NonTerminal
{
public:
Term(Scanner* sc): NonTerminal(sc){ }
virtual bool isPresent();
}
class Tprime:public NonTerminal {
public:
Tprime(Scanner* sc, TreeNode* t): NonTerminal(sc)
{
exprSofar = t;
}
virtual bool isPresent();
protected:
TreeNode* exprSofar;
}
class Factor:public NonTerminal {
public:
Factor(Scanner* sc, TreeNode* t): NonTerminal(sc){ };
virtual bool isPresent();
}
Lecture 15
Let’s consider the implementation of the C++ classes for the nonterminals. We start with
Expr.
bool Expr::isPresent()
{
Term* op1 = new Term(s);
if(!op1>isPresent())
return false;
tree = op1>AST();
Eprime* op2 = new Eprime(s, tree);
if(op2>isPresent())
tree = op2>AST();
return true;
}
bool Eprime::isPresent()
{
int op=s>nextToken();
if(op==PLUS  op==MINUS){
s>advance();
Term* op2=new Term(s);
if(!op2>isPresent())
syntaxError(s);
TreeNode* t2=op2>AST();
tree = new TreeNode(op,exprSofar,t2);
Eprime* op3 = new Eprime(s, tree);
if(op3>isPresent())
tree = op3>AST();
return true;
}
else return false;
}
bool Term::isPresent()
{
Factor* op1 = new Factor(s);
if(!op1>isPresent())
return false;
tree = op1>AST();
Tprime* op2 = new Tprime(s, tree);
if(op2>isPresent())
tree = op2>AST();
return true;
}
bool Tprime::isPresent()
{
int op=s>nextToken();
if(op == MUL  op == DIV){
s>advance();
Factor* op2=new Factor(s);
if(!op2>isPresent())
syntaxError(s);
TreeNode* t2=op2>AST();
tree = new TreeNode(op,exprSofar,t2);
Tprime* op3 = new Tprime(s, tree);
if(op3>isPresent())
tree = op3>AST();
return true;
}
else return false;
}
bool Factor::isPresent()
{
int op=s>nextToken();
if(op == ID  op == NUM)
{
tree = new TreeNode(op,s>tokenValue());
s>advance();
return true;
}
if( op == LPAREN ){
s>advance();
Expr* opr = new Expr(s);
if(!opr>isPresent() )
syntaxError(s);
if(s>nextToken() != RPAREN)
syntaxError(s);
s>advance();
tree = opr>AST();
return true;
}
return false;
}
Lecture 16
Here is the output trace for the expression : 2+4*6
>> Expr::isPresent()
>> Term::isPresent()
>> Factor::isPresent()
token: 2 (257)
<< Factor::isPresent() return true
>> Tprime::isPresent()
token: + (267)
<< Tprime::isPresent() return false
<< Term::isPresent() return true
>> Eprime::isPresent()
token: + (267)
>> Term::isPresent()
>> Factor::isPresent()
token: 4 (257)
<< Factor::isPresent() return true
>> Tprime::isPresent()
token: * (269)
>> Factor::isPresent()
token: 6 (257)
<< Factor::isPresent() return true
>> Tprime::isPresent()
token: (0)
<< Tprime::isPresent() return false
<< Tprime::isPresent() return true
<< Term::isPresent() return true
>> Eprime::isPresent()
token: (0)
<< Eprime::isPresent() return false
<< Eprime::isPresent() return true
<< Expr::isPresent() return true
** AST **
(2+(4*6))
4
6
*
2
+
Nonrecursive Predictive Parsing
It is possible to build a nonrecursive predictive parser. This is done by maintaining an
explicit stack and using a table. Such a parser is called a tabledriven parser. The non
recursive LL(1) parser looks up the production to apply by looking up a parsing table.
The LL(1) table has one dimension for current nonterminal to expand and another
dimension for next token. Each table cell contains one production.
Consider the expression grammar
1 E ? T E'
2 E' ?
+ T E'
3

ε
4 T ? F T'
5 T' ?
* F T'
6

ε
7 F ?
( E )
8

id
Using the table construction algorithm that will be discussed later, we have the predictive
parsing table
id
+
*
(
)
$
E E ? TE'
E ? TE'
E'
E' ? +TE'
E' ? ε E' ? ε
T T ? FT'
T ? FT'
T'
T' ? ε
T ? *FT'
T' ? ε T' ? ε
F F ? id
F ? (E )
The rows are nonterminals and the columns are the terminals of the expression grammar.
The predictive parser uses an explicit stack to keep track of pending nonterminals. It can
thus be implemented without recursion.
a + b $
Predictive
parser
stack
X
Y
Z
$
Parsing table M
input
output
LL(1) Parsing Algorithm
The input buffer contains the string to be parsed; $ is the endof input marker. The stack
contains a sequence of grammar symbols. Initially, the stack contains the start symbol of
the grammar on the top of $. The parser is controlled by a program that behaves as
follows:
The program considers X, the symbol on top of the stack, and a, the current input symbol.
These two symbols, X and a determine the action of the parser. There are three
possibilities.
1. X = a = $, the parser halts and annouces successful completion.
2. X = a ≠ $ the parser pops X off the stack and advances input pointer to next input
symbol.
3. If X is a nonterminal, the program consults entry M[X,a] of parsing table M.
a. If the entry is a production M[X,a] = {X ? UVW }, the parser replaces X on
top of the stack by WVU (with U on top). As output, the parser just prints the
production used: X ? UVW. However, any other code could be executed
here.
b. If M[X,a] =error, the parser calls an error recovery routine
Example: let’s parse the input string
id+id∗id
using the nonrecursive LL(1) parser
Stack
Input
Ouput
$E
id+id∗id$
$E' T
id+id∗id$ E ? TE'
$E' T' F
id+id∗id$ T ? FT'
$E'T' id
id+id∗id$ F ? id
$E' T'
+id∗ id$
$E'
+id∗ id$ T' ? ε
$E' T +
+id∗ id$ E' ? +TE'
$E' T
id∗ id$
$E' T' F
id∗ id$ T ? FT'
$E' T' id
id∗ id$ F ? id
$E' T'
∗id$
$E' T' F ∗
∗id$ T ? ∗FT'
$E' T' F
id$
$E'T' id
id$ F ? id
$E' T'
$
$E'
$ T' ? ε
$
$ E' ? ε
Lecture 17
Note that productions output are tracing out a lefmost derivation. The grammar symbols
on the stack make up leftsentential forms.
LL(1) Table Construction
Topdown parsing expands a parse tree from the start symbol to the leaves. It always
expand the leftmost nonterminal. Consider the state
S ? ∗ βAγ
with b the next token and we are trying to match βbγ. There are two possibilities
1. b belongs to an expansion of A.
Any A ? α can be used if b can start a string derived from α. In this case we say
that b ∈ FIRST(α )
2. b does not belong to an expansion of A . Expansion of A is empty, i.e., A ? ε
and b belongs an expansion of γ , e.g., bω. which means that b can appear after A
in a derivation of the form S ? ∗ βAbω. We say that b ∈ FOLLOW(A).
Any A ? α can be used if α expands to ε. We say that ε ∈ FIRST(A) in this case.
Definition
FIRST(X) = { b  X ? ∗ ba } ∪ { ε  X ? ∗ ε }
Lecture 18
Computing FIRST Sets
Here is the algorithm for computing the FIRST sets.
1. For all terminal symbols b,
FIRST(b) = {b}
2. For all productions: X ? A1... An
Add FIRST(A 1) − {ε} to FIRST(X), stop if ε ∉FIRST(A1)
Add FIRST(A 2) − {ε} to FIRST(X), stop if ε ∉FIRST(A2)
.....
Add FIRST(A n) − {ε} to FIRST(X), stop if ε ∉FIRST(A n)
Add ε to FIRST(X)
This strategy is encoded in the following procedure
for each a ∈ ( T ∪ ε )
FIRST(a) ? {a}
for each A ∈ NT
FIRST(A) ? Ø
while ( FIRST sets are still changing)
for each A ? β1 β2... βk ∈ P
FIRST(A) ? FIRST(A) ∪ (FIRST(β1) – {ε})
i ? 1
while ( ε ∈ FIRST(β i) and i = k1 )
FIRST(A) ? FIRST(A) ∪ (FIRST(β i+1) – {ε})
i ? i+1
if (i == k and ε ∈ FIRST (βk)
FIRST(A) ? FIRST(A) ∪ {ε}
Example : consider the expression grammar again
1 E ? T E'
2 E' ? + T E'
3
 ε
4 T ? F T'
5 T' ? * F T'
6
 ε
7 F ? ( E )
8

Id
FIRST(id) = { id }
FIRST('(' ) = { ( }
FIRST(‘+' ) = { + }
FIRST(E ) = {FIRST(T ) − {ε} }
FIRST(T ) = {FIRST(F ) − {ε} }
FIRST(F ) = {FIRST( '(') − {ε} } = { ( }
FIRST(F ) = { '(' } + {FIRST( id ) − {ε} } = { ( , id}
FIRST(E' ) = { +, ε }
FIRST(T' ) = { ∗, ε }
Thus,
FIRST(E ) = FIRST(T ) = FIRST(F ) = { (, id }
FIRST(E' ) = { +, ε }
FIRST(T' ) = { ∗, ε }
FOLLOW Sets
Definition:
FOLLOW(X) = { b  S ? ∗ βXbω}
Computing FOLLOW Sets:
1. Add $ to FOLLOW(S) where S is the start nonterminal.
2. If there is a production A ? αBβ , then everything in FIRST(β) – {ε} is in
FOLLOW(B).
3. If there is a production A ? αB, or A ? αBβ , where in ε ∈ FIRST(β)
(i.e., β ? ∗ ε), then everything in FOLLOW(A) is in FOLLOW(B)
The following procedure encodes this strategy.
for each A ∈ NT
FOLLOW(A) ? Ø
FOLLOW(S) ? {$}
while ( FOLLOW sets are still changing)
for each A ? β1 β2... βk ∈ P
FOLLOW(βk) ? FOLLOW(βk) ∪ FOLLOW(A)
T ? FOLLOW(A)
for i ? k downto 2
if ( ε ∈ FIRST (β i) )
FOLLOW(β i1) ? FOLLOW(β i1) ∪ (FIRST(β i) – {ε}) ∪ T
else
FOLLOW(β i1) ? FOLLOW(β i1) ∪FIRST(β i)
T ? Ø
Let’s apply the algorithm to the expression grammar.
Put $ in FOLLOW(E ). By rule (2) applied to production ? ( E ), ‘)’ is also in
FOLLOW(E ). Thus, FOLLOW(E ) = { ), $}. By rule (3) applied to production
E ? T E' , $ and ‘)’ are in FOLLOW(E' ). Thus, FOLLOW(E ) = FOLLOW (E' )
= { ), $ }. Similarly, FOLLOW(T ) = FOLLOW (T' ) = { +, ), $ } and
FOLLOW(F ) = { +, ∗, ), $ }
Sohail Aslam
Compiler Construction Notes Set:21
Lecture 19
LL(1) Table Construction
Here now is the algorithm to construct a predictive parsing table.
1. For each production A ? α
1. for each terminal a in FIRST(α), add A ? α to M[A,a].
2. If ε is in FIRST(α), add A ? α to M[A,b] for each terminal b in FOLLOW(A). If ε
is in FIRST(α), and $ is in FOLLOW(A), add A ? α to M[A,$].
2. Make each undefined entry of M be error.
Let us apply the algorithm to the expression grammar. Since FIRST(TE ') = FIRST(T )
= { (, id }, the production E ? TE' cause M[E,(] and M[E,id] to get E ? TE'. The
production E' ? +TE' causes M[E',+] to get E' ? +TE'. The production E' ? ε causes
M[E',)] and M[E',$] to get E' ? ε since FOLLOW (E' ) = { ), $ }. And so on. The final
parsing table produced is:
id
+
*
(
)
$
E E ? TE'
E ? TE'
E'
E' ? +TE'
E' ? ε E' ? ε
T T ? FT'
T ? FT'
T'
T' ? ε
T ? *FT'
T' ? ε T' ? ε
F F ? id
F ? (E )
Left Factoring
Consider the grammar
E ? T + E  T
T ? int  int∗ T  (E)
It is impossible to predict because for T, two productions start with int. For E, it is not
clear how to predict; the two productions start with the nonterminal T. A grammar must
be left factored before use for predictive parsing. The procedure to leftfactor a grammar
is as follows:
Sohail Aslam
Compiler Construction Notes Set:22
If α ≠ ε, replace all productions
A ? αβ1  αβ2  ....  αβn  γ
with
A ? α Z  γ
Z ? β1 β2 ....  βn
where Z is a new nonterminal
A graphical explanation:
Example: consider following fragment of expression grammar
Factor ?
id

id [ ExprList ]

id ( ExprList )
After left factoring, the grammar becomes
Factor ? id Args
Args ? [ ExprList ]
 ( ExprList )
 ε
Given a CFG that does not meet the LL(1) condition, it is undecidable whether or not an
equivalent LL(1) grammar exists.
A
αβ2
αβ1
αβ3
β2
β1
β3
αZ
A
Sohail Aslam
Compiler Construction Notes Set:363
63
Lecture 20
Bottomup Parsing
Bottomup parsing is more general than topdown parsing. Bottomup parsers handle a
large class of grammars. It is the preferred method in practice. It is also called LR
parsing; L means that tokens are read left to right and R means that the parser constructs a
rightmost derivation. LR parsers do not need leftfactored grammars. LR parsers can
handle leftrecursive grammars.
LR parsing reduces a string to the start symbol by inverting productions.A derivation
consists of a series of rewrite steps
S ⇒ γ0 ⇒ γ1 ⇒ ... ⇒ γn1
⇒ γn ⇒ sentence
Each γi is a sentential form. If γ contains only terminals, γ is a sentence in L(G). If γ
contains ≥ 1 nonterminals, γ is a sentential form. A bottomup parser builds a derivation
by working from input sentence back towards the start symbol S.
Consider the grammar
S →
aABe
A →
Abc  b
B →
d
The sentence abbcde can be reduced to S:
abbcde
aAbcde
aAde
aABe
S
These reductions, in fact, trace out the following rightmost derivation in reverse:
S ⇒ aABe
⇒ aAde
⇒ aAbcde
⇒ abbcde
S ⇒ aBy ⇒ aγy ⇒ xy
Sohail Aslam
Compiler Construction Notes Set:364
64
Consider the grammar
1.
E → E + (E)
2.
 int
The bottomup parse of the string int + (int) + (int) would be
int + (int) + (int)
E + (int) + (int)
E + (E) + (int)
E + (int)
E + (E)
E
The consequence of an LR parser tracing a rightmost derivation in reverse is that given
αβγ be a step of a bottomup parse, assuming that next reduction is A → β τhen γ is a
string of terminals. The reason is that αAγ → αβγ is a step in a rightmost derivation. This
observation provides a strategy for building bottom up parsers: split the input string into
two substrings. Right substring (a string of terminals) is as yet unexamined by parser and
left substring has terminals and nonterminals. The dividing point is marked by a ► (the
► is not part of the string). Initially, all input is unexamined: ►x1 x1 . . . xn.
ShiftReduce Parsing
Bottomup parsing uses only two kinds of actions:
1. Shift
2. Reduce
α
x
y
B
S
γ
rule: B → γ
Terminals only
Sohail Aslam
Compiler Construction Notes Set:365
65
Shift moves ► one place to the right which shifts a terminal to the left string
E + (► int) ⇒ E + (int ►)
In the reduce action, the parser applies an inverse production at the right end of the left
string. If E → E + (E) is a production, then
E + ( E+(E)►) ⇒ E + ( E ►)
ShiftReduce Example
►int + (int) + (int) $
shift
int ► + (int) + (int) $
reduce E → int
E ► + (int) + (int) $
shift 3 times
E + (int ►) + (int) $
reduce E → int
E + (E ►) + (int) $
shift
E + (E) ► + (int) $
reduce E → E+(E)
E ► + (int) $
shift 3 times
E + (int ►) $
reduce E → int
E + (E ►) $
shift
E + (E) ► $
red E → E+(E)
E ► $
accept
Sohail Aslam
Compiler Construction Notes Set:365
65
Lecture 21
ShiftReduce: The Stack
A stack can be used to hold the content of the left string. The Top of the stack is marked
by the ► symbol. The shift action pushes a terminal on the stack. Reduce pops zero or
more symbols from the stack (production rhs) and pushes a nonterminal on the stack
(production lhs)
Discovering Handles
A bottomup parser builds the parse tree starting with its leaves and working toward its
root. The upper edge of this partially constructed parse tree is called its upper frontier. At
each step, the parser looks for a section of the upper frontier that matches righthand side
of some production. When it finds a match, the parser builds a new tree node with the
production’s lefthand nonterminal thus extending the frontier upwards towards the root.
The critical step is developing an efficient mechanism that finds matches along the tree’s
current frontier.
Formally, the parser must find some substring β, of the upper frontier where
1. β is the righthand side of some production A → β, and
2. A → β is one step in rightmost derivation of input stream
We can represent each potential match as a pair 〈A→β,k〉, where k is the position on the
tree’s current frontier of the rightend of β. The pair 〈A→β,k〉, is called the handle of the
bottomup parse.
Handle Pruning
A bottomup parser operates by repeatedly locating handles on the frontier of the partial
parse tree and performing reductions that they specify. The bottomup parser uses a stack
to hold the frontier. The stack simplifies the parsing algorithm in two ways.
First, the stack trivializes the problem of managing space for the frontier. To extend the
frontier, the parser simply pushes the current input token onto the top of the stack.
Second, the stack ensures that all handles occur with their right end at the top of the
stack. This eliminates the need to represent handle’s position.
Sohail Aslam
Compiler Construction Notes Set:366
66
ShiftReduce Parsing Algorithm
push $ onto stack
sym ← nextToken()
repeat until (sym == $ and the stack contains exactly Goal on top of $)
if a handle for A → β on top of stack
pop β symbols off the stack
push A onto the stack
else if (sym ≠ $ )
push sym onto stack
sym ← nextToken()
else /* no handle, no input */
report error and halt
Sohail Aslam
Compiler Construction Notes Set:367
67
Lecture 22
Example: here is the bottomup parser’s action when it parses the expression grammar
sentence
x – 2 × y
(tokenized as id – num * id)
word Stack
Handle
Action
1
id ►
 none 
shift
2 –
id ►
〈Factor → id,1〉
reduce
3 –
Factor ►
〈Term → Factor,1〉
reduce
4 –
Term ►
〈Expr → Term,1〉
reduce
5 –
Expr ►
 none 
shift
6 num Expr – ►
 none 
shift
7 ×
Expr – num ►
〈Factor → num,3〉
reduce
8 ×
Expr – Factor ►
〈Term → Factor,3〉
shift
9 ×
Expr – Term ►
 none 
shift
10 id
Expr – Term × ►
 none 
shift
11 $
Expr – Term × id►
〈Factor → id,5〉
reduce
12 $
Expr–Term × Factor►
〈Term→Term×Factor,5〉
reduce
13 $
Expr – Term ►
〈Expr → Expr – Term,3〉
reduce
14 $
Expr ►
〈Goal → Expr,1〉
reduce
15 $
Goal
 none 
accept
Handles
The handlefinding mechanism is the key to efficient bottomup parsing. As it process an
input string, the parser must find and track all potential handles. For example, every legal
input eventually reduces the entire frontier to grammar’s goal symbol. Thus,
〈Goal → Expr,1〉 is a potential handle at the start of every parse. As the parser builds a
derivation, it discovers other handles. At each step, the set of potential handles represent
different suffixes that lead to a reduction. Each potential handle represent a string of
grammar symbols that, if seen, would complete the righthand side of some production.
Sohail Aslam
Compiler Construction Notes Set:368
68
For the bottomup parse of the expression grammar string, we can represent the potential
handles that the shiftreduce parser should track. Using the placeholder • to represent top
of the stack, there are nine handles:
Handles
1 〈Factor → id •〉
2 〈Term → Factor•〉
3 〈Expr → Term •〉
4 〈Factor → num•〉
5 〈Term → Factor•〉
6 〈Factor → id •〉
7 〈Term → Term × Factor •〉
8 〈Expr → Expr – Term •〉
9 〈Goal → Expr •〉
This notation shows that the second and fifth handles are identical, as are first and sixth.
It also create a way to represent the potential of discovering a handle in future. Consider
the parser’s state in step 6: The parser has recognized Expr –. Using the stackrelative
notation, we can represent the parser’s state as Expr → Expr – • Term. The parser has
already recognized an Expr and a –. If the parser reaches a state where it shifts a Term
on top of Expr and –, it will complete the handle Expr → Expr – Term •. How many
potential handles must the parser recognize? The righthand side of each production can
have a placeholder at its start, at its end and between any two consecutive symbols.
Expr → • Expr –Term
Expr → Expr • – Term
Expr → Expr – •Term
Expr → Expr – Term •
If the righthand side of a production has k symbols, it has k +1 placeholder positions.
Sohail Aslam
Compiler Construction Notes Set:21
Lecture 23
Handles
The number of potential handles for the grammar is simply the sum of the lengths of the
righthand side of all the productions. The number of complete handles is simply the
number of productions. These two facts lead to the critical insight behind LR parsers:
A given grammar generates a finite set of handles (and potential handles)
that the parser must recognize
However, it is not a simple matter of putting the placeholder in righthand side to
generate handles. The parser needs to recognize the correct handle by different right
contexts. Consider the parser’s action at step 9.
word Stack
Handle
Action
9
×
Expr – Term ►
 none 
shift
10
id
Expr – Term × ►
 none 
shift
11
$
Expr – Term × id►
〈Factor → id,5〉
reduce
12
$
Expr–Term × Factor► 〈Term→Term×Factor,5〉 reduce
13
$
Expr – Term ►
〈Expr → Expr – Term,3〉 reduce
14
$
Expr ►
〈Goal → Expr,1〉
reduce
15
$
Goal
 none 
accept
The frontier is Expr – Term, suggesting a handle 〈Expr → Expr – Term〉•. However, the
parser decides to extend the frontier by shifting × on to the stack rather than reducing
frontier to Expr. Clearly, this the correct move for the parser. No potential handle
contains Expr followed by ×. At step 9, the set of potential handles is
〈Expr → Expr – Term•〉
〈Term →Term• × Factor〉
〈Term →Term• / Factor 〉
The next input symbol clearly matches the second choice. The parser needs a basis for
deciding between first (reduce) and second (shift) choices:
〈Expr → Expr – Term•〉
〈Term →Term• × Factor〉
This requires more context than the parser has in the frontier (stack). To choose between
reducing and shifting, the parser must recognize which symbols can occur to the right of
Expr and Term in valid phrases.
Sohail Aslam
Compiler Construction Notes Set:22
LR(1) Parsers
The LR(1) parsers can recognize precisely those languages in which onesymbol
lookahead suffices to determine whether to shift or reduce. The LR(1) construction
algorithm builds a handlerecognizing DFA. The parsing algorithm uses this DFA to
recognize handles and potential handles on the parse stack
Parsing DFA
In order to remember the state the DFA goes into on a symbol, the parser stores the DFA
state in the stack along with the symbol. Initial entry in the stack will be ‘
Parsers represent DFA as a 2D table. The rows correspond to DFA states and columns
correspond to terminals and nonterminals. The columns with terminals and the rows
form the action table while the columns with nonterminals and rows are called the
goto table. It is customary to show these tables together.
Building LR(1) Tables
To construct Action and Goto tables, the LR(1) parser generator builds a model of
handlerecognizing DFA. The model is used to fill in the tables. The LR(1)table
construction needs a concrete representation for the handles and their associated
lookahead symbols. We call this representation an LR(1) item.
0
10
1
2
3
4
7
6
5
11
8
9
int
E
E
+
(
E
)
+
int
int
(
+
)
Reduce:
E → int
on $,+
accept
on $
Reduce:
E → E+(E)
on ),+
Reduce:
E → E+(E)
on ),+
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 24
An LR(1) item is a pair [X → α•β, a] where X → αβ is a production and a∈T
(terminals) is lookahead symbol. The model uses a set of LR(1) items to represent each
parser state. The model is called the canonical collection (CC) of set of LR(1) items.
Canonical Collection
Each set in CC represents a state in the eventual parser DFA. The construction of CC
begins by building a model of parser’s initial state. The initial state consists of the set of
LR(1) items that represent the parser’s initial state, along with any items that must also
hold in the initial state. To simplify the task of building this initial state, the construction
requires that the grammar have a unique goal symbol. The convention is to add a new
start symbol S to grammar and a production
S → E
This leads to the augmented grammar
S → E
E → E + (E)  int
The Closure Procedure
The item [S → •E, $] describes the parser’s initial state. It represents a configuration in
which recognizing S followed by $ would be a valid parse. This item, i.e., [S → •E, $]
becomes the core of the first state in CC, labeled I0. If the grammar has several distinct
productions for the start symbol, each of them generates an item in this initial core of I0.
The procedure closure does this.
closure(s) =
repeat
for each [X → α•Yβ, a] ∈ s
for each production Y → α
for each b ∈ FIRST(βa)
s ← s ∪ [Y → • γ, b]
until s is unchanged
Let’s apply this procedure to the augmented grammar.
The first set is I0 = closure({[S → •E, $] }). Equating the terms in the procedure,
s = {[S → •E, $]}, [X → α •Yβ, a] ⇔ [S → •E, $], X = S, α
= ε, Y = E,
β = ε, a = $, Y → γ
⇔
E → E + (E) and E → int
FIRST(βa) = FIRST($) = $.
Sohail Aslam
Compiler Construction Notes Set:22
2
This leads to expansion of s.
s = { [S → •E, $] } ∪ { [E → •E + (E), $] } ∪ { [E → •int, $] }
= { [S → •E, $] , [E → •E + (E), $] , [E → •int, $] }
The set s changed so we repeat. The item [S → •E, $] is already processed. The for loop
considers [X → α•Yβ, a] ⇔ [E→•E+(E),$], which leads to the match up X = E,
α= ε, Y = E, β
= +(E), a = $, Y → γ
⇔
E → E + (E), ⇔ E → int
FIRST(βa) = FIRST(+(E)$) = +. The set s is extended
s = s ∪ { [E → •E+(E), +] } ∪ { [E → •int, +] }
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 25
The set s changed so the repeat loop is executed again. This time, however, the item
[E → •int, $/+] does not yield any more items because the dot is followed by the
terminal int. The first set of items is
I0 = {
[S → •E, $],
[E → •E+(E), $/+],
[E → •int, $/+]
}
Let’s consider the rationale behind the Closure procedure. If [A → β•Cδ,a] ∈ s, then one
potential completion for the left context is to find a string that reduces to C, followed by
δa. This completion should cause a reduction to A, since it fills out the production’s
righthand side (Cδ), and follows it with a valid lookahead symbol. For a production
C → γ, closure must insert '•' before γ and add appropriate lookahead symbols – all
terminals that can appear as the initial symbol in δa. This includes every terminal in
FIRST(δ). If ε ∈ FIRST(δ), it also includes a, thus FIRST(δa) in the algorithm.
The goto Procedure
The second critical step in the construction is to derive other parser states from I0. To
accomplish this, we compute, for each state Ii and each grammar symbol y, the state that
would arise if the parser recognized a y while in state Ii. A state s that contains
[X → α • yβ, b] has a transition (goto) labeled y to the state that contains the items
goto(s, y) where y can be terminal or a nonterminal.
goto(s, y)
m ← { }
for each item [X → α •yβ, b] ∈ s
m ← m ∪ {[X → α y•β, b]}
return closure(m)
Finite Automaton of Items
The LR(1) items are used as the states of a finite automaton (FA) that maintains
information about the parsing stack and progress of a shiftreduce parser. The FA will
start out as a nondeterministic finite automaton (NFA). A DFA can be constructed from
this NFA using the subset construction, similar to one we used for lexical analysis.
Consider the NFA of LR(0) items, i.e., no lookahead. What are the transitions of the
NFA of LR(0) items? Consider the item A→ α•γ. Suppose γ begins with symbol X which
may be a terminal (token) or nonterminal. The item can be written as A→ α•Xη.
Sohail Aslam
Compiler Construction Notes Set:22
2
Then there is a transition on symbol X for state represented by item A→ α•Xη to state
represented by item A→ αX•η. If X is a terminal, then this transition corresponds to a
shift of X from input to top of parse stack.
If X is a nonterminal, then the interpretation of this transition is more complex because
nonterminals do not appear in input. In fact, such a transition will correspond to pushing
of X onto the stack during the parse. But this can only occur during a reduction by the
production X → β. Such a reduction must be preceded by recognition of a β. The state
given by X → •β represents the beginning of this process (dot indicates we are about to
recognize β). Then for every item A→ α•Xη we must add an εtransition for every
production X → β.
A →
α•Xη
X → • β
ε
A → α•Xη
A → αX• η
X
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 26
The initial DFA state I0 we computed is the εclosure of the set consisting of item
S → •E
Recall the stage in the closure
s = { [S → •E, $] , [E → •E + (E), $] , [E → •int, $] }
The NFA states and transitions required are
Algorithm:
Construction of collection of canonical sets of LR(1) items.
Input:
An augmented grammar G'
Output:
Collection of canonical (CC) sets of LR(1)
[S → •E,$]
ε
[E → •E+(E),$]
[E → •int,$]
ε
Sohail Aslam
Compiler Construction Notes Set:22
2
CC(G')
I0 ← {closure([S' → •S, $])}
CC ← { I0 }
repeat
for each unmarked set Ij ∈ CC
mark Ij as processed
for each X following • in an item in Ij
Ik ← goto(Ij,X)
if Ik ∉ CC then
CC ← CC ∪ Ik
record transition from Ij to Ik on X
until CC is not changing
We use the algorithm to compute the sets of LR(1) items for the augmented grammar G'
S → E
E → E + (E)  int
We computed I0; we now compute the sets goto(I0,X) for various values of X. X can be E,
int, +, ( and ) .
I1 = goto(I0,int): invokes closure({[E → int•,$/+]}). No additional closure is possible
since the dot is at the right end of the production. Thus I1 = {[E → int•, $/+]} and we
have the transition from I0 to I1 on int
I2 = goto(I0,E)
m ← {}
for each [X → •E, b] ∈ I0
⇔[S → •E, $]
⇔[E → •E+(E), $/+]
m = m ∪ {[S → E•, $]} ∪ {[E → E•+(E), $/+]}
return closure(m)
No further closure for the first item because • is at the end
In the second item, a terminal + appears after • so no further closure is possible. Thus
I2 = {[S → E•, $], [E → E• +(E), $/+]}.
We repeat the process in similar fashion.
I3 = goto(I2,+) = {[E → E + • (E), $/+]}
Sohail Aslam
Compiler Construction Notes Set:23
3
I4 = goto(I3,( )
= { [E → E + (• E), $/+], [E → • E + (E), )/+], [E → • int,
)/+]}
I3 = goto(I2,+) = {[E → E + • (E), $/+]}
I4 = goto(I3,( )
= {[E → E + (• E), $/+], [E → • E + (E), )/+], [E → • int,
)/+]}
I5 = goto(I4, int) = { [E → int •, )/+] }
I6 = goto(I4, E ) = { [E → E + (E •), $/+], [E → E • + (E), )/+] }
and so on. The sets and transitions so far yield the DFA
0
reduce E → int on ),+
int
E
+
(
E
int
reduce: E → int on $,+
accept on $
S→ •E, $
E→ •E+(E), $/+
E→ •int, $/+
1
E→ int •, $/+
2
S→ E•, $
E→ E• +(E), $/+
3
E→ E+•(E), $/+
4
E→ E+(•E), $/+
E→ •E+(E), )/+
E→ •int, )/+
6
E→ E+(•E), $/+
E→ •E+(E), )/+
E→ •int, )/+
5
E→ int •, )/+
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 27
LR Table Construction
Construct CC = {I0,I1,I2, ... , In}, for G'. State i of the parser is constructed from the set Ii
The parsing actions for state i are determined as follows:
for each item Ii ∈ CC
if [A → α•a β, b] ∈ Ii and goto(Ii, a) == Ij then
Action[i,a] ← “shift j”
else if [A → •, a] ∈ Ii and A ≠ S' then
Action[i,a] ← “reduce A → α ”
else if [S' → S•, $] ∈ Ii then
Action[i,a] ← “accept”
end for
// the goto table
for each nonterminal A ∈ G
if goto(Ii, A) = Ij then
Goto[i, A] ← j
The initial state is the one that contains the item [S' → •S, $]. All remaining entries are
marked “error”. Let us go through an example and construct the LR table for the
augmented grammar
1.
S' → E
2.
E → T – E
3.
E → T
4.
T → F × T
5.
T → F
6.
F → id
The FIRST sets we would need are
Symbol
FIRST
S'
{ id }
E
{ id }
T
{ id }
F
{ id }
id
{ id }
×
{ × }
–
{ – }
Sohail Aslam
Compiler Construction Notes Set:22
2
We construct the canonical collection of set of LR(1)
I0 = {closure([S' → •E, $])}
{
[S' → •E, $],
[E → •T – E, $], [E → •T, $],
[T → •F×T, $], [T → •F, $]
[T → •F×T, –], [T → •F, –],
[F → •id, $], [F → •id, –],
[F → •id, ×]
}
I1 = {goto(I0, E)} = {[S' → E•, $] }
I2 = {goto(I0, T)} = {[E → T• – E,$], [E →T•,$]}
I3 = {goto(I0, F)} = {
[T → F• ×T, $],
[T → F•, $],
[T → F• ×T, –],
[T → F•, –] }
I4 = {goto(I0, id)} = {
[F → id •, $],
[F → id •, –],
[F → id •, ×] }
I5 = {goto(I2, –)} = {
[E →T– • E, $], [E → •T–E, $], [E → •T, $],
[T → •F×T, $], [T → •F×T, –], [T → •F, $], [T → •F, –],
[F → •id, $], [F → •id, –], [F → •id, ×] }
I6 = {goto(I3, ×)} = {
[T → F× •T, $], [T → F× •T, –],
[T → •F×T, $], [T → •F×T, –], [T → •F, $], [T → •F, –],
[F → •id, $], [F → •id, –], [F → •id, ×] }
I7 = {goto(I5, E)} = {[E → T – E •, $] }
I2 = {goto(I5, T)}, i.e., goto(I5, T) yields the same set as I2.
I3 = {goto(I5, F)}
I4 = {goto(I5, id)}
Sohail Aslam
Compiler Construction Notes Set:23
3
I8 = {goto(I6, T)} = { [T → F×T •, $], [T → F×T •, –] }
I3 = {goto(I6, F)}
I4 = {goto(I6, id)}
We now filling the LR(1) table by applying the rules.
Apply
1. if [A → α•aβ, b] ∈ Ii and goto(Ii, a) = Ij
then set Action[i,a] ← “shift j”. // here, a is a terminal.
I0={[S' → •E,$],[E →•T– E, $],
[E → •T, $], [T → •F×T, $],
[T → •F×T, –], [T → •F, $],
[T → •F, –], [F → •id, $],
[F → •id, –], [F → •id, ×] }
goto(I0, id) = I4
⇒ Action[0, id] ← shift 4
I2= {[E → T• – E, $], [E →T•,$]}, goto(I2, –) = I5
⇒ Action[2, –] ← shift 5
I3={[T → F• ×T, $], [T → F•, $],[T → F• ×T, –], [T → F•, –]}, goto(I3, ×) = I6
⇒ Action[3, ×] ← shift 6
goto(I5, id) = I4
⇒ Action[5, id] ← shift 4
goto(I6, id) = I4
⇒ Action[6, id] ← shift 4
Apply
2. if [A → α•, a] ∈ Ii and A ≠ S' then set action[i,a] to “reduce A → α”
I2 = { [E → T• – E, $], [E →T•,$] }
⇒ Action[2, $] ← reduce 3
I3 = {[T → F• ×T, $], [T → F•, $], [T → F• ×T, –], [T → F•, –] }
⇒ Action[3, $] ← reduce 5
⇒ Action[3, –] ← reduce 5
Sohail Aslam
Compiler Construction Notes Set:24
4
I4 = {[F → id•, $], [F → id•, –], [F → id•, ×] }
⇒ Action[4, $] ← reduce 6
⇒ Action[4, –] ← reduce 6
⇒ Action[4, ×] ← reduce 6
I7 = {[E → T – E•, $] }
⇒ Action[7, $] ← reduce 2
I8 = { [T → F×T•, $], [T → F×T•, –] }
⇒ Action[8, $] ← reduce 4
⇒ Action[8, –] ← reduce 4
Apply
3. if [S' → S•, $] ∈ Ii
then set action[i,$] to “accept”
I1 = {[S' → E•, $] }
⇒ Action[1, $] ← accept
Apply
for each nonterminal A ∈ G
if goto(Ii, A) = Ij then
goto[i, A] ← j.
goto(I0, E) = I1 ⇒ goto[0,E] ← 1
goto(I0, T) = I2 ⇒ goto[0,T] ← 2
goto(I0, F) = I3 ⇒ goto[0,F] ← 3
goto(I5, E) = I7 ⇒ goto[5,E] ← 7
goto(I5, T) = I2 ⇒ goto[5,T] ← 2
goto(I5, F) = I3 ⇒ goto[5,F] ← 3
goto(I6, T) = I8 ⇒ goto[6,T] ← 8
goto(I6, F) = I3 ⇒ goto[6,F] ← 3
Sohail Aslam
Compiler Construction Notes Set:25
5
The final table we get is
Action
Goto
id
–
×
$
E
T
F
0 s4
1
2
3
1
acc
2
s5
r3
3
r5
s6
r5
4
r6
r6
r6
5 s4
7
2
3
6 s4
8
3
7
r2
8
r4
r4
Let us parse the expression x – y × z using the LR(1) table. The scanner will encode the
input string as id – id × id $ where $ is the EOF marker
Stack
Input
¤0
id – id × id $s4
¤0id4
– id × id $r6 F→ id
¤0F3
– id × id $r5 T→ F
¤0T2
– id × id $s5
¤0T2–5
id × id $s4
¤0T2–5id4
× id $r6 F→ id
¤0T2–5F3
× id $s6
¤0T2–5F3×6
id $s4
¤0T2–5F3×6id4
$r6 F→ id
¤0T2–5F3×6F3
$r5 T→ F
¤0T2–5F3×6T8
$r4 T→ F×T
¤0T2–5T2
$r3 E→T
¤0T2 –5E7
$r2 E→T–E
¤0E1
$accept
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 28
LR(1) Skeleton Parser
stack.push(dummy); stack.push(0);
done = false; token = scanner.next();
while (!done) {
s = stack.top();
if( Action[s,token] == “reduce A→β”) {
stack.pop(2×β);
s = stack.top();
stack.push(A);
stack.push(Goto[s,A]);
}
else if( Action[s,token] == “shift i”){
stack.push(token); stack.push(i);
token = scanner.next();
}
else if(Action[s,token] == “accept”
&& token == “$” )
done = true;
else
report error and recover;
}
report success;
Shift/Reduce Conflicts
If a DFA states contains both [X → α•aβ, b] and [Y → γ•, a]
Then on input “a” we could either shift into state [X → αa•β,b], or reduce with Y → γ.
This is called a shiftreduce conflict. Typically, this is due to ambiguities in the grammar.
The classic example of a shiftreduce conflict is the dangling else. Consider the grammar
stmt →
if E then stmt

if E then stmt else stmt
We will have DFA state containing
[stmt → if E then stmt•, else]
[stmt → if E then stmt •else stmt, x]
If else follows, we can shift
[stmt → if E then stmt else • stmt, x]
Sohail Aslam
Compiler Construction Notes Set:22
2
or reduce
[stmt → if E then stmt•, else]
Typical action is shift so that else matches with most recent if.
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 29
Shift/Reduce Conflicts
Consider the ambiguous grammar
E → E + E  E × E  int
We will DFA state containg
[E → E × E•, +]
[E → E • + E, +]
Again we have a shift/reduce conflict. We need to reduce because × has precedence
over +
Reduce/Reduce Conflicts
If a DFA states contains both [X → α •, a] and [Y → β •, a], then on input “a” we don’t
know which production to reduce with. This is called a reducereduce conflict. Usually
due to gross ambiguity in the grammar.
LR(1) Table Size
LR(1) parsing table for even a simple language can be extremely large with thousands of
entries. It is possible to reduce the size of the table. Many states in the DFA are similar.
The core of set of LR items is the set of first components without the lookahead
terminals. For example the core of the item { [X → α•β, b], [Y → γ•δ, d] } is
{ X → α•β, Y → γ•δ
}. Consider the LR(1) states
{ [X → α •, a], [Y → β •, c] }
{ [X → α •, b], [Y → β •, d] }
They have the same core and can be merged. The merged state contains
{ [X → α •, a/b], [Y → β •, c/d]
}
These are called the LALR(1) states. LALR(1) stands for LookAhead LR(1). This leads
to tables that have 10 times fewer states than LR(1).
Sohail Aslam
Compiler Construction Notes Set:22
2
Here is the algorithm to generate LALR(1) DFA.
Repeat until all states have distinct code
choose two distinct states with same core
merge states by creating a new one with the union of all the items
point edges from predecessors to new state
new state points to all the previous successors
LALR languages are not natural. They are an efficiency hack on LR languages. Any
reasonable programming language has a LALR(1) grammar. LALR(1) has become a
standard for programming languages and for parser generators.
Sohail Aslam
Compiler Construction Notes Set:21
1
Lecture 30
Parser Generators
Parser generators exist for LL(1) and LALR(1) grammars. For example,
• LALR(1)  YACC, Bison, CUP
• LL(1) – ANTLR
• Recursive Descent  JavaCC
YACC Parser Generator
YACC – Yet Another Compiler Compiler, appeared in 1975 as a Unix application. The
other companion application Lex appeared at the same time. These two greatly aided the
construction of compilers and interpreters. The input to YACC consists of a specification
text file. The structure of the file is
definitions
%%
rules
%%
C/C++ functions
Sohail Aslam
Compiler Construction Notes Set:22
2
Here, for example, is the YACC file for a calculator
%token NUMBER LPAREN RPAREN
%token PLUS MINUS TIMES DIVIDE
%%
expr : expr PLUS expr
 expr MINUS expr
 expr TIMES expr
 expr DIVIDE expr
 LPAREN expr RPAREN
 MINUS expr
 NUMBER
;
%%
The Flex input file for a calculator is
%{
#include "y.tab.h"
%}
digit
[09]
ws
[ \t\n]+
%%
{ws}
;
{digit}+ {return NUMBER;}
"+"
{return PLUS;}
"*"
{return TIMES;}
"/"
{return DIVIDE;}
"–"
{return MINUS;}
%%
The following diagram outlines the process of building a parser with YACC and Lex.
YACC
expr.y
expr.l
y.tab.c
lex.yy.c
lex
CC
expr.exe
y.tab.h
Sohail Aslam
Compiler Construction Notes Set:590
Lecture 31
Beyond Syntax
These questions are part of contextsensitive analysis. Answers depend on values, not
parts of speech. Answers may involve computation.
These questions can be answered by using formal methods such as contextsensitive
grammars and attribute grammars or by using adhoc techniques.
One of the most popular is the use of attribute grammars.
Attribute Grammars
A CFG is augmented with a set of rules. Each symbol in the derivation has a set of values
or attributes. Rules specify how to compute a value for each attribute
Consider the grammar for signed binary numbers (SBN)
Number →
Sign List
Sign
→
+ –
List
→
List Bit  Bit
Bit
→
01
The string “–1” can be derived as follows:
Number →
Sign List
→
– List
→
– Bit
→
– 1
Similarly, the derivation for “101” is
Number
→
Sign List
→
Sign List Bit
→
Sign List 1
→
Sign List Bit 1
→
Sign List 0 1
→
Sign Bit 0 1
→
Sign 1 0 1
→
– 1 0 1
Sohail Aslam
Compiler Construction Notes Set:591
For an attributed version of SBN, the following attributes are needed
Symbol
Attributes
Number
val
Sign
neg
List
pos, val
Bit
pos, val
We will add rules to compute decimal value of a signed binary number
Productions
Attribution Rules
Number → Sign List
List.pos ← 0
if Sign.neg then
Number.val ← – List.val
else Number.val ← List.val
Sign → +
Sign.neg ← false
Sign → –
Sign.neg ← true
List0 → List1 Bit
List1.pos ← List0.pos + 1
Bit.pos ← List0.pos
List0.val ← List1.val + Bit.val
List → Bit
Bit.pos ← List.pos
List.val ← Bit.val
Bit → 0
Bit.val ← 0
Bit → 1
Bit.val ← 2Bit.pos
Attributes are associated with nodes in parse tree. Rules are value assignments associated
with productions. Rules and parse tree define an attribute dependence graph which must
be acyclic.
Sohail Aslam
Compiler Construction Notes Set:592
Attributes are distinguished based on the direction of value flow. Attributes of a node
whose values are defined wholly in terms of attributes of node’s children and from
constants are called synthesized attributes. Values used to compute synthesized attributes
flow bottomup in the parse tree.
Attributes whose values are defined in terms of a node’s own attributes, node’s siblings
and node’s parent are called inherited attributes. Values flow topdown and laterally in
the parse tree. The following attributed tree shows the inherited and synthesized attributes
for the input signed binary number 101.
Number
Sign
List
Bit
–
1
List.pos ← 0
List.val ← Bit.val = 1
Number.val ← – List.val = –1
Bit.pos ← 0
Bit.val ← 2Bit.pos = 1
Sign.neg ← true
Number
Sign
List
Bit
–
1
val: 5
List
List
Bi
t
1
Bit
0
pos: 0
val: 5
pos: 0
val: 1
pos: 1
val: 4
pos: 1
val: 0
pos: 2
val: 4
pos: 2
val: 4
neg: true
Sohail Aslam
Compiler Construction Notes Set:593
When the parse tree is peeled away, we get the attribute dependence graph
–
1
val: 5
1
0
pos: 0
val: 5
pos: 0
val: 1
pos: 1
val: 4
pos: 1
val: 0
pos: 2
val: 4
pos: 2
val: 4
neg: true
Sohail Aslam
Compiler Construction Notes Set:590
Lecture 32
Evaluation Methods
A number of ways can be used to evaluate the attributes. When using Dynamic method,
the compiler application builds the parse tree and then builds the dependence graph. A
topological sort of the graph is carried out and attributes are evaluated or defined in
topological order. In rulebased (or treewalk) methodology, the attribute rules are
analyzed at compilergeneration time. A fixed (static) ordering is determined and the
nodes in the dependency graph are evaluated this order. In oblivious (passes, dataflow)
methodology, the attribute rules and parse tree are ignored. A convenient order is picked
at compiler design time and used.
Attribute grammars have not achieved widespread use due to a number of problems. For
example: nonlocal computation, traversing parse tree, storage management for short
lived attributes and lack of highquality inexpensive tools. However, a variation of
attribute grammars and evaluation schemes is used in many compilers. This variation is
called adhoc analysis.
In rulebased evaluators, a sequence of actions is associated with grammar productions.
Organizing actions required for contextsensitive analysis around structure of the
grammar leads to powerful, albeit adhoc, approach which is used on most parsers. A
snippet of code (action) is associated with each production that executes at parse time
In topdown parsers, the snippet is added to the appropriate parsing routine. In a bottom
up shiftreduce parsers, the actions are performed each time the parser performs a
reduction. Here the LR(1) skeleton parser indicating the place where the snippet is
executed.
stack.push(dummy); stack.push(0);
done = false; token = scanner.next();
while (!done) {
s = stack.top();
if( Action[s,token] == “reduce A→β”) {
invoke the code snippet
stack.pop(2×β);
s = stack.top();
stack.push(A);
stack.push(Goto[s,A]);
}
else if( Action[s,token] == “shift i”){
stack.push(token); stack.push(i);
token = scanner.next();
}
}