Preprocessor
A preprocessor is a program that processes its input data to produce output that is used as
input to another program. The output is said to be a preprocessed form of the input data,
which is often used by some subsequent programs like compilers.
A preprocessor may perform the following functions:
1. Macro processing
2. File inclusion
3. Rational preprocessors
4. Language extension
1. Macro processing:
A macro is a rule or pattern that specifies how a certain input sequence should be mapped to
an output sequence according to a defined procedure. The mapping process that instantiates a
macro into a specific output sequence is known as macro expansion.
2. File Inclusion:
The preprocessor includes header files in the program text. When the preprocessor finds an
#include directive, it replaces it with the entire content of the specified file.
3. Rational Preprocessors:
These preprocessors augment older languages with more modern flow-of-control and data-
structuring facilities.
4. Language extension :
These processors attempt to add capabilities to the language by what amounts to built-in
macros. For example, the language Equel is a database query language embedded in C.
Assembler
An assembler creates object code by translating assembly instruction mnemonics into machine
code. There are two types of assemblers:
One-pass assemblers go through the source code once and assume that all symbols will
be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code.
Linker
A linker or link editor is a program that takes one or more objects generated by a compiler and
combines them into a single executable program. Three tasks of the linker are:
1. Searches the program to find library routines used by the program, e.g. printf(), math routines.
2. Determines the memory locations that code from each module will occupy and relocates its
instructions by adjusting absolute references.
3. Resolves references among files.
Loader
A loader is the part of an operating system that is responsible for loading programs into
memory, one of the essential stages in the process of starting a program.
The compilation process is a sequence of various phases. Each phase takes input from the previous stage, has its
own representation of the source program, and feeds its output to the next phase of the compiler. Let us understand
the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters
and converts it into meaningful lexemes. The lexical analyzer represents these lexemes as tokens of the form:
<token-name, attribute-value>
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input
and generates a parse tree (or syntax tree). In this phase, token arrangements are checked against the source
code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example,
it checks that values are assigned between compatible data types; adding a string to an integer would be
reported as an error. The semantic analyzer also keeps track of identifiers, their types and expressions, and
whether identifiers are declared before use. It produces an annotated syntax tree as output.
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source program for the target machine.
It represents a program for some abstract machine and sits in between the high-level language and the machine
language. This intermediate code should be generated in such a way that it is easy to translate into
the target machine code.
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization can be thought of as something
that removes unnecessary code lines and arranges the sequence of statements in order to speed up program
execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the
target machine language. The code generator translates the intermediate code into a sequence of (generally)
relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code
would.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All identifier names along with
their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier
record and retrieve it. The symbol table is also used for scope management.
Phases of Compiler - Compiler Design
by Dinesh Thakur
Analysis part
The analysis part breaks the source program into constituent pieces and imposes a grammatical
structure on them; it then uses this structure to create an intermediate representation of the
source program.
Information about the source program is collected and stored in a data structure called symbol
table.
Synthesis part
The synthesis part takes the intermediate representation as input and transforms it into the target
program.
The design of compiler can be decomposed into several phases, each of which converts one form of
source program into another.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve two common tasks: symbol-table management and
error handling.
Lexical Analysis
Lexical analysis is the first phase of compiler which is also termed as scanning.
The source program is scanned to read the stream of characters, and those characters are grouped to
form sequences called lexemes, for which tokens are produced as output.
Token: A token is a sequence of characters that represents a lexical unit matching a
pattern, such as a keyword, operator or identifier.
Pattern: A pattern describes the rule that the lexemes of a token follow. It is the structure that
must be matched by strings.
Once a token is generated the corresponding entry is made in the symbol table.
Output: Token
Token Template: <token-name, attribute-value>
(e.g.) c = a + b * 5;
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
Syntax Analysis
Syntax analysis is the second phase of the compiler, which is also called parsing.
Parser converts the tokens produced by lexical analyzer into a tree like representation called
parse tree.
Syntax tree is a compressed representation of the parse tree in which the operators appear as
interior nodes and the operands of the operator are the children of the node for that operator.
Input: Tokens
Output: Syntax tree
Intermediate Code Generation
Intermediate code generation produces intermediate representations for the source program,
which take one of the following forms:
o Postfix notation
o Syntax tree
o Three address code
(e.g.) For c = a + b * 5, the three-address code is:
t1 = inttofloat(5)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Properties of intermediate code: it should be easy to produce, and easy to translate into the
target program.
Code Optimization
Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.
This phase reduces redundant code and attempts to improve the intermediate code so that
faster-running machine code will result. During code optimization, the result of the program is
not affected.
Optimization techniques include, for example, constant folding and loop unrolling.
(e.g.) The three-address code above can be optimized to:
t1 = id3 * 5.0
id1 = id2 + t1
Code Generation
It gets input from the code optimization phase and produces the target code or object code as a
result. Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
(e.g.) The optimized code above may be translated into target instructions such as:
ADDF R1, R2
STF id1, R1
Symbol Table
The symbol table is used to store all the information about identifiers used in the program.
It is a data structure containing a record for each identifier, with fields for the attributes of the
identifier.
It allows the compiler to find the record for each identifier quickly, and to store or retrieve data
from that record.
Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
Error Handling
Each phase can encounter errors. After detecting an error, a phase must handle the error so that
compilation can proceed.
In code optimization, an error occurs if a transformation changes the result of the program. In
code generation, an error is reported when, for example, required code is missing.
Figure illustrates the translation of source code through each phase, considering the statement
c =a+ b * 5.
A program may have the following kinds of errors at various stages:
Lexical Errors
These include an incorrect or misspelled name of some identifier, i.e., identifiers typed incorrectly.
Syntactical Errors
These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the
syntax analyzer (parser).
When an error is detected, it must be handled by the parser to enable the parsing of the rest of the
input. In general, errors may be expected at various stages of compilation, but most errors are
syntactic, and hence the parser should be able to detect and report those errors in the
program.
There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.
o Panic mode.
o Statement level.
o Error productions.
o Global correction.
Semantical Errors
These errors result from incompatible value assignments. The semantic errors that the semantic
analyzer is expected to recognize are:
Type mismatch.
Undeclared variable.
Reserved identifier misuse.
Multiple declaration of variable in a scope.
Accessing an out of scope variable.
Actual and formal parameter mismatch.
Logical errors
These occur when the program is syntactically and semantically correct but does not do what the
programmer intended. The compiler cannot detect them.
Lex
Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions.
lex.l is an input file written in a language which describes the generation of a lexical analyzer.
The lex compiler transforms lex.l into a C program known as lex.yy.c.
The output of the C compiler is the working lexical analyzer, which takes a stream of input
characters and produces a stream of tokens.
yylval is a global variable shared by the lexical analyzer and the parser to return the name and
attribute value of a token.
The attribute value can be a numeric code, a pointer to the symbol table, or nothing.
The structure of a lex program is:
declarations
%%
translation rules
%%
auxiliary functions
Declarations: This section includes declarations of variables, constants and regular definitions.
Translation rules: Each rule has the form pattern { action }, where the action is code to execute
when the pattern matches a lexeme.
Auxiliary functions: This section holds additional functions which are used in actions. These
functions are compiled separately and loaded with the lexical analyzer.
The lexical analyzer produced by lex starts its process by reading one character at a time until a
valid match for a pattern is found.
Once a match is found, the associated action takes place to produce a token.
A conflict arises when several prefixes of the input match one or more patterns. This is resolved
as follows:
The longest matching prefix is always preferred.
If two or more patterns match the same longest prefix, then the pattern listed first in the lex
program is preferred.
Lookahead Operator
The lookahead operator is an additional operator read by lex in order to distinguish the additional
pattern for a token.
The lexical analyzer reads one character ahead of the valid lexeme and then retracts to
produce the token.
At times, certain characters must follow the lexeme in the input for the pattern to match. In
such cases, a slash (/) is used in the pattern to indicate the end of the part of the pattern that
matches the lexeme; what follows the slash is lookahead context.
(e.g.) In Fortran, a statement such as IF(I,J) = 3 results in conflict whether to treat IF as an
array name or as a keyword. To resolve this, the lex rule for the keyword IF can be written as
IF / \( .* \) {letter}
Lex itself creates the following components from the lex program:
o A transition table for an automaton that recognizes lexemes matching the patterns.
o Actions from the input program (fragments of code) which are invoked by the automaton
simulator when needed.
Step 1: Convert each regular expression into an NFA, either by Thompson's construction or by the
direct method.
Step 2: Combine all NFAs into one by introducing a new start state with ε-transitions to each of the
start states of the NFAs Ni for patterns Pi.
Step 2 is needed because the objective is to construct a single automaton that recognizes lexemes
matching any of the patterns.
(e.g.)
a    { action A1 for pattern P1 }
abb  { action A2 for pattern P2 }
a*b+ { action A3 for pattern P3 }
For the string abb, patterns P2 and P3 both match, but pattern P2 is taken into account as it is
listed first in the lex program.
Fig. shows NFAs for recognizing the above mentioned three patterns.
The combined NFA for all three given patterns is shown in Fig.
Pattern Matching Based on NFAs
The lexical analyzer reads input from the input buffer, starting at the beginning of the lexeme
pointed to by lexemeBegin. The forward pointer moves ahead through the input symbols, and the
simulator calculates the set of NFA states it is in at each point. If the NFA simulation has no next
state for some input symbol, then no longer prefix can reach an accepting state. In that case, the
decision is made on the longest prefix seen so far that matches some pattern, by retracting to the
last point at which one or more accepting states were reached. If there are several accepting
states there, the pattern Pi that appears earliest in the list of the lex program is chosen.
(e.g.) W = aaba
Explanation
The process starts with the ε-closure of the initial state 0. After processing all the input symbols,
no state is found, as there is no transition out of state 8 on input a. Hence, we look for an
accepting state by retracting to a previous state. From Fig., state 2, which is an accepting state,
was reached after reading the input symbol a, and therefore the pattern a has been matched. At
state 8, the string aab had been matched with the pattern a*b+. By the lex rule, the longest
matching prefix should be considered. Hence, the action A3 corresponding to pattern P3 will be
executed for the string aab.
DFAs for Lexical Analyzers
DFAs are also used to represent the output of lex. The DFA is constructed from the NFA by
converting all the patterns into an equivalent DFA using the subset construction algorithm. If a
DFA state contains one or more accepting NFA states, the first pattern whose accepting state is
represented in that DFA state is determined and displayed as the output of the DFA state. The
operation of the DFA is similar to that of the NFA: simulation of the DFA continues until no next
state is found; then retraction takes place to find the last accepting state of the DFA, and the
action associated with the pattern for that state is executed.
The lookahead operator r1/r2 is needed because the pattern r1 for a particular token may need to
describe some trailing context r2 in order to correctly identify the actual lexeme.
If some prefix of the input is recognized by the NFA as a match for the regular expression, the
lexeme does not end where the NFA reaches the accepting state. The end of the lexeme occurs
when the NFA enters a state p such that
p has an ε-transition on /, and there is a path from the start state of the NFA to state p that
spells out the lexeme.
Figure shows the NFA for recognizing the keyword IF with lookahead. The transition from state 2 to
state 3 represents the lookahead operator (an ε-transition).
The accepting state is state 6, which indicates the presence of the keyword IF. Hence, whenever
the accepting state (state 6) is reached, the lexeme IF is found by looking backwards to state 2.