Preprocessor
A preprocessor is a program that processes its input data to produce output that is used as
input to another program. The output is said to be a preprocessed form of the input data,
which is often used by some subsequent programs like compilers.
A preprocessor may perform the following functions:
1. Macro processing
2. File inclusion
3. Rational preprocessors
4. Language extension
1. Macro processing:
A macro is a rule or pattern that specifies how a certain input sequence should be mapped to
an output sequence according to a defined procedure. The mapping process that instantiates a
macro into a specific output sequence is known as macro expansion.
2. File Inclusion:
The preprocessor includes header files in the program text. When the preprocessor finds an
#include directive, it replaces it with the entire content of the specified file.
3. Rational Preprocessors:
These preprocessors augment older languages with more modern flow-of-control and data-
structuring facilities.
4. Language extension :
These processors attempt to add capabilities to the language by what amounts to built-in
macros. For example, the language Equel is a database query language embedded in C.
Assembler
An assembler creates object code by translating assembly instruction mnemonics into machine
code. There are two types of assemblers:
One-pass assemblers go through the source code once and assume that all symbols will
be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code.
Linker
A linker or link editor is a program that takes one or more objects generated by a compiler and
combines them into a single executable program. Three tasks of the linker are:
1. Searches the program to find library routines used by the program, e.g. printf(), math routines.
2. Determines the memory locations that code from each module will occupy and relocates its
instructions by adjusting absolute references.
3. Resolves references among files.
Loader
A loader is the part of an operating system that is responsible for loading programs into
memory, one of the essential stages in the process of starting a program.
The compilation process is a sequence of various phases. Each phase takes input from the previous stage, has its
own representation of the source program, and feeds its output to the next phase of the compiler. Let us understand
the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters
and converts it into meaningful lexemes. The lexical analyzer represents these lexemes as tokens of the form:
<token-name, attribute-value>
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input
and generates a parse tree (or syntax tree). In this phase, token arrangements are checked against the source
code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example,
it checks that values are assigned between compatible data types; adding a string to an integer would be
reported as an error. The semantic analyzer also keeps track of identifiers, their types and expressions, and
whether identifiers are declared before use. It produces an annotated syntax tree as output.
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source program for the target machine.
It represents a program for some abstract machine and sits in between the high-level language and the machine
language. This intermediate code should be generated in such a way that it is easy to translate into
the target machine code.
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization can be thought of as something
that removes unnecessary code lines and arranges the sequence of statements in order to speed up program
execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and maps it to the
target machine language. The code generator translates the intermediate code into a sequence of (generally)
relocatable machine code. This sequence of machine instructions performs the same task as the intermediate code
would.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All identifier names along with
their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier
record and retrieve it. The symbol table is also used for scope management.
Phases of Compiler - Compiler Design
by Dinesh Thakur
Analysis part
The analysis part breaks the source program into constituent pieces and imposes a grammatical
structure on them; it then uses this structure to create an intermediate representation of the
source program.
Information about the source program is collected and stored in a data structure called symbol
table.
Synthesis part
The synthesis part takes the intermediate representation as input and transforms it into the target
program.
The design of compiler can be decomposed into several phases, each of which converts one form of
source program into another.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve two common tasks: symbol-table management and
error handling.
Lexical Analysis
Lexical analysis is the first phase of compiler which is also termed as scanning.
The source program is scanned to read the stream of characters, and those characters are grouped to
form sequences called lexemes, for which tokens are produced as output.
Token: A token is a sequence of characters that represents a lexical unit matching a
pattern, such as a keyword, operator or identifier.
Pattern: A pattern describes the rule that the lexemes of a token follow. It is the structure that
must be matched by strings.
Once a token is generated the corresponding entry is made in the symbol table.
Output: Token
Token Template: <token-name, attribute-value>
(e.g.) c = a + b * 5;
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
Syntax Analysis
Syntax analysis is the second phase of the compiler, which is also called parsing.
Parser converts the tokens produced by lexical analyzer into a tree like representation called
parse tree.
Syntax tree is a compressed representation of the parse tree in which the operators appear as
interior nodes and the operands of the operator are the children of the node for that operator.
Input: Tokens
Output: Syntax tree
Intermediate Code Generation
Intermediate code generation produces intermediate representations for the source program,
which take one of the following forms:
o Postfix notation
o Syntax tree
o Three address code
(e.g.) For c = a + b * 5, the three-address code is:
t1 = inttofloat(5)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Properties of intermediate code: it should be easy to produce, and easy to translate into the
target program.
Code Optimization
Code optimization phase gets the intermediate code as input and produces optimized
intermediate code as output.
This phase reduces redundant code and attempts to improve the intermediate code so that
faster-running machine code will result. During code optimization, the result of the program is
not affected.
Optimization techniques include, for example, constant folding and loop unrolling.
(e.g.) The three-address code above can be optimized to:
t1 = id3 * 5.0
id1 = id2 + t1
Code Generation
It gets input from the code optimization phase and produces the target code or object code as a
result. Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
(e.g.) The optimized code above may be translated into target instructions such as:
ADDF R1, R2
STF id1, R1
Symbol Table
The symbol table is used to store all the information about identifiers used in the program.
It is a data structure containing a record for each identifier, with fields for the attributes of the
identifier.
It allows the compiler to find the record for each identifier quickly, and to store or retrieve data
from that record.
Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
Error Handling
Each phase can encounter errors. After detecting an error, a phase must handle the error so that
compilation can proceed.
In code optimization, an error occurs if a transformation changes the result of the program. In
code generation, an error is reported when, for example, required code is missing.
Figure illustrates the translation of source code through each phase, considering the statement
c =a+ b * 5.
A program may have the following kinds of errors at various stages:
Lexical Errors
These include an incorrect or misspelled name of some identifier, i.e., identifiers typed incorrectly.
Syntactical Errors
These include a missing semicolon or unbalanced parentheses. Syntactic errors are handled by the
syntax analyzer (parser).
When an error is detected, it must be handled by the parser to enable the parsing of the rest of the
input. In general, errors may be expected at various stages of compilation, but most errors are
syntactic, and hence the parser should be able to detect and report those errors in the
program.
There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.
o Panic mode.
o Statement level.
o Error productions.
o Global correction.
Semantical Errors
These errors result from incompatible value assignments. The semantic errors that the semantic
analyzer is expected to recognize are:
Type mismatch.
Undeclared variable.
Reserved identifier misuse.
Multiple declaration of variable in a scope.
Accessing an out of scope variable.
Actual and formal parameter mismatch.
Logical errors
These occur when the program is syntactically and semantically correct but does not do what the
programmer intended. The compiler cannot detect them.
Lex
Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions.
lex.l is an input file written in a language which describes the generation of a lexical analyzer.
The lex compiler transforms lex.l into a C program known as lex.yy.c.
The output of the C compiler is the working lexical analyzer, which takes a stream of input
characters and produces a stream of tokens.
yylval is a global variable shared by the lexical analyzer and the parser to return the name and
attribute value of a token.
The attribute value can be a numeric code, a pointer to the symbol table, or nothing.
The structure of a lex program is:
declarations
%%
translation rules
%%
auxiliary functions
Declarations: This section includes declarations of variables, constants and regular definitions.
Translation rules: Each rule has the form pattern { action }, where the action is code to execute
when the pattern matches a lexeme.
Auxiliary functions: This section holds additional functions which are used in actions. These
functions are compiled separately and loaded with the lexical analyzer.
The lexical analyzer produced by lex starts its process by reading one character at a time until a
valid match for a pattern is found.
Once a match is found, the associated action takes place to produce a token.
A conflict arises when several prefixes of the input match one or more patterns. This is resolved
as follows:
The longest matching prefix is always preferred.
If two or more patterns match the same longest prefix, then the pattern listed first in the lex
program is preferred.
Lookahead Operator
The lookahead operator is an additional operator read by lex in order to distinguish the additional
pattern for a token.
The lexical analyzer reads one character ahead of the valid lexeme and then retracts to
produce the token.
At times, certain characters must follow the lexeme in the input for the pattern to match. In
such cases, a slash (/) is used in the pattern to indicate the end of the part of the pattern that
matches the lexeme; what follows the slash is lookahead context.
(e.g.) In Fortran, a statement such as IF(I,J) = 3 results in conflict whether to treat IF as an
array name or as a keyword. To resolve this, the lex rule for the keyword IF can be written as
IF / \( .* \) {letter}
Lex itself creates the following components from the lex program:
o A transition table for an automaton that recognizes lexemes matching the patterns.
o Actions from the input program (fragments of code) which are invoked by the automaton
simulator when needed.
Step 1: Convert each regular expression into an NFA, either by Thompson's construction or by the
direct method.
Step 2: Combine all NFAs into one by introducing a new start state with ε-transitions to each of the
start states of the NFAs Ni for patterns Pi.
Step 2 is needed because the objective is to construct a single automaton that recognizes lexemes
matching any of the patterns.
(e.g.)
a    { action A1 for pattern P1 }
abb  { action A2 for pattern P2 }
a*b+ { action A3 for pattern P3 }
For the string abb, patterns P2 and P3 both match, but pattern P2 is taken into account as it is
listed first in the lex program.
Fig. shows NFAs for recognizing the above mentioned three patterns.
The combined NFA for all three given patterns is shown in Fig.
Pattern Matching Based on NFAs
The lexical analyzer reads input from the input buffer, starting at the beginning of the lexeme
pointed to by lexemeBegin. The forward pointer moves ahead through the input symbols, and the
simulator calculates the set of NFA states it is in at each point. If the NFA simulation has no next
state for some input symbol, then no longer prefix can reach an accepting state. In that case, the
decision is made on the longest prefix seen so far that matches some pattern, by retracting to the
last point at which one or more accepting states were reached. If there are several accepting
states there, the pattern Pi that appears earliest in the list of the lex program is chosen.
(e.g.) W = aaba
Explanation
The process starts with the ε-closure of the initial state 0. After processing all the input symbols,
no state is found, as there is no transition out of state 8 on input a. Hence, we look for an
accepting state by retracting to a previous state. From Fig., state 2, which is an accepting state,
was reached after reading the input symbol a, and therefore the pattern a has been matched. At
state 8, the string aab had been matched with the pattern a*b+. By the lex rule, the longest
matching prefix should be considered. Hence, the action A3 corresponding to pattern P3 will be
executed for the string aab.
DFAs for Lexical Analyzers
DFAs are also used to represent the output of lex. The DFA is constructed from the NFA by
converting all the patterns into an equivalent DFA using the subset construction algorithm. If a
DFA state contains one or more accepting NFA states, the first pattern whose accepting state is
represented in that DFA state is determined and displayed as the output of the DFA state. The
operation of the DFA is similar to that of the NFA: simulation of the DFA continues until no next
state is found; then retraction takes place to find the last accepting state of the DFA, and the
action associated with the pattern for that state is executed.
The lookahead operator r1/r2 is needed because the pattern r1 for a particular token may need to
describe some trailing context r2 in order to correctly identify the actual lexeme.
If some prefix of the input is recognized by the NFA as a match for the regular expression, the
lexeme does not end where the NFA reaches the accepting state. The end of the lexeme occurs
when the NFA enters a state p such that
p has an ε-transition on /, and there is a path from the start state of the NFA to state p that
spells out the lexeme.
Figure shows the NFA for recognizing the keyword IF with lookahead. The transition from state 2 to
state 3 represents the lookahead operator (an ε-transition).
The accepting state is state 6, which indicates the presence of the keyword IF. Hence, whenever
the accepting state (state 6) is reached, the lexeme IF is found by looking backwards to state 2.