
2G1508-L01
Introduction
Lexical Analysis

Christian Schulte
IMIT, KTH
www.imit.kth.se/~schulte/
2005-10-25

Overview
- Organizational
- Course overview
- Compiler structure
- Lexical analysis

Organizational

Textbook
- Andrew W. Appel, Modern Compiler Implementation in Java, 2nd edition, Cambridge University Press, 2002.

Kursnämnd (course committee)
- Two volunteers needed!

Elect and Sign Up!
- Sign up on the list (most likely you'll have to write down all your details)
- Do not forget to elect the course

No labs
- There will be no labs this time
  - lab sessions are cancelled
- Lab part of course
  - three assignments (10 points each) to be submitted
  - corrected by Mikael Lagerkvist
  - at least 15 points required to pass
  - points valid as bonus points on exam if submitted in time (only this academic year)

Examination
- course passed
- labs passed
- full exam
- 240 points

Reading Suggestion
- Chapters 1 and 2

Course Overview

Compiler
- General question: how to execute a program written in some high-level programming language
- Compiler translates program from one programming language into another
  - language compiled from: source language
  - language compiled to: target language

Compiler and Execution Environments
- Two aspects
  - compilation: transform into language good for execution
  - execution: execute program
- Source language: for programming
  - examples: Java, C, C++, Oz, ...
- Target language: for execution
  - examples: assembler (x86, MIPS, ...), JVM code

Execution Environments
- Can be concrete hardware
  - how to manage memory
  - how to link and load programs
  - take advantage of architectural features
- Can be abstract machine
  - how to interpret abstract machine code efficiently
  - how to further compile at runtime

Compilation
Basic structure and tasks

Compilation Phases
  source program -> frontend -> intermediate representation -> backend -> target program
- Frontend depends on source language
- Backend depends on target language
- Factorize dependencies

Frontend: Tasks
- Lexical analysis
  - how program is composed into tokens (words)
  - typical token classes: identifier, number, keywords, ...
  - creates token stream
- Syntax analysis
  - phrasal structure of program (sentences)
  - grammar rules describing how expressions, statements, etc. are formed
  - creates abstract syntax tree
- Semantic analysis
  - perform identifier analysis (scope), type checking, ...
  - creates intermediate representation trees
  - after that: canonicalize and clean up

Backend: Basic Tasks
- Optimization
  - reduce execution time and program size
  - typically independent of target architecture
  - intermediate and complex component: "midend"
- Instruction selection
  - which instruction for a certain abstract operation
- Register allocation
  - which variables are kept in which registers?
  - which variables go to memory
- More generic: memory allocation
- Code emission

Optimization
- Common subexpression elimination (CSE)
  - reuse intermediate results
- Dead-code elimination
  - remove code that can never be executed
- Strength reduction
  - make operations in loops cheaper: instead of multiplying with n, increment by n (array access)
- Constant/value propagation
  - propagate information on values of variables
- Code motion
  - move invariant code out of loops
- Many, many more, ...
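
Strength reduction, as listed on the Optimization slide, can be illustrated by hand. The following Python sketch is only an illustration under assumptions: a compiler would apply this rewrite to its intermediate representation, and the function names offsets_multiply and offsets_reduced are hypothetical.

```python
def offsets_multiply(count, n):
    """Offsets i*n for i in 0..count-1, with one multiplication per iteration."""
    result = []
    for i in range(count):
        result.append(i * n)   # multiply with n in every iteration
    return result


def offsets_reduced(count, n):
    """Same offsets, with the multiplication replaced by a running addition."""
    result = []
    offset = 0
    for _ in range(count):
        result.append(offset)
        offset += n            # increment by n instead of multiplying with n
    return result
```

Both functions return the same list, for example the element offsets of an array access with element size n.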

Overall Structure
- Compiler has two main phases
  - analysis: understand program ("front end")
  - synthesis: put it together in different way ("back end")
- Analysis typically broken up into
  - lexical: break into words or "tokens"
  - syntax: parse phrase structure of program
  - semantic: calculate program's meaning

Lexical Analysis

Lexical Analyzer
- Also: lexer
- Takes a stream of characters
- Produces a stream of tokens
  - names
  - keywords
  - punctuation marks
- Discards white space and comments
- Simple task

Lexical Tokens
- Sequence of characters treated as unit in grammar of programming language
- Programming language classifies tokens into finite set of token types
  - some tokens have semantic value attached (ID, NUM, ...)
- Punctuation tokens such as IF, VOID, RETURN constructed from characters: reserved words
  - cannot be used as identifiers
- Non-tokens
  - comments, preprocessor directives, whitespace

Example Token Types
  ID        foo  n14  last
  NUM       73  0  00  5151
  REAL      3.75  .2  1e23  5.5e-10
  IF        if
  COMMA     ,
  NOTEQ     !=
  LPAREN    (
  RPAREN    )

Example Program
  float match0(char* s) {
    /* find a zero */
    if (!strncmp(s, "0.0", 3))
      return 0.;
  }

Example Token Stream
  FLOAT        ID(match0)   LPAREN
  CHAR         STAR         ID(s)
  RPAREN       LBRACE       IF
  LPAREN       BANG         ID(strncmp)
  LPAREN       ID(s)        COMMA
  STRING(0.0)  COMMA        NUM(3)
  RPAREN       RPAREN       RETURN
  REAL(0.0)    SEMI         RBRACE
  EOF

Approach
- Specification of lexical tokens: regular expression (regexp)
- Implementation of lexer: deterministic finite automaton (DFA)
- Computing DFA from regexp: nondeterministic finite automaton (NFA)

Regular Expressions
- Language: set of strings
  - possibly infinite set
- String: finite sequence of symbols
  - symbols are taken from finite alphabet
- Example
  - language of primes: decimal digit strings representing prime numbers
  - alphabet is ASCII character set
- Regular expression: stands for set of strings

Regular Expressions
- Symbol a
  - denotes language just containing string a
- Alternation M|N
  - where M and N are regular expressions
  - string in language of M|N, if string in language of M or in language of N
- Concatenation MN
  - where M and N are regular expressions
  - string in language of MN, if concatenation of strings α and β such that α in language of M and β in language of N

Regular Expressions
- Epsilon ε
  - denotes language just containing the empty string
- Repetition M*
  - where M is regular expression
  - called Kleene closure
  - string in language of M*, if concatenation of zero or more strings in language of M

Regular Expression Examples
  a|b          {"a","b"}
  (a|b)a       {"aa","ba"}
  (ab)|ε       {"ab",""}
  ((a|b)a)*    {"","aa","ba","aaaa","aaba","baaa","baba",...}
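
The example languages above can be checked by brute force. The sketch below uses Python's re module (not part of the course material) and a hypothetical helper named language; ε is written as an empty alternative, which Python's syntax accepts. Only strings up to a small length are enumerated, so infinite languages are cut off.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=4):
    """All strings over `alphabet` with at most `max_len` symbols
    that are matched as a whole by `pattern`."""
    matched = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if re.fullmatch(pattern, s):
                matched.append(s)
    return matched


print(language("a|b"))        # ['a', 'b']
print(language("(a|b)a"))     # ['aa', 'ba']
print(language("(ab)|"))      # ['', 'ab']
print(language("((a|b)a)*"))  # ['', 'aa', 'ba', 'aaaa', 'aaba', 'baaa', 'baba']
```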

Lexical Specification

Conventions
- Sometimes omit · or ε
  - ab means a·b
  - (a|) means (a|ε)
- Kleene closure binds tighter than concatenation
  - ab* means a(b)*
- Concatenation binds tighter than alternation
  - ab|c means (ab)|c

Examples
- Even binary numbers
  - (0|1)*0
- Strings of a's and b's with no consecutive a's
  - b*(abb*)*(a|ε)
- Strings of a's and b's with consecutive a's
  - (a|b)*aa(a|b)*

Abbreviations
  [abcd]        means a|b|c|d
  [b-g]         means [bcdefg]
  [a-cA-C01]    means [abcABC01]
  M?            means (M|ε)
  M+            means (MM*)
  .             any character but newline
  "xyz+-*"      stands for itself

Programming Language Token Specifications
  if                                       IF
  [a-z][a-z0-9]*                           ID
  [0-9]+                                   NUM
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)      REAL
  (" "|"\t"|"\n"|"\r")                     no token
  .                                        error
- Lexical specification needs to be complete
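
The claims made for the three example expressions (even binary numbers, no consecutive a's, consecutive a's) can be tested exhaustively for short strings. This is a sketch using Python's re module, with ε written as an empty alternative; the helper strings and the variable names are assumptions made for this illustration.

```python
import re
from itertools import product


def strings(alphabet, max_len):
    """Yield every string over `alphabet` with at most `max_len` symbols."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


# Even binary numbers: the last digit is 0.
even = re.compile(r"(0|1)*0")
even_ok = all(bool(even.fullmatch(s)) == s.endswith("0")
              for s in strings("01", 8))

# Strings of a's and b's with no consecutive a's.
no_aa = re.compile(r"b*(abb*)*(a|)")
no_aa_ok = all(bool(no_aa.fullmatch(s)) == ("aa" not in s)
               for s in strings("ab", 8))

# Strings of a's and b's with consecutive a's.
has_aa = re.compile(r"(a|b)*aa(a|b)*")
has_aa_ok = all(bool(has_aa.fullmatch(s)) == ("aa" in s)
                for s in strings("ab", 8))
```

Each of the three flags compares the regular expression against a direct string test, so a False value would point to a mistake in the expression.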

Disambiguation
- Does if8 match ID or IF NUM(8)?
- Disambiguation rules commonly used
  - longest match: longest initial substring that can match any regexp is token
  - rule priority: for particular longest initial substring, first matched regexp determines token-type; order is significant
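
The two disambiguation rules can be sketched directly on top of the token specifications. The code below is a minimal illustration and not the automaton-based approach of this lecture: it tries every rule with Python's re module at the current position, keeps the longest match, and lets the earlier rule win ties. The names RULES and tokenize are hypothetical.

```python
import re

# Rules in priority order; a name of None means "no token" (white space).
RULES = [
    ("IF",   r"if"),
    ("ID",   r"[a-z][a-z0-9]*"),
    ("NUM",  r"[0-9]+"),
    ("REAL", r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)"),
    (None,   r"[ \t\n\r]"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Split text into (token type, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Longest match wins; '>' keeps the earlier rule on a tie
            # (rule priority).
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_len == 0:
            # No rule matches: the "error" case of the specification.
            raise ValueError(f"illegal character {text[pos]!r} at {pos}")
        if best_name is not None:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("if8"))   # [('ID', 'if8')]
print(tokenize("if 8"))  # [('IF', 'if'), ('NUM', '8')]
```

So if8 is a single ID by longest match, while a lone if is IF because the IF rule is listed before ID.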

Finite Automata
- Regular expressions for specification
- Finite automata for implementation
- Finite automaton has
  - finite set of states
  - edges leading from state to state, labeled with symbol
  - one start state
  - set of final states

Finite Automaton for IF
  1 --i--> 2 --f--> 3
- Start state: 1
- Final states: 3

Finite Automaton for ID
  1 --[a-z]--> 2,  2 --[a-z], [0-9]--> 2
- Start state: 1
- Final states: 2

Finite Automata
- Deterministic finite automaton (DFA)
  - no two edges leaving the same state are labeled with the same symbol
- Otherwise: nondeterministic finite automaton (NFA)

Accepted Language
- DFA accepts or rejects a string
- How does accepting a string work
  - start from start state
  - for each input character, follow exactly one edge according to next character to next state
  - no edge exists: reject
  - after n transitions for an n character string: if in final state, accept string, otherwise reject
- Language accepted by DFA
  - set of accepted strings

Example DFA
[Figure: example DFA with edges labeled a and b; diagram lost in extraction]
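
The acceptance procedure can be sketched for the ID automaton (start state 1, final state 2, edge 1 to 2 on a-z, loop on state 2 for a-z and 0-9). The dictionary representation and the names EDGES and accepts are assumptions made for this illustration.

```python
import string

START = 1
FINAL = {2}

# Edges of the ID automaton: 1 --[a-z]--> 2 and 2 --[a-z0-9]--> 2.
EDGES = {}
for c in string.ascii_lowercase:
    EDGES[(1, c)] = 2
    EDGES[(2, c)] = 2
for c in string.digits:
    EDGES[(2, c)] = 2


def accepts(s):
    """Run the DFA on s: follow one edge per character, reject if none exists."""
    state = START
    for c in s:
        if (state, c) not in EDGES:
            return False          # no edge exists: reject
        state = EDGES[(state, c)]
    return state in FINAL         # accept only when ending in a final state


print(accepts("n14"))  # True
print(accepts("14n"))  # False
```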

Accepting abab
  String to process    State
  abab                 1 (start state)
  bab                  4
  ab                   5
  b                    4
  (empty)              accept: final state!

Combining DFAs
- Formally: little later
- Idea: label final states of each DFA with token-type it accepts
  - watch out for rule priority: label according to priority
- Implement as transition matrix
  - state number × character → state number
  - final states: bitvector, etc.
  - dead state: for no transition
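
A transition matrix for a combined automaton might look as follows. This is a sketch under assumptions: the state numbering, the hand-built combination of IF, ID and NUM, and the names TRANS, FINAL and classify are made up for this illustration, and a dictionary stands in for the bitvector of final states.

```python
DEAD = 0          # dead state: entered when no transition exists
NUM_STATES = 6    # states 0..5, state 1 is the start state

# TRANS[state][character code] -> next state; everything defaults to DEAD.
TRANS = [[DEAD] * 128 for _ in range(NUM_STATES)]


def add_edges(src, chars, dst):
    for c in chars:
        TRANS[src][ord(c)] = dst


LETTERS = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

add_edges(1, LETTERS, 4)            # any letter starts an identifier ...
add_edges(1, "i", 2)                # ... but i may start the keyword if
add_edges(1, DIGITS, 5)
add_edges(2, LETTERS + DIGITS, 4)
add_edges(2, "f", 3)
add_edges(3, LETTERS + DIGITS, 4)
add_edges(4, LETTERS + DIGITS, 4)
add_edges(5, DIGITS, 5)

# Final states labeled with the token type they accept. State 3 is reached
# by "if", which matches both IF and ID; it gets the higher-priority label.
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM"}


def classify(s):
    """Token type of the whole string s, or None if s is not a token."""
    state = 1
    for c in s:
        state = TRANS[state][ord(c)] if ord(c) < 128 else DEAD
    return FINAL.get(state)


print(classify("if"))   # IF
print(classify("if8"))  # ID
print(classify("42"))   # NUM
```

Row 0 of the matrix contains only DEAD, so once the dead state is entered it is never left.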

Recognizing Longest Match
- Keep track of longest match so far
- Remember last final state
  - the state itself
  - position in string when at last final state
- When dead state entered
  - last final state: which token matched
  - position: where matching ended, where to start for next token

Nondeterministic
Finite Automata

NFAs
- NFA can have multiple edges for same symbol
- NFA can have edges labeled with ε
  - follow edge without eating any symbol
- How to accept?
  - guessing is difficult to implement
  - use trick: maintain all states that so far could have been reached!

Example NFA
[Figure: example NFA with edges labeled a and b; diagram lost in extraction]
- To process: abbb

Accepting abbb
  String to process    Set of states
  abbb                 {1} (containing start state)
  bbb                  {2,4}
  bb                   {2,3,5}
  b                    {2,3}
  (empty)              {2,3}
- accepted: final state 3 ∈ {2,3}

NFA versus DFA
- NFA used for creating from regexp
  - bad for processing: sets are expensive!
- DFA used for processing
  - turn NFA into DFA: "subset" construction
  - use idea as in example: sets of states, do transitions immediately
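
Both ideas, tracking a set of states and the subset construction, can be sketched in a few lines. The NFA below is hypothetical (it is not the example NFA of these slides): it accepts strings over a and b that end in ab and has one ε edge from the start state. All names are assumptions made for this illustration.

```python
EPS = ""   # label used for epsilon edges

# Hypothetical NFA: 0 --eps--> 1, 1 loops on a and b, 1 --a--> 2, 2 --b--> 3.
NFA = {
    (0, EPS): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
    (2, "b"): {3},
}
START, FINAL = 0, {3}


def closure(states):
    """All states reachable from `states` using epsilon edges only."""
    result, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in result:
                result.add(t)
                todo.append(t)
    return frozenset(result)


def step(states, c):
    """States reachable from `states` by consuming character c."""
    targets = set()
    for s in states:
        targets |= NFA.get((s, c), set())
    return closure(targets)


def nfa_accepts(text):
    """Simulate the NFA by maintaining all states that could be reached."""
    states = closure({START})
    for c in text:
        states = step(states, c)
    return bool(states & FINAL)


def subset_construction(alphabet="ab"):
    """Build a DFA whose states are sets of NFA states."""
    start = closure({START})
    dfa, todo, seen = {}, [start], {start}
    while todo:
        d = todo.pop()
        for c in alphabet:
            target = step(d, c)
            if not target:
                continue              # no edge: corresponds to the dead state
            dfa[(d, c)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {d for d in seen if d & FINAL}
    return start, dfa, finals


def dfa_accepts(text):
    """Run the DFA obtained from the subset construction."""
    start, dfa, finals = subset_construction()
    state = start
    for c in text:
        if (state, c) not in dfa:
            return False
        state = dfa[(state, c)]
    return state in finals
```

The set operations in nfa_accepts are repeated for every input character, whereas the subset construction pays for them once and leaves a plain table lookup per character.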

Summary
- Compilers
  - translate from source to target language
  - have frontend and backend
- Programs executed in Execution Environment
- Lexical analysis
  - lexical structure: character stream to token stream
  - specification: regular expressions
  - computation: DFA
  - transformation from regexp to DFA: NFA
