
Lecture Outline

Lecture 3: Syntax: Grammars, Derivations, Parse Trees. Scanning.

Programming Languages: Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary

September 1st, 2010


Formal Grammars cont'd

How to use a grammar to generate sentences?

1. Let α be a sequence containing just the start variable: α = vs.
2. While α contains any non-terminals, do:
   2.1 Choose one non-terminal (say, v) in α.
   2.2 From R, choose a rule (say, r) in which v appears on the left-hand side.
   2.3 Replace the chosen occurrence of v in α with the right-hand side of r.
3. Return α.

What if α contains a non-terminal v for which there is no rule in R that would have v on its left-hand side?

The grammar is incomplete.

Formal Grammars cont'd

Example (Formal grammar)

V = {c}
S = {a, b}
R = {(c, ε), (c, aca), (c, bcb)}
vs = c

Is the string abacaba valid in L? Is ababbbaba valid in L? What is the language L generated by the grammar?
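For illustration, here is a minimal Python sketch of the generation procedure above, applied to the example grammar (the rule table and the function name generate are mine, not part of the lecture code):

    import random

    # The example grammar: non-terminal 'c', terminals 'a' and 'b',
    # rules c -> "" (epsilon), c -> "aca", c -> "bcb", start variable 'c'.
    RULES = {"c": ["", "aca", "bcb"]}
    START = "c"

    def generate(rules, start):
        """Derive one sentence by repeatedly rewriting a chosen non-terminal."""
        alpha = start
        while any(sym in rules for sym in alpha):
            # Choose one occurrence of a non-terminal in alpha.
            positions = [i for i, sym in enumerate(alpha) if sym in rules]
            i = random.choice(positions)
            # Choose a rule for it and replace the occurrence with its right-hand side.
            rhs = random.choice(rules[alpha[i]])
            alpha = alpha[:i] + rhs + alpha[i + 1:]
        return alpha

    print(generate(RULES, START))  # e.g. '', 'aa', 'abba', 'baaaab', ...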


Backus-Naur Form

BNF Notation

Grammars are usually written using a special notation: the Backus-Naur Form (BNF). BNF is often extended with convenience symbols to shorten the notation: the Extended BNF (EBNF). BNF (and EBNF) is a metalanguage, a language for talking about languages. We will use EBNF extensively during the course.

Backus-Naur Form cont'd

Elements of BNF

Terminals are distinguished from non-terminals (variables) by some typographical convention, for example:
- non-terminals are written in italics, using angle brackets, etc.;
- terminals are written in a monotype font, enclosed in quotation marks, etc.

Rules are written as strings which contain:
- a non-terminal,
- a special production symbol (typically, ::=),
- a sequence of terminals and non-terminals, or the symbol ε.

By convention,
- the terminals and non-terminals of the grammar are those, and only those, included in at least one of the rules;
- the left-hand side (the first element) of the topmost rule is the start variable vs.


Backus-Naur Form cont'd

Example (BNF representation of a grammar, Γ1)

⟨c⟩ ::= ε
⟨c⟩ ::= a ⟨c⟩ a
⟨c⟩ ::= b ⟨c⟩ b

In this Γ1, V = {⟨c⟩}, S = {a, b}, R = {(⟨c⟩, ε), (⟨c⟩, a ⟨c⟩ a), (⟨c⟩, b ⟨c⟩ b)}, vs = ⟨c⟩.

The specified language L(Γ1) is:
L(Γ1) = {ε, aa, bb, aaaa, baab, abba, bbbb, aaaaaa, baaaab, ...}

Backus-Naur Form cont'd

Example (EBNF representation of a grammar, Γ1)

The grammar can also be written as

⟨c⟩ ::= ε
      | a ⟨c⟩ a
      | b ⟨c⟩ b

or as

⟨c⟩ ::= ε | a ⟨c⟩ a | b ⟨c⟩ b

The special symbol | has the meaning of "or", and is an element of the metalanguage, not of the language specified by the grammar.
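As an aside, a membership test for L(Γ1) is easy to write down directly from the three rules; the following Python sketch (the function name is mine, not from the lecture) checks a string against them:

    def in_L1(s):
        """Is s in L(Gamma_1), i.e. derivable from c ::= epsilon | a c a | b c b?"""
        if s == "":                      # rule c ::= epsilon
            return True
        if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
            return in_L1(s[1:-1])        # rules c ::= a c a and c ::= b c b
        return False

    assert in_L1("") and in_L1("abba") and in_L1("baaaab")
    assert not in_L1("abacaba") and not in_L1("aba")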


Backus-Naur Form cont'd

Metasyntactic extensions

Convenient extensions to the metalanguage include:
- the special symbols [ and ], used to enclose a subsequence that appears in the string at most once;
- the special symbols { and }, used to enclose a subsequence that appears in the string any number of times.¹

Alternatively, we can use only the symbols { and } together with a superscript to specify the number of occurrences:
- { sequence }² means two subsequent occurrences of sequence;
- { sequence }⁺ means at least one occurrence of sequence;
- { sequence }* means any number of occurrences of sequence.

Further extensions are possible (and are sometimes used).

¹ The Kleene closure.

Chomsky's Hierarchy of Languages

Noam Chomsky defined four classes of languages:
- Type 0: Unconstrained Languages
- Type 1: Context-Sensitive Languages
- Type 2: Context-Free Languages
- Type 3: Regular Languages



Chomsky's Hierarchy of Languages cont'd

Note:
- All regular languages are context-free, but not all context-free languages are regular.
- All context-free languages are context-sensitive [sic], but not all context-sensitive languages are context-free.
- etc.

This may sound unintuitive, but it follows a well-established convention.

Regular Grammars

What is a regular language?
A regular language is a language generated by a regular grammar.

In a regular grammar, all rules are of one of the forms:²

⟨v⟩ ::= s ⟨v′⟩
⟨v⟩ ::= s
⟨v⟩ ::= ε

where s ∈ S, v, v′ ∈ V, and it is not required that v ≠ v′ (v′ may be the same variable as v).

Example (A regular grammar)

⟨string⟩ ::= a ⟨substring⟩ | b ⟨substring⟩
⟨substring⟩ ::= c ⟨substring⟩ | ε

Regular grammars are conveniently expressed with regular expressions. The above could be written as (a|b)c*, (?:a|b)c*, or [ab]c*, etc.

² These are right-regular grammars. In left-regular grammars, the first rule form above is replaced by ⟨v⟩ ::= ⟨v′⟩ s.
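To connect the two notations, here is a small Python check (my own illustration) that the regular expression [ab]c* accepts strings generated by the grammar above and rejects others:

    import re

    # Anchored regex corresponding to <string> ::= a <substring> | b <substring>,
    # <substring> ::= c <substring> | epsilon.
    PATTERN = re.compile(r"[ab]c*\Z")

    for s in ["a", "b", "accc", "bc", "", "ca", "abc"]:
        print(s, bool(PATTERN.match(s)))
    # Accepted: a, b, accc, bc; rejected: "", ca, abc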

Context-Free Grammars

What is a context-free language?
A context-free language is a language generated by a context-free grammar.

In a context-free grammar, all rules are of the form:

⟨v⟩ ::= α

where v ∈ V and α ∈ (V ∪ S)* (the set of all sequences of variables from V and symbols from S).³

Example (A non-regular context-free grammar)

⟨expression⟩ ::= ⟨number⟩
              | ⟨expression⟩ ⟨operator⟩ ⟨expression⟩
              | ( ⟨expression⟩ )
              | ...

³ (V ∪ S)* is the Kleene closure of V ∪ S.

Context-Sensitive and Unconstrained Languages

What is a context-sensitive language?
A context-sensitive language is a language generated by a context-sensitive grammar.

In a context-sensitive grammar, all rules are of the form:

α ⟨v⟩ β ::= α γ β

where v ∈ V, and α, β, γ ∈ (V ∪ S)*.

What is an unconstrained language?
An unconstrained language is a language generated by an unrestricted grammar.

In an unrestricted grammar, all rules are of the form:

α ::= β

where α, β ∈ (V ∪ S)* and α is non-empty.



Chomsky's Hierarchy of Languages cont'd

Why care about the hierarchy of languages?

Different grammars have different computational complexity:
unconstrained, context-sensitive, context-free, regular (listed from the most to the least complex).

- Regular grammars are commonly used to define the microsyntax of programming languages: the syntax of lexemes as sequences of symbols from the alphabet of characters.⁴
- Context-free grammars are used to define the (macro)syntax of programming languages: the syntax of programs as sequences of symbols from the alphabet of tokens (classified lexemes).⁵
- Additional constraints may be needed to further constrain the syntax, e.g., by specifying that variable identifiers can be used only after they have been declared, etc.⁶

⁴ CTMCP uses the term lexical syntax rather than microsyntax; others use the term lexical structure.
⁵ Macrosyntax is usually referred to as syntactic structure.
⁶ The less restrictive the metalanguage used to define the grammar, the more restrictive the grammar can be with respect to the specified language.

Lecture Outline

Programming Languages: Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary

Syntactic Analysis of Programs

How are programs processed?

- The initial input is linear: it is a sequence of symbols from the alphabet of characters.
- A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens.
- A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program: a syntax tree (parse tree).
- The syntax tree is further processed, e.g., by an interpreter or by a compiler.

We have seen some of these steps implemented in the mdc interpreter.⁷

⁷ There, both the microsyntax and the syntax were trivial, no parsing was really needed as the intermediate representation was linear and colinear with the list of tokens, and no compilation was developed.

Syntactic Analysis of Programs cont'd

How are programs processed? cont'd

Program:        if X == 1 then ...
Input:          i f   X   = =   1   t h e n   . . .
Lexemization:   if X == 1 then ...
Tokenization:   key(if) var(X) op(==) int(1) key(then) ...
Parsing:        program(ifthenelse(eq(var(X) int(1)) ... ...) ...)
Interpretation: actions according to the program and language semantics
Compilation:    code generation according to the program and language semantics
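A minimal Python sketch of what the intermediate results of this pipeline might look like as values (the data shapes and names are mine, purely illustrative):

    # Hypothetical intermediate results for the program "if X == 1 then ..." as Python values.
    source = "if X == 1 then ..."                        # input: a linear string of characters

    lexemes = ["if", "X", "==", "1", "then", "..."]      # after lexemization

    tokens = [("key", "if"), ("var", "X"), ("op", "=="), # after tokenization:
              ("int", 1), ("key", "then")]               # classified lexemes

    # after parsing: a nested structure (a parse/syntax tree), e.g.
    tree = ("ifthenelse", ("eq", ("var", "X"), ("int", 1)), "...", "...")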

Syntactic Analysis of Programs cont'd

Example (Partial microsyntax of Oz, using Perl-style regexes)

⟨variable⟩ ::= [A-Z][A-Za-z0-9_]*

A variable (a variable name) consists of an uppercase letter followed by any number of word characters.
Variable is valid as a variable name; atom and 123 are not.

Example (Partial microsyntax of Oz, using POSIX classes)

⟨atom⟩ ::= [[:lower:]][[:word:]]*
additional constraint: no keyword is an atom

An atom consists of a lowercase letter followed by any number of word characters.
variable is valid as an atom; Atom and 123 are not.

Syntactic Analysis of Programs cont'd

Example (Partial syntax of Oz)

⟨statement⟩ ::= skip
             | if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩ end
             | ...

where skip, if, then, else, and end are symbols from the alphabet of lexemes.

if X then skip else if Y then skip else skip end end is a valid statement in Oz; if X then skip end and if x then skip else skip end are not.⁸


⁸ The former is not valid in the Oz kernel language, but is valid in the syntactically extended version.
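For illustration, the two lexeme classes can be checked with Python's re module (a sketch; the keyword list below is a small assumed subset, not the full Oz keyword set):

    import re

    VARIABLE = re.compile(r"[A-Z][A-Za-z0-9_]*\Z")
    ATOM     = re.compile(r"[a-z][A-Za-z0-9_]*\Z")   # lowercase letter, then word characters
    KEYWORDS = {"if", "then", "else", "end", "skip"} # assumed subset of Oz keywords

    def is_variable(s):
        return bool(VARIABLE.match(s))

    def is_atom(s):
        # additional constraint from the slide: no keyword is an atom
        return bool(ATOM.match(s)) and s not in KEYWORDS

    print(is_variable("Variable"), is_variable("atom"), is_variable("123"))  # True False False
    print(is_atom("variable"), is_atom("Atom"), is_atom("then"))             # True False False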

Syntactic Analysis of Programs contd


Note: It is convenient to use indentation to make the structure of a program clear to the programmer, but (in Oz) this is inessential for the syntactic and semantic validity of programs.

Syntactic Analysis of Programs contd


Note: In some programming languages indentation is essential for the syntactic and semantic validity of programs.

Example (Indentation in Oz)


if A then skip else if B then if C then skip else skip end else skip end end

Example (Indentation in Python)


# valid function definition def foo(bar): print bar return foo # invalid def foo(bar): print bar return foo # invalid def foo(bar): print bar return foo


Syntactic Analysis of Programs cont'd

Note: In some programming languages the programmer has control of whether indentation is essential for the syntactic and semantic validity of programs or not.

Example (Indentation in F#)

(* valid, no indentation required *)
let hello = fun name ->
printf "hello, %a" name

(* invalid, 4-space indentation required *)
#light
let hello = fun name ->
printf "hello, %a" name

Derivations

Following the recipe for using a grammar explained earlier, we can derive sentences in the language L(Γ) specified by a grammar Γ in a sequence of steps.

- In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule.
- The first sentential form is the start variable vs alone.
- The last sentential form is a valid sentence, composed only of terminals.

Sequences of sentential forms starting with vs and ending with a sentence in L(Γ), obtained as specified above, are called derivations.


Derivations cont'd

The following are two of infinitely many derivations possible to obtain with the previously defined grammar Γ1.⁹

Example (Derivation using Γ1)

1. ⟨c⟩
2. a ⟨c⟩ a
3. ab ⟨c⟩ ba
4. abba

Example (Derivation using Γ1)

1. ⟨c⟩
2. ε

⁹ ⟨c⟩ ::= ε | a ⟨c⟩ a | b ⟨c⟩ b.

Derivations cont'd

Rightmost and leftmost derivations

A derivation is a sequence of sentential forms beginning with a single non-terminal and ending with a (valid) sequence of terminals.

- A derivation such that in each step it is the leftmost non-terminal that is replaced is called a leftmost derivation.
- A derivation such that in each step it is the rightmost non-terminal that is replaced is called a rightmost derivation.
- There can be derivations that are neither leftmost nor rightmost.

Given a start variable v and a sequence s of terminals, there can be
- no derivation of s from v (if s is not valid in the defined language);
- exactly one derivation of s from v;
- more than one derivation.

Derivations cont'd

Example (A leftmost derivation)

1. ⟨statement⟩
2. if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩ end
3. if A then ⟨statement⟩ else ⟨statement⟩ end
4. if A then skip else ⟨statement⟩ end
5. if A then skip else if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩ end end
...
11. if A then skip else if B then if C then skip else skip end else skip end end

Derivations cont'd

Example (A rightmost derivation)

1. ⟨statement⟩
2. if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩ end
3. if ⟨variable⟩ then ⟨statement⟩ else if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩ end end
...
11. if A then skip else if B then if C then skip else skip end else skip end end


Syntax Trees

Syntax tree

A parse tree (a syntax tree) is a structured representation of a program.

- Parse trees are generated in the process of parsing programs.
- A parser is a function (a program) that takes as input a sequence of tokens (the output of a lexer) and returns a nested data structure corresponding to a parse tree.
- The data structure returned by the parser is an internal (intermediate) representation of the program.

A parse tree can be used to:
- interpret the program (in interpreted languages);
- generate target code (in compiled languages);
- optimize the intermediate code (in both interpreted and compiled languages).

Syntax Trees

Example (Syntax tree)

Let Γ have the following rule(s):

⟨v⟩ ::= ε | a ⟨v⟩ | ⟨v⟩ b | ⟨v⟩ ⟨v⟩

Does the sequence ba belong to L(Γ)?

Yes, it has the following parse tree (given here in bracketed form): the root ⟨v⟩ expands to ⟨v⟩ ⟨v⟩, the left ⟨v⟩ expands to ⟨v⟩ b with the inner ⟨v⟩ expanding to ε, and the right ⟨v⟩ expands to a ⟨v⟩ with the inner ⟨v⟩ expanding to ε:

⟨v⟩( ⟨v⟩( ⟨v⟩(ε) b ) ⟨v⟩( a ⟨v⟩(ε) ) )

How many distinct derivations lead from ⟨v⟩ to ba?

There are six such derivations (check this!).
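A parse tree like the one above is naturally represented as a nested data structure; here is a small Python sketch (the tuple encoding is my own choice, for illustration) together with a function that reads the derived sentence back off the leaves:

    # Parse tree for "ba": each node is ("v", children...); terminals are plain strings.
    tree = ("v",
            ("v", ("v",), "b"),        # <v> ::= <v> b, inner <v> ::= epsilon
            ("v", "a", ("v",)))        # <v> ::= a <v>, inner <v> ::= epsilon

    def yield_of(node):
        """Concatenate the terminal leaves of a parse tree, left to right."""
        if isinstance(node, str):
            return node
        return "".join(yield_of(child) for child in node[1:])

    print(yield_of(tree))  # -> "ba"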


Syntax Trees cont'd

Example (A simple syntax tree for Oz)

The Oz grammar includes the following rules:

⟨statement⟩ ::= skip
             | if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩ end

with the microsyntactic definition of ⟨variable⟩ given earlier.

What is the parse tree for if A then skip else if B then skip else skip end end?

[Parse tree: the root ⟨statement⟩ has children if, ⟨variable⟩ (A), then, ⟨statement⟩ (skip), else, ⟨statement⟩, end; the ⟨statement⟩ in the else branch has children if, ⟨variable⟩ (B), then, ⟨statement⟩ (skip), else, ⟨statement⟩ (skip), end.]

Syntax Trees cont'd

Suppose we rewrite the grammar above as

⟨statement⟩ ::= skip
             | if ⟨variable⟩ then ⟨statement⟩ else ⟨statement⟩
             | if ⟨variable⟩ then ⟨statement⟩

How many syntax trees does if A then if B then skip else skip have, given this grammar?

There are two parse trees for this sequence; see the next slide.


Syntax Trees cont'd

Example (Parse tree for if A then if B then skip else skip)

[Parse tree 1: the outer ⟨statement⟩ has children if, ⟨variable⟩ (A), then, ⟨statement⟩; that ⟨statement⟩ is the inner if, with children if, ⟨variable⟩ (B), then, ⟨statement⟩ (skip), else, ⟨statement⟩ (skip). Here the else belongs to the inner if.]

Example (Parse tree for if A then if B then skip else skip)

[Parse tree 2: the outer ⟨statement⟩ has children if, ⟨variable⟩ (A), then, ⟨statement⟩, else, ⟨statement⟩ (skip); the then branch is the inner if, with children if, ⟨variable⟩ (B), then, ⟨statement⟩ (skip). Here the else belongs to the outer if.]

Syntax Trees cont'd

Does it matter that a sentence has more than one parse tree?

For a sentence like if A then if B then skip else skip where all the conditional actions are skip (do nothing, noop), it does not matter much.

In general, it does matter, since what actions will be taken and in which order depends on how the program is understood by the interpreter (or compiler), which in turn depends on how the program is parsed.

It is therefore essential that
- the specification of the syntax is unambiguous, and
- the programmer does not make false assumptions about how the code will be parsed.


Syntax Trees cont'd

Example (The if-then-else construct in Python)

Given these two pieces of code, what is the output for each possible combination of values, if both a and b can have a value from {True, False}?

1.
if a:
    if b:
        print 1
    else:
        print 2

2.
if a:
    if b:
        print 1
else:
    print 2

- a = True, b = True: both print 1
- a = True, b = False: the first prints 2, the second nothing
- a = False, b = True: the second prints 2, the first nothing
- a = False, b = False: the second prints 2, the first nothing

The lack of end would add to the grammar ambiguity, which is resolved by involving whitespace in the specification.

Syntax Trees cont'd

Example (Multistatement lines in Python)

In Python, a semicolon (;) can be used to separate multiple statements within one line.¹⁰ Which of the following are equivalent?

1.
if a: print 1; print 2

2.
if a:
    print 1
    print 2

3.
if a:
    print 1
print 2

1. is equivalent to 2. What about if a: if b: print 1; else print 2?¹¹

¹⁰ Multistatement lines are considered bad practice in Python.
¹¹ Invalid syntax.



Ambiguity

A grammar is ambiguous if a sentence can be parsed in more than one way:
- the program has more than one parse tree, that is,
- the program has more than one leftmost derivation.¹²

Note: The fact that a program has more than one derivation is not sufficient to consider the grammar ambiguous. In practice, most programs have more than one derivation, but all these derivations correspond to the same parse tree: the grammar is unambiguous. Two distinct leftmost derivations for the same program must correspond to two distinct parse trees: the grammar must be ambiguous in this case.

¹² Or more than one rightmost derivation.

Ambiguity cont'd

Example (An ambiguous grammar)

Let Γexp be a grammar including the following rules:

⟨expression⟩ ::= ⟨integer⟩
              | ⟨expression⟩ ⟨operator⟩ ⟨expression⟩
⟨operator⟩ ::= - | + | * | /

where ⟨integer⟩ may generate any integer numeral (a sequence of digits).

Why is Γexp ambiguous?
- Sentences like 1 + 2 + 3 have more than one parse tree.
- Worse, sentences like 1 + 2 * 3 have more than one parse tree.

Should 1 + 2 * 3 evaluate to 9 or to 7?
In Smalltalk, the result would be 9. In general, we would like it to be 7.



Ambiguity cont'd

Example (An ambiguous grammar cont'd)

The expression 1 + 2 * 3 has two parse trees:

[Parse tree 1: ⟨expression⟩ → ⟨expression⟩ (1) ⟨operator⟩ (+) ⟨expression⟩, where the right ⟨expression⟩ → ⟨expression⟩ (2) ⟨operator⟩ (*) ⟨expression⟩ (3); i.e., the grouping 1 + (2 * 3).]
[Parse tree 2: ⟨expression⟩ → ⟨expression⟩ ⟨operator⟩ (*) ⟨expression⟩ (3), where the left ⟨expression⟩ → ⟨expression⟩ (1) ⟨operator⟩ (+) ⟨expression⟩ (2); i.e., the grouping (1 + 2) * 3.]
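To see why the choice of tree matters, here is a small Python sketch (the tuple encoding is mine, for illustration) that evaluates both trees:

    import operator

    OPS = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}

    def evaluate(tree):
        """Evaluate a parse tree given as an int leaf or a (left, op, right) tuple."""
        if isinstance(tree, int):
            return tree
        left, op, right = tree
        return OPS[op](evaluate(left), evaluate(right))

    tree_a = (1, "+", (2, "*", 3))   # 1 + (2 * 3)
    tree_b = ((1, "+", 2), "*", 3)   # (1 + 2) * 3

    print(evaluate(tree_a), evaluate(tree_b))  # 7 9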

Avoiding Ambiguity

There are a number of ways to avoid ambiguity in grammars. Here, we consider four alternative solutions.

Solution 1: Obligatory parentheses

We can modify Γexp by enforcing parentheses around complex expressions:

⟨expression⟩ ::= ⟨integer⟩
              | ( ⟨expression⟩ ⟨operator⟩ ⟨expression⟩ )
⟨operator⟩ ::= - | + | * | /

Benefit: Ambiguity has been resolved.
Drawback: Expressions such as 1 + 2 * 3, or even 1 + 2, are no longer legal. (We must type (1 + (2 * 3)) and (1 + 2) instead.)


Avoiding Ambiguity

Solution 2: Precedence of operators

We can modify Γexp by distinguishing operators of high and low priority:

⟨expression⟩ ::= ⟨term⟩
              | ⟨expression⟩ ⟨lp-operator⟩ ⟨expression⟩
⟨term⟩ ::= ⟨integer⟩
        | ( ⟨expression⟩ )
        | ⟨term⟩ ⟨hp-operator⟩ ⟨term⟩
⟨hp-operator⟩ ::= * | /
⟨lp-operator⟩ ::= + | -

where ⟨hp-operator⟩ and ⟨lp-operator⟩ are high-priority and low-priority operators, respectively.

Benefit: Expressions such as 1 + 2 * 3 can be (partially) parsed as 1 + ⟨expression⟩ but not as ⟨expression⟩ * 3.
Drawback: An expression like 1 - 2 - 3 is still ambiguous: it can be (partially) parsed both as ⟨expression⟩ - 3 and as 1 - ⟨expression⟩.

Avoiding Ambiguity

Solution 3: Associativity of operators

We can modify Γexp by introducing associativity of operators:

⟨expression⟩ ::= ⟨integer⟩
              | ⟨expression⟩ ⟨operator⟩ ⟨integer⟩
⟨operator⟩ ::= * | / | + | -

Benefit: The operators in this grammar are left-associative; the expression 1 - 2 - 3 can only be (partially) parsed as ⟨expression⟩ - 3, and not as 1 - ⟨expression⟩.
Drawback: All operators have equal precedence; an expression like 1 - 2 * 3 can only be (partially) parsed as ⟨expression⟩ * 3, and not as 1 - ⟨expression⟩.


Ambiguity cont'd

Solution 4: Combine associativity, precedence, and parentheses

We can modify Γexp by adding all of the above:

⟨expression⟩ ::= ⟨term⟩
              | ⟨expression⟩ ⟨lp-operator⟩ ⟨term⟩
⟨term⟩ ::= ⟨factor⟩
        | ⟨term⟩ ⟨hp-operator⟩ ⟨factor⟩
⟨factor⟩ ::= ⟨integer⟩
          | ( ⟨expression⟩ )
⟨hp-operator⟩ ::= * | /
⟨lp-operator⟩ ::= + | -
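As an illustration of how such an unambiguous grammar is used in practice, here is a minimal recursive-descent parser/evaluator in Python for the grammar above (a sketch of mine, not the lecture's code; it assumes tokens are integers, single-character operators, and parentheses):

    import re

    def tokenize(text):
        # integers, operators and parentheses; whitespace is skipped
        return re.findall(r"\d+|[+\-*/()]", text)

    def parse_expression(tokens, i):
        # <expression> ::= <term> | <expression> <lp-operator> <term>  (left-associative)
        value, i = parse_term(tokens, i)
        while i < len(tokens) and tokens[i] in "+-":
            op, (rhs, i) = tokens[i], parse_term(tokens, i + 1)
            value = value + rhs if op == "+" else value - rhs
        return value, i

    def parse_term(tokens, i):
        # <term> ::= <factor> | <term> <hp-operator> <factor>
        value, i = parse_factor(tokens, i)
        while i < len(tokens) and tokens[i] in "*/":
            op, (rhs, i) = tokens[i], parse_factor(tokens, i + 1)
            value = value * rhs if op == "*" else value / rhs
        return value, i

    def parse_factor(tokens, i):
        # <factor> ::= <integer> | ( <expression> )
        if tokens[i] == "(":
            value, i = parse_expression(tokens, i + 1)
            return value, i + 1          # skip the closing ')'
        return int(tokens[i]), i + 1

    print(parse_expression(tokenize("1 + 2 * 3"), 0)[0])    # 7
    print(parse_expression(tokenize("(1 + 2) * 3"), 0)[0])  # 9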

Scanning

What is scanning?

Scanning is the process of translating programs from the string-of-characters input format into the sequence-of-tokens intermediate format.

We have seen scanning in action in the mdc example:
- the lexemizer took as input a string of characters and returned a sequence of lexemes;
- the tokenizer took as input a sequence of lexemes and returned a sequence of tokens.

These two steps are usually merged into one pass, called scanning (but sometimes lexing or tokenization is used for both operations, and scanning may be used for only creating the lexemes).


Scanning cont'd

How do we design and implement a scanner?

Building a scanner requires a number of steps:
1. Specification of the microsyntax (the lexical structure) of the language, typically using regular expressions (regexes).
2. Based on the regexes, a nondeterministic finite automaton (NFA) is built that recognizes lexemes of the language.
3. A deterministic finite automaton (DFA) equivalent to the NFA is built.
4. The DFA is implemented using a nested control structure that processes the input one character at a time.

All steps can be realized manually, but there exist tools which
- allow one to specify the lexical structure using regular expressions, and
- build an implementation of the DFA automatically.

We shall revisit the mdc example and build a scanner both manually and using a scanner-building tool.

Scanning cont'd

Before we implement an mdc scanner, we first have a look at a recognizer for mdc lexemes.
- A scanner processes an input string and returns a list of lexemes (or tokens).
- A recognizer checks whether the whole input string is a single lexeme.

Example (A recognizer for mdc lexemes)

Step 1: The microsyntax of mdc is trivially specified with the following regular expressions:

⟨command⟩ ::= [pf] (exactly one p or one f)
⟨operator⟩ ::= [\+\-\*\/] (analogously; symbols escaped with \)
⟨integer⟩ ::= [0-9]+ (one or more digits)


Scanning cont'd

Example (A recognizer for mdc lexemes cont'd)

Step 3: The regex specification is realized by the following DFA:¹³

[DFA: from the start state, p or f leads to state cmd; +, -, *, or / leads to state op; a digit 0, ..., 9 leads to state int, which loops on further digits 0, ..., 9. The states cmd, op, and int are accepting.]

¹³ We skip Step 2; see the further reading section for references if you need more details.

Scanning cont'd

Example (A recognizer for mdc lexemes cont'd)

Step 4: An algorithm for the mdc recognizer DFA:¹⁴

input: string of characters; output: boolean
state ← start; char ← next()
while char ≠ EOF:
    if state = start:
        if char ∈ {p, f}: state ← cmd
        else if char ∈ {+, -, *, /}: state ← op
        else if char ∈ {0, ..., 9}: state ← int
        else: return false
    else if state ∈ {cmd, op}: return false
    else if state = int:
        if char ∉ {0, ..., 9}: return false
    char ← next()
if state ∈ {cmd, op, int}: return true
else: return false

¹⁴ Notation varies. EOF means end of file (input). Each call to next() returns the next character from the input.
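A direct Python transcription of this recognizer DFA might look as follows (my own sketch; the lecture's actual implementation is in code/mdc-recognizer.oz):

    def mdc_recognize(text):
        """Return True iff the whole input string is a single mdc lexeme."""
        state = "start"
        for char in text:                      # plays the role of 'char = next()' until EOF
            if state == "start":
                if char in "pf":
                    state = "cmd"
                elif char in "+-*/":
                    state = "op"
                elif char.isdigit():
                    state = "int"
                else:
                    return False
            elif state in ("cmd", "op"):       # nothing may follow a command or operator
                return False
            elif state == "int":
                if not char.isdigit():
                    return False
        return state in ("cmd", "op", "int")   # accepting states

    print(mdc_recognize("p"), mdc_recognize("123"), mdc_recognize("1 2 +"))  # True True False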

Scanning cont'd

Try it!
The file code/mdc-recognizer.oz contains an implementation of the mdc recognizer and a few simple test cases. Open the file in the OPI (oz &, then C-x C-f). Execute the code (C-. C-b). What happens?
- {MDCRecognizer "p"} evaluates to true, because the input is a command.
- {MDCRecognizer "123"} evaluates to true, because the input is an integer.
- {MDCRecognizer "1 2 +"} evaluates to false, because the input is not a valid lexeme, even though it is a valid sentence (a legal sequence of valid lexemes) in mdc.

Scanning cont'd

The recognizer checks whether the whole string is a single lexeme, but we want more:
- process strings that include more than one lexeme;
- return a sequence of classified lexemes rather than a yes/no answer.

In the previous implementation of mdc, all lexemes in a program had to be separated by whitespace. This leads to a tradeoff:
- it is more convenient to implement the lexemizer: just split the input by whitespace;
- it is less convenient to use the language: the programmer must separate all lexemes with whitespace.

We shall now develop a scanner that makes whitespace between lexemes optional (unless we want to separate two numerals).


Scanning cont'd

Example (A scanner for mdc)

Step 4: An algorithm for the mdc scanner DFA:¹⁵

input: string of characters; output: sequence of tokens
tokens ← ⟨⟩; state ← start; char ← next(); seen ← ⟨⟩
while char ≠ EOF:
    if state = start:
        if char ∈ {p, f}: append ⟨cmd, char⟩ to tokens
        else if char ∈ {+, -, *, /}: append ⟨op, char⟩ to tokens
        else if char ∈ {0, ..., 9}: state ← int; seen ← char
        else if char ∉ S: error(char)
        char ← next()
    else if state = int:
        if char ∈ {0, ..., 9}: concatenate char to seen; char ← next()
        else: append ⟨int, seen⟩ to tokens; seen ← ⟨⟩; state ← start
if state = int: append ⟨int, seen⟩ to tokens
return tokens
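A Python sketch of this scanner (again my own illustration, not the lecture's Oz code; it assumes whitespace simply separates lexemes and is otherwise ignored):

    def mdc_scan(text):
        """Scan an mdc program into a list of (class, lexeme) tokens."""
        tokens, state, seen, i = [], "start", "", 0
        while i < len(text):                       # char = next() ... until EOF
            char = text[i]
            if state == "start":
                if char in "pf":
                    tokens.append(("cmd", char))
                elif char in "+-*/":
                    tokens.append(("op", char))
                elif char.isdigit():
                    state, seen = "int", char
                elif not char.isspace():           # assumption: whitespace is skipped
                    raise ValueError("unexpected character: %r" % char)
                i += 1
            elif state == "int":
                if char.isdigit():
                    seen += char
                    i += 1
                else:                              # do not consume char; re-read it in state start
                    tokens.append(("int", seen))
                    seen, state = "", "start"
        if state == "int":
            tokens.append(("int", seen))
        return tokens

    print(mdc_scan("1 2+34*p"))
    # [('int', '1'), ('int', '2'), ('op', '+'), ('int', '34'), ('op', '*'), ('cmd', 'p')]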

Lecture Outline

Programming Languages: Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary

¹⁵ tokens maintains a list of tokens recognized so far. seen maintains a string of characters seen since the most recently recognized token. Angle brackets (⟨ and ⟩) denote tokens (class-lexeme pairs).

Summary

This time
- syntax, grammars, derivations, parse trees, ambiguity
- recognizing, scanning
- design and implementation of an mdc scanner

Note! The code examples are used as an illustration; we will return to (some parts of) them when you learn more about the syntax and semantics of Oz.

Next time
- syntax and semantics of the declarative kernel language

Questions

Summary cont'd

Homework
- Examine and try out today's code; read the Mozart/Oz documentation if necessary.

Pensum
- Most of today's slides, except for the implementational details of the mdc scanners and the recognizer and scanner DFAs.

Further reading
- See, e.g., Ch. 3 in Sebesta, Concepts of Programming Languages; Ch. 2 in Scott, Programming Language Pragmatics; Ch. 2 in Cooper and Torczon, Engineering a Compiler (a detailed, in-depth but readable presentation).

