Programming Languages: Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary
V = {c}, S = {a, b}, R = {(c, ε), (c, aca), (c, bcb)}, vs = c.
Is the string abacaba valid in L? Is ababbbaba valid in L? What is the language L generated by the grammar?
3. Return the resulting string of terminals.

What if the string contains a non-terminal v for which there is no rule in R that has v on its left-hand side?
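The recipe, including the stuck case just mentioned, can be sketched in Python (a sketch only; the dictionary encoding and the step cap are our own choices, not part of the formal definition):

```python
import random

# Grammar G1 from the example: the only non-terminal is "c";
# its right-hand sides are "" (the empty string), "aca", and "bcb".
RULES = {"c": ["", "aca", "bcb"]}

def derive(start, rules, rng=None, max_steps=100):
    """Repeatedly replace the leftmost non-terminal until only terminals remain."""
    rng = rng or random.Random(0)
    form = start
    for _ in range(max_steps):
        idx = next((i for i, ch in enumerate(form) if ch in rules), None)
        if idx is None:          # no non-terminals left: a valid sentence
            return form
        rhs = rules[form[idx]]   # a non-terminal with no rules would fail here
        form = form[:idx] + rng.choice(rhs) + form[idx + 1:]
    raise RuntimeError("derivation did not terminate")

print(derive("c", RULES))  # some even-length palindrome over {a, b}
```

Every string this procedure returns is a sentence of L; which one you get depends on the rule chosen at each step.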
Backus-Naur Form
BNF Notation
Non-terminals are written in italics, using angle brackets, etc.; terminals are written in a monotype font, enclosed in quotation marks, etc. Each rule is written as a non-terminal, a special production symbol (typically ::=), and then a sequence of terminals and non-terminals, or the symbol ε.
Grammars are usually written using a special notation: the Backus-Naur Form (BNF). BNF is often extended with convenience symbols to shorten the notation: the Extended BNF (EBNF). BNF (and EBNF) is a metalanguage, a language for talking about languages. We will use EBNF extensively during the course.
By convention,
the terminals and non-terminals of the grammar are those, and only those, included in at least one of the rules; the left-hand side (the first element) of the topmost rule is the start variable vs.
c ::= ε
c ::= a c a
c ::= b c b

V = {c}, S = {a, b}, R = {(c, ε), (c, a c a), (c, b c b)}, vs = c.
L(G1) = {ε, aa, bb, aaaa, baab, abba, bbbb, aaaaaa, baaaab, ...}

or, more compactly, as

c ::= ε | a c a | b c b
The special symbol | has the meaning of or, and is an element of the metalanguage, not of the language specified by the grammar.
Metasyntactic extensions
Convenient extensions to the metalanguage include:
the special symbols [ and ], used to enclose a subsequence that appears in the string at most once;
the special symbols { and }, used to enclose a subsequence that appears in the string any number of times.
Alternatively, we can use only the symbols { and } together with a superscript to specify the number of occurrences:

{ sequence }2 means two subsequent occurrences of sequence;
{ sequence }+ means at least one occurrence of sequence;
{ sequence }* means any number of occurrences of sequence.
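These metasyntactic operators line up with regex quantifiers, which gives a quick way to experiment; a sketch using Python's re module (the signed-integer rule is an invented illustration, not from the lecture):

```python
import re

# EBNF [x] corresponds to the regex x?, {x} to x*, {x}+ to x+, {x}2 to x{2}.
# Illustration: signed-integer ::= [ "-" ] digit { digit }
signed_int = re.compile(r"-?[0-9][0-9]*")

print(bool(signed_int.fullmatch("-42")))  # True
print(bool(signed_int.fullmatch("-")))    # False: at least one digit required
```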
Regular Grammars
What is a regular language?
A regular language is a language generated by a regular grammar.
In a regular grammar, all rules are of one of the forms:2

v ::= s v
v ::= s
v ::= ε
Note:
All regular languages are context-free, but not all context-free languages are regular. All context-free languages are context-sensitive [sic], but not all context-sensitive languages are context-free. etc.
substring ::= a rest | b rest
rest ::= c rest | ε
Regular grammars are conveniently expressed with regular expressions. The above could be written as (a|b)c*, (?:a|b)c*, or [ab]c*, etc.
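The three regex notations are interchangeable; a quick check with Python's re module (a sketch):

```python
import re

# All three regexes from the text denote the same regular language.
patterns = [r"(a|b)c*", r"(?:a|b)c*", r"[ab]c*"]

for s in ["a", "bccc", "ac", "", "cab"]:
    verdicts = {bool(re.fullmatch(p, s)) for p in patterns}
    assert len(verdicts) == 1        # the three notations always agree
    print(repr(s), verdicts.pop())
```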
2 These are right-regular grammars. In left-regular grammars, the first rule form above is replaced by v ::= v s.
Lecture 3: Syntax: Grammars, Derivations, Parse Trees. Scanning. (11/54)
Context-Free Grammars
In a context-free grammar, all rules are of the form:

v ::= α

where v ∈ V and α ∈ (V ∪ S)* (the set of all sequences of variables from V and symbols from S).3

In a context-sensitive grammar, all rules are of the form:

α v β ::= α γ β

where v ∈ V, and α, β, γ ∈ (V ∪ S)*.

In an unrestricted grammar, all rules are of the form:

α ::= β

where α, β ∈ (V ∪ S)* and α is non-empty.
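The rule-form restrictions above can be checked mechanically; a minimal sketch, assuming a toy encoding where a grammar is a list of (lhs, rhs) strings and uppercase letters stand for non-terminals:

```python
def is_context_free(rules):
    """Every left-hand side is a single non-terminal."""
    return all(len(lhs) == 1 and lhs.isupper() for lhs, _ in rules)

def is_right_regular(rules):
    """Each rule is of the form V ::= s V, V ::= s, or V ::= (empty)."""
    def ok(rhs):
        return (rhs == "" or
                (len(rhs) == 1 and rhs.islower()) or
                (len(rhs) == 2 and rhs[0].islower() and rhs[1].isupper()))
    return is_context_free(rules) and all(ok(rhs) for _, rhs in rules)

g1 = [("C", ""), ("C", "aCa"), ("C", "bCb")]              # the palindrome grammar
g2 = [("V", "aW"), ("V", "bW"), ("W", "cW"), ("W", "")]   # generates (a|b)c*
print(is_context_free(g1), is_right_regular(g1))  # True False
print(is_context_free(g2), is_right_regular(g2))  # True True
```

As the output shows, the palindrome grammar is context-free but not regular, which matches the language hierarchy stated above.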
Regular grammars are commonly used to define the microsyntax of programming languages: the syntax of lexemes as sequences of symbols from the alphabet of characters.4 Context-free grammars are used to define the (macro)syntax of programming languages: the syntax of programs as sequences of symbols from the alphabet of tokens (classified lexemes).5 Additional constraints may be needed to further constrain the syntax, e.g., by specifying that variable identifiers can be used only after they have been declared, etc.6
4 CTMCP uses the term lexical syntax rather than microsyntax; others use the term lexical structure.
5 Macrosyntax is usually referred to as syntactic structure.
6 The less restrictive the metalanguage used to define the grammar, the more restrictive the grammar can be with respect to the specified language.
The initial input is linear: it is a sequence of symbols from the alphabet of characters. A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens. A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program: a syntax tree (parse tree). The syntax tree is further processed, e.g., by an interpreter or by a compiler.
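The character-to-token step of this pipeline can be sketched as follows (a toy tokenizer, not the course's mdc implementation; the token classes are illustrative):

```python
import re

def scan(text):
    """Characters -> tokens: a list of (class, lexeme) pairs."""
    token_spec = [("int", r"[0-9]+"), ("op", r"[+*]"), ("ws", r"\s+")]
    pos, tokens = 0, []
    while pos < len(text):
        for cls, pat in token_spec:
            m = re.match(pat, text[pos:])
            if m:
                if cls != "ws":          # whitespace separates, but is not a token
                    tokens.append((cls, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"bad character at position {pos}")
    return tokens

print(scan("1 + 23"))  # [('int', '1'), ('op', '+'), ('int', '23')]
```

A parser would then consume this token sequence and build the tree.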
7 There, both the microsyntax and the syntax were trivial, no parsing was really needed as the intermediate representation was linear and colinear with the list of tokens, and no compiler was developed.
A variable (a variable name) consists of an uppercase letter followed by any number of word characters.
where skip, if, then, else, and end are symbols from the alphabet of lexemes.
if X then skip else if Y then skip else skip end end is a valid statement in Oz; if X then skip end and if x then skip else skip end are not.8
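The microsyntax of variable given above corresponds directly to a regex; a sketch:

```python
import re

# An uppercase letter followed by any number of word characters.
VARIABLE = re.compile(r"[A-Z]\w*")

for name in ["X", "Y2", "x", "myVar"]:
    print(name, bool(VARIABLE.fullmatch(name)))
```

This is why if x then skip else skip end is rejected: x is not a valid variable lexeme.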
Note: In some programming languages the programmer has control of whether indentation is essential for the syntactic and semantic validity of programs or not.
Derivations
Following the recipe for using a grammar explained earlier, we can derive sentences in the language L(G) specified by a grammar G in a sequence of steps.
In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule. The first sentential form is the start variable vs alone. The last sentential form is a valid sentence, composed only of terminals.
(* invalid, 4-space indentation required *)
#light
let hello = fun name ->
  printf "hello, %a" name
Sequences of sentential forms starting with vs and ending with a sentence in L(G), obtained as specified above, are called derivations.
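For grammar G1, a derivation can be recorded as the list of its sentential forms; a sketch, where the rule applied at each step is given by its index (0 for ε, 1 for a c a, 2 for b c b):

```python
# Grammar G1: "c" is the only non-terminal.
RULES = {"c": ["", "aca", "bcb"]}

def leftmost_derivation(choices):
    """Apply the chosen rules to the leftmost non-terminal; return all sentential forms."""
    forms = ["c"]
    for k in choices:
        form = forms[-1]
        i = form.index("c")  # position of the leftmost non-terminal
        forms.append(form[:i] + RULES["c"][k] + form[i + 1:])
    return forms

print(leftmost_derivation([1, 2, 0]))  # ['c', 'aca', 'abcba', 'abba']
```

The first form is vs, the last contains only terminals: exactly the definition of a derivation.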
Derivations contd
The following are two of infinitely many derivations possible to obtain with the previously defined grammar G1.9
Derivations contd
A derivation such that in each step it is the leftmost non-terminal that is replaced is called a leftmost derivation. A derivation such that in each step it is the rightmost non-terminal that is replaced is called a rightmost derivation. There can be derivations that are neither leftmost nor rightmost.
Given a non-terminal v and a string s, there can be:
no derivation of s from v (if s is not valid in the defined language);
exactly one derivation of s from v;
more than one derivation.
c ::= ε | a c a | b c b.
Derivations contd
Example (A leftmost derivation)
1. statement
2. if variable then statement else statement end
3. if A then statement else statement end
4. if A then skip else statement end
5. if A then skip else if variable then statement else statement end end
...
11. if A then skip else if B then if C then skip else skip end else skip end end
Syntax Trees
Example (Syntax tree)
A parse tree (a syntax tree) is a structured representation of a program.
Parse trees are generated in the process of parsing programs. A parser is a function (a program) that takes as input a sequence of tokens (the output of a lexer) and returns a nested data structure corresponding to a parse tree.
Does the sequence ba belong to L(G)? Yes, it has the following parse tree:
[parse tree figure with root v deriving the sentence b a]
The data structure returned by the parser is an internal (intermediate) representation of the program. A parse tree can be used to:
interpret the program (in interpreted languages); generate target code (in compiled languages); optimize the intermediate code (in both interpreted and compiled languages).
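One common shape for the parser's output is a nested tuple per node; a sketch (this particular representation is our own choice for illustration, not the course's):

```python
# A parse tree as nested tuples: (node label, children...).
# Leaves are plain strings (the terminals).
tree = ("statement",
        "if", ("variable", "A"),
        "then", ("statement", "skip"),
        "else", ("statement", "skip"),
        "end")

def leaves(t):
    """Read the sentence back from the tree (the frontier of the parse tree)."""
    if isinstance(t, str):
        return [t]
    out = []
    for child in t[1:]:   # t[0] is the node label
        out.extend(leaves(child))
    return out

print(" ".join(leaves(tree)))  # if A then skip else skip end
```

Reading the leaves left to right recovers exactly the parsed sentence, which is what makes the tree a faithful internal representation.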
Suppose we rewrite the grammar above as

statement ::= skip
  | if variable then statement else statement
  | if variable then statement
with the microsyntactic definition of variable given earlier. What is the parse tree for if A then skip else if B then skip else skip end end?
[parse tree figure: statement expanding to if variable then statement else statement end, with the inner statements expanded further]
How many syntax trees does if A then if B then skip else skip have, given this grammar? There are two parse trees for this sequence; see the next slide.
[figure: the two parse trees of if A then if B then skip else skip, one attaching the else to the inner if, the other to the outer if]
Does it matter that a sentence has more than one parse tree?
For a sentence like if A then if B then skip else skip where all the conditional actions are skip (do nothing, noop), it does not matter much.
In general, it does matter, since what actions will be taken and in which order depends on how the program is understood by the interpreter (or compiler), which in turn depends on how the program is parsed. It is therefore important that the specification of the syntax is unambiguous, and that the programmer does not make false assumptions about how the code will be parsed.
a = True, b = True: both print 1
a = True, b = False: the first prints 2, the second nothing
a = False, b = True: the second prints 2, the first nothing
a = False, b = False: the second prints 2, the first nothing
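The table can be reproduced in Python, where indentation selects the parse; the two functions below return (rather than print) 1, 2, or nothing, mirroring the two readings of the dangling else:

```python
def first(a, b):
    # else attached to the inner if
    if a:
        if b:
            return 1
        else:
            return 2
    return None

def second(a, b):
    # else attached to the outer if
    if a:
        if b:
            return 1
    else:
        return 2
    return None

for a in (True, False):
    for b in (True, False):
        print(a, b, first(a, b), second(a, b))
```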
The lack of end would add ambiguity to the grammar, which is resolved by involving whitespace in the specification.
Ambiguity
A grammar is ambiguous if a sentence can be parsed in more than one way:
the program has more than one parse tree, that is, the program has more than one leftmost derivation.
Note: The fact that a program has more than one derivation is not sufficient to consider the grammar ambiguous. In practice, most programs have more than one derivation, but all these derivations correspond to the same parse tree: the grammar is unambiguous. Two distinct leftmost derivations for the same program must correspond to two distinct parse trees: the grammar must be ambiguous in this case.
Example (An ambiguous grammar)

expression ::= integer | expression operator expression
operator ::= + | - | * | /

where integer may generate any integer numeral (a sequence of digits). Why is exp ambiguous?
Sentences like 1 + 2 + 3 have more than one parse tree. Worse, sentences like 1 + 2 * 3 have more than one parse tree. In Smalltalk, the result would be 9. In general, we would like it to be 7.
Should 1 + 2 * 3 evaluate to 9 or to 7?
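The two answers correspond to the two parse trees; in Python notation:

```python
# Smalltalk-style, strictly left-to-right grouping:
print((1 + 2) * 3)  # 9
# Conventional precedence, where * binds tighter than +:
print(1 + (2 * 3))  # 7
```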
Ambiguity contd
Example (An ambiguous grammar contd)
The expression 1 + 2 * 3 has two parse trees:
[figure: the two parse trees of 1 + 2 * 3, one grouping it as (1 + 2) * 3, the other as 1 + (2 * 3)]
Avoiding Ambiguity
There are a number of ways to avoid ambiguity in grammars. Here, we consider four alternative solutions.
Benefit: Ambiguity has been resolved.
Drawback: Expressions such as 1 + 2 * 3, or even 1 + 2, are no longer legal. (We must type (1 + (2 * 3)) and (1 + 2) instead.)
Avoiding Ambiguity
Solution 2: Precedence of operators
We can modify exp by distinguishing operators of high and low priority:

expression ::= term | expression lp-operator expression
term ::= integer | ( expression ) | term hp-operator term
hp-operator ::= * | /
lp-operator ::= + | -
where hp-operator and lp-operator are high-priority and low-priority operators, respectively.
Benefit: Expressions such as 1 + 2 * 3 can be (partially) parsed as 1 + expression but not as expression * 3.
Drawback: An expression like 1 - 2 - 3 is still ambiguous: it can be (partially) parsed both as expression - 3 and as 1 - expression.
Benefit: The operators in this grammar are left-associative; the expression 1 - 2 - 3 can only be (partially) parsed as expression - 3, and not as 1 - expression.
Drawback: All operators have equal precedence; an expression like 1 - 2 * 3 can only be (partially) parsed as expression * 3, and not as 1 - expression.
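Combining the two solutions (precedence and left-associativity) is what practical expression parsers do; a minimal recursive-descent sketch in Python (the names and the integer-division choice for / are our own, not the course's):

```python
import re

def tokenize(s):
    return re.findall(r"\d+|[-+*/()]", s)

def parse_expression(tokens):
    """expression ::= term { lp-operator term }, evaluated left-associatively."""
    value = parse_term(tokens)
    while tokens and tokens[0] in "+-":
        op = tokens.pop(0)
        rhs = parse_term(tokens)
        value = value + rhs if op == "+" else value - rhs
    return value

def parse_term(tokens):
    """term ::= factor { hp-operator factor }."""
    value = parse_factor(tokens)
    while tokens and tokens[0] in "*/":
        op = tokens.pop(0)
        rhs = parse_factor(tokens)
        value = value * rhs if op == "*" else value // rhs  # integer division
    return value

def parse_factor(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        value = parse_expression(tokens)
        tokens.pop(0)  # consume ")"
        return value
    return int(tok)

print(parse_expression(tokenize("1 + 2 * 3")))  # 7
print(parse_expression(tokenize("1 - 2 - 3")))  # -4
```

The two while loops give left-associativity; the two levels of functions give precedence.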
Scanning
What is scanning?
Scanning is the process of translating programs from the string-of-characters input format into the sequence-of-tokens intermediate format. We have seen scanning in action in the mdc example:
the lexemizer took as input a string of characters and returned a sequence of lexemes; the tokenizer took as input a sequence of lexemes and returned a sequence of tokens.
These two steps are usually merged into one pass, called scanning (but sometimes lexing or tokenization is used for both operations, and scanning may be used for creating the lexemes only).
Scanning contd
How do we design and implement a scanner?
Building a scanner requires a number of steps:
1. Specification of the microsyntax (the lexical structure) of the language, typically using regular expressions (regexes).
Scanning contd
Before we implement an mdc scanner, we first have a look at a recognizer for mdc lexemes.
A scanner processes an input string and returns a list of lexemes (or tokens). A recognizer checks whether the whole input string is a single lexeme.
2. Based on the regexes, a nondeterministic finite automaton (NFA) is built that recognizes lexemes of the language.
3. A deterministic finite automaton (DFA) equivalent to the NFA is built.
4. The DFA is implemented using a nested control structure that processes the input one character at a time.
All steps can be realized manually, but there exist tools which
allow one to specify the lexical structure using regular expressions, and build an implementation of the DFA automatically.
We shall revisit the mdc example and build a scanner both manually and using a scanner-building tool.
Scanning contd
Example (A recognizer for mdc lexemes contd)
Step 3: The regex specification is realized by the following DFA:
Scanning contd
Example (A recognizer for mdc lexemes contd)
Step 4: An algorithm for the mdc recognizer DFA:14
input: string of characters; output: boolean
state ← start; char ← next()
while char ≠ EOF:
    if state = start:
        if char ∈ {p, f}: state ← cmd
        else if char ∈ {+, -, *, /}: state ← op
        else if char ∈ {0, ..., 9}: state ← int
        else: return false
    else if state ∈ {cmd, op}: return false
    else if state = int:
        if char ∉ {0, ..., 9}: return false
    char ← next()
if state ∈ {cmd, op, int}: return true
else: return false
[DFA diagram: from start, p or f leads to cmd; +, -, *, / leads to op; a digit 0, ..., 9 leads to int; int loops on digits 0, ..., 9]
13 We skip Step 2; see the further reading section for references if you need more details.
14 Notation varies. EOF means end of file (input). Each call to next() returns the next character from the input.
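A direct Python transcription of the recognizer DFA above might look like this (a sketch; the course's own implementation is the Oz file code/mdc-recognizer.oz):

```python
def mdc_recognize(s):
    """Return True iff the whole string is a single mdc lexeme."""
    state = "start"
    for ch in s:
        if state == "start":
            if ch in "pf":
                state = "cmd"
            elif ch in "+-*/":
                state = "op"
            elif ch.isdigit():
                state = "int"
            else:
                return False
        elif state in ("cmd", "op"):
            return False          # cmd and op lexemes are single characters
        elif state == "int":
            if not ch.isdigit():
                return False
    return state in ("cmd", "op", "int")

print(mdc_recognize("p"))      # True: the input is a command
print(mdc_recognize("123"))    # True: the input is an integer
print(mdc_recognize("1 2 +"))  # False: not a single lexeme
```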
Scanning contd
The recognizer checks whether the whole string is a single lexeme, but we want more:
process strings that include more than one lexeme; return a sequence of classied lexemes rather than a yes/no answer.
In the previous implementation of mdc, all lexemes in a program had to be separated by whitespace. This leads to a tradeoff:
it is more convenient to implement the lexemizer: just split the input by whitespace; it is less convenient to use the language: the programmer must separate all lexemes with whitespace.
We shall now develop a scanner that makes whitespace between lexemes optional (unless we want to separate two numerals).
Try it! The file code/mdc-recognizer.oz contains an implementation of the mdc recognizer and a few simple test cases. Open the file in the OPI (oz &, then C-x C-f). Execute the code (C-. C-b). What happens? {MDCRecognizer "p"} evaluates to true, because the input is a command. {MDCRecognizer "123"} evaluates to true, because the input is an integer. {MDCRecognizer "1 2 +"} evaluates to false, because the input is not a valid lexeme, even though it is a valid sentence (a legal sequence of valid lexemes) in mdc.
Scanning contd
Example (A scanner for mdc)
Step 4: An algorithm for the mdc scanner DFA:15
input: string of characters; output: sequence of tokens
tokens ← (); state ← start; char ← next(); seen ← ()
while char ≠ EOF:
    if state = start:
        if char ∈ {p, f}: append ⟨cmd, char⟩ to tokens
        else if char ∈ {+, -, *, /}: append ⟨op, char⟩ to tokens
        else if char ∈ {0, ..., 9}: state ← int; seen ← char
        else if char ∉ S: error(char)
        char ← next()
    else if state = int:
        if char ∈ {0, ..., 9}: concatenate char to seen; char ← next()
        else: append ⟨int, seen⟩ to tokens; seen ← (); state ← start
if state = int: append ⟨int, seen⟩ to tokens
return tokens
15 tokens maintains a list of tokens recognized so far. seen maintains a string of characters seen since the most recently recognized token. Angle brackets ⟨ and ⟩ denote tokens (class-lexeme pairs).
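The scanner DFA admits the same direct transcription; a Python sketch (tokens as (class, lexeme) pairs; the whitespace handling is our own simplification):

```python
def mdc_scan(s):
    """Scan an mdc program into (class, lexeme) tokens; whitespace is optional
    between lexemes, except between two numerals."""
    tokens, state, seen = [], "start", ""
    for ch in s:
        if state == "int":
            if ch.isdigit():
                seen += ch
                continue
            tokens.append(("int", seen))     # the numeral just ended
            seen, state = "", "start"
        if ch in "pf":
            tokens.append(("cmd", ch))
        elif ch in "+-*/":
            tokens.append(("op", ch))
        elif ch.isdigit():
            state, seen = "int", ch
        elif not ch.isspace():
            raise ValueError(f"unexpected character: {ch!r}")
    if state == "int":                       # flush a trailing numeral
        tokens.append(("int", seen))
    return tokens

print(mdc_scan("1 2+p"))  # [('int', '1'), ('int', '2'), ('op', '+'), ('cmd', 'p')]
```

Note that 1 2+p scans correctly with no space before + or p; only the two numerals need a separator.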
Summary

syntax, grammars, derivations, parse trees, ambiguity
recognizing, scanning
design and implementation of an mdc scanner

Summary contd

Pensum: most of today's slides, except for the implementational details of the mdc scanners and of the recognizer and scanner DFAs. Examine and try out today's code; read the Mozart/Oz documentation if necessary.

Further reading: see, e.g., Ch. 3 in Sebesta, Concepts of Programming Languages; Ch. 2 in Scott, Programming Language Pragmatics; Ch. 2 in Cooper and Torczon, Engineering a Compiler (a detailed, in-depth but readable presentation).

Note! The code examples are used as an illustration; we will return to (some parts of) them when you learn more about the syntax and semantics of Oz.

Next time: ...