CS252-Slides-2015-to..

Report
CS252: Systems Programming
Ninghui Li
Topic 5: Parsing
Prepared by Evan Hanau
[email protected]
Introduction to Parsing with Yacc
An Introduction to Parsing with Yacc
•
Context-Free Grammars
• Yacc Parsing
• An example Infix Calculator Program
Context-Free Grammar
Background: The Context-Free Grammar
• By CS252 you are already somewhat
familiar with Regular Expressions.
• Regular expressions can be used to
describe regular languages, which belong
to a larger classification of language types.
Context-Free Grammar
In CS, we classify languages on the Chomsky
Hierarchy.
Type-0 Recursively Enumerable
Type-1 Context-Sensitive
Type-2 Context-Free
Type-3 Regular
Type-(i) is a superset of Type-(i+1)
Context-Free Grammar
• Languages generated by regular
expressions belong to type 3.
• Note: Your specific regular expression
engine (e.g. POSIX extended RE) is likely
capable of more complex productions.
• In any case, we need more than regular
expressions to parse computer
programming languages and shell scripts.
Context-Free Grammar
• You can do a great deal with regular
expressions.
• Exercise: Create a regular expression that
matches on any English phrase that is a
palindrome, for instance the string “some
men interpret nine memos”.
Context-Free Grammar
•
This is in fact not possible with regex (by its strict CS
definition!). You would be limited to palindromes of
a finite length only.
•
•
RE cannot express “anbn”, a string with some number of
a’s followed by equal number of b’s
• The expression a*b* does not require number of a’s
equal that of b’s
We must use a context-free grammar to describe
palindromes and other constructs.
• More powerful than a regular expression, and useful
when some notion of “what came before” is required.
Backus-Naur form
•
BNF or Backus-Naur form is used in CS to
describe context-free grammars. It is often used
to describe the syntax of programming
languages.consists of one or more of the
following:
<nonterminal> ::= __expression__
•
Where __expression__ consists of one or more
terminals and nonterminals or nothing (epsilon).
US Post Address in Backus-Naur
form (from wikipedia)
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
Context-Free Grammar for
Simple Expressions
Let’s define a grammar for a primitive add or
multiply expression:
<expr> ::= <expr> * <expr>
| <expr> + <expr>
| number
In this case, <expr> is a terminal and number,
*, and + are the nonterminals.
Context-Free Grammar
• Clearly, there is some ambiguity here,
because operator precedence (sometimes
referred to as binding) is not defined.
• The grammar does not distinguish between
2+2*2+2 = 16 (incorrect under normal
rules) or 2+2*2+2 = 8 (correct).
Context-Free Grammar
One Solution: Define expressions of different
levels:
<expr> ::= <add_expr>
<add_expr> ::= <add_expr> + <mul_expr>
| <mul_expr>
<mul_expr> ::= <mul_expr> * number
| number
Now, multiplication will bind tighter than
addition (this may require a few sample
expressions to wrap your head around!)
Context-Free Grammar
Associativity follows from the above example
(Hint: What side of the multiply and add
operation did we have the “deeper”
production on?)
CFG for palindrome
<palindrome> ::= letter
| 
// empty string
<palindrome> ::= “a” <palindrome> “a”
| “b” <palindrome> “b”
….
Or, <palindrome> ::= letter <palindrome> letter
However, we need to check the two letters are the same.
CFG for anbn:
<anbn> = 
// empty string
| “a” <anbn> “b”
Chomsky Hierarchy (From
Wikipedia Page)
Grammar
Languages
Automaton
Production rules
(constraints)
Recursively
enumerable
Turing machine
(no restrictions)
Context-sensitive
Linear-bounded
non-deterministic
Turing machine
(equivalently,
Right side no
shorter than left)
Type-2
Context-free
Nondeterministic push
down automaton
Type-3
Regular
Finite state
automaton
Type-0
Type-1
and
Chomsky Hierarchy Revisited.
Type-0 Recursively Enumerable
Type-1 Context-Sensitive
Cannot encode all strings r1r2 such that r1 and r2
are two regular expressions that are equivalent
Type-2 Context-Free (Pushdown Automaton, i.e,
Finite State Automaton with a Stack)
Can encode anbn, but not anbncn
Type-3 Regular (Finite State Automaton)
Can encode a*b*, but not anbn
Why is Context-Free Grammar
Called Context Free?
In a CFG, the left hand of each production is a single
non-terminal, e.g.,
<palindrome> ::= “a” <palindrome> “a”
This means that “a”, followed by a <palindrome>, and by
“a” will always be considered as <palindrome>, no matter
what is the context, hence context free.
In a Context-Sensitive Grammar, left hand of production rules
can include other things
An Example Context-Sensitive
Grammar for anbncn
1.
S aBC
2.
S  a S B C
S→2 aSBC
→1 aaBCBC
3. B C  C B
→3 aaBBCC
4. a B  a b
→4 aabBCC
5. b B  b b
→5 aabbCC
6. b C  b c
→6 aabbcC
7. c C  c c
→7 aabbcc
Yacc & Parsing
There are many ways to parse BNF grammars,
most of which are discussed in a compilers
course.
• Recall: A finite state automaton (FSA) is
used for regular expressions. (CS182).
• For a context-free grammar, we use a
pushdown automaton, which combines
features of a FSA with a stack.
Yacc & Parsing
Yacc generates what is known as a LALR
parser, which is generated from the BNF
grammar in your Yacc file. This parser is
defined in the C source file that Yacc
generates.
We use Lex to make a lexer to generate our
terminals, which are matched with regular
expressions before being fed into the parser.
Yacc & Parsing
Input Characters
LEX
Lexer
terminals
YACC
Parser
RuleBased
Behavior
Yacc is capable of generating a powerful
parser that will handle many different
grammars.
Yacc & Parsing
•
Recall that parsing combines a state machine with
a stack. States go on a stack to keep track of
where parsing is. Yacc uses a parse table which
defines possible states.
• Yacc’s parser operates using two primary actions,
shift and reduce.
• shift puts a state on the stack, reduce pops state(s)
off the stack and reduces combinations of
nonterminals and terminals to a single
nonterminal. After a reduction to a rule, Yacc’s
parser will optionally run some user-defined
code.
Yacc & Parsing
A very basic example:
<rule> := “hello” “world” “\n”
The parser would shift each word,
successively pushing each state (.”hello”,
.”world”, .”\n”) onto the stack. Then at the
end of the rule, reduce everything to <rule>
and pop the three states.
A Lex/Yacc Infix Calculator
• Yacc’s parser is powerful, but is not
capable of parsing all grammars.
• Certain ambiguous grammars may produce
what is known as a shift/reduce or
reduce/reduce conflict. Yacc will, by
default, shift instead of reduce.
Yacc & Parsing
Consider the classic shift/reduce conflict
example:
<ifexp> ::= IF <expr> THEN <stmt> ELSE <stmt>
|
IF <expr> THEN <stmt>
Yacc will have a shift/reduce conflict here, but
will go with shift (the top option) by default.
It’s greedy!
A Lex/Yacc Infix Calculator
• To demonstrate the utility of Lex and Yacc
(or in our case, Flex and Bison) we provide
an example infix calculator.
• Similar to several of the examples provided
on the Lex and Yacc manpage at
http://dinosaur.compilertools.net, but with
added features
A Lex/Yacc Infix Calculator
• Make sure to read ALL source code
comments, particularly those that describe
source file organization.
• Lex definition file: calculator.l
• Yacc grammar file: calculator.y
• AST Classes: ast.cc
• Symbol table: symtab.cc
A Lex/Yacc Infix Calculator
The example calculator application uses Lex
and Yacc to parse mathematical expressions
and produce an Abstract Syntax Tree, which is
then used to evaluate those expressions.
It allows the =, +, *, -, +, ^, () and unary
minus operators, with appropriate levels of
binding and precedence. Examine calculator.y,
because it is heavily commented.
A Lex/Yacc Infix Calculator
• The symbol table (implemented here in
simple O(n) access time) maps variables to
values.
• Print the AST after every expression
evaluation by running calculator with the –t
flag, e.g. “calc –t”.
A Lex/Yacc Infix Calculator
A calculator example. Type “2*2^3/3 and press enter:
calc> 2*2^3/3
= 5.333333
3.000
/
(/)
\
3.000
/
(^)
\
2.000
/
(*)
\
2.000
Review
• What are required:
• Able to write simple Context Free
Grammars, similar to those used in
implementing FIZ
• Able to determine whether a string of
tokens is accepted by a grammar
• Able to show how a string of tokens is
parsed into some non-terminal (i.e.,
draw the parsing tree)

similar documents