Context-Free Grammar
CS 107
Theory of Automata
Chapter 08
This is a different model for describing
languages
The language is specified by productions
(substitution rules) that tell how strings can be obtained, e.g.
A → 0A1
A → B
B → #
A, B are variables
0, 1, # are terminals
A is the start variable
Using these rules, we can derive strings like
this:
A ⇒ 0A1 ⇒ 00A11 ⇒ 000A111 ⇒ 000B111 ⇒ 000#111
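A minimal Python sketch of how such a derivation can be produced mechanically. It assumes the three rules A → 0A1, A → B, B → # read from the slide; the function name `derive` is illustrative and not part of the original material.

```python
def derive(n):
    """Return the sentential forms of a derivation of 0^n # 1^n.

    Assumes the productions A -> 0A1, A -> B, B -> #.
    """
    forms = ["A"]
    for _ in range(n):
        # apply A -> 0A1 to the single variable in the current form
        forms.append(forms[-1].replace("A", "0A1"))
    forms.append(forms[-1].replace("A", "B"))   # apply A -> B
    forms.append(forms[-1].replace("B", "#"))   # apply B -> #
    return forms

print(" => ".join(derive(3)))
```

Calling `derive(3)` replays the derivation shown above, one rule application per step.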
CJD
Context-Free Grammars
Natural Language
English and CFGs
We can describe (some fragments of) the English language by a context-free grammar:
SENTENCE → NOUN-PHRASE VERB-PHRASE
NOUN-PHRASE → CMPLX-NOUN
NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE
VERB-PHRASE → CMPLX-VERB
VERB-PHRASE → CMPLX-VERB PREP-PHRASE
PREP-PHRASE → PREP CMPLX-NOUN
CMPLX-NOUN → ARTICLE NOUN
CMPLX-VERB → VERB NOUN-PHRASE
CMPLX-VERB → VERB
ARTICLE → a
ARTICLE → the
NOUN → boy
NOUN → girl
NOUN → flower
VERB → likes
VERB → touches
VERB → sees
PREP → with
Context-free grammars were first used for natural languages.
variables: SENTENCE, NOUN-PHRASE, ...
terminals: a, the, boy, girl, flower, likes, touches, sees, with
start variable: SENTENCE
Example sentence: a girl with a flower likes the boy
[Figure: parse tree of the example sentence. SENTENCE has children NOUN-PHRASE and VERB-PHRASE; NOUN-PHRASE has children CMPLX-NOUN (a girl) and PREP-PHRASE (with a flower); VERB-PHRASE has child CMPLX-VERB (likes the boy).]
Programming Languages
Context-free grammars are also used to describe (parts of) programming languages.
For instance, expressions like (2 + 3) * 5 or 3 + (8 + 2) * 7 can be described by the CFG
<expr> → <expr> + <expr>
<expr> → <expr> * <expr>
<expr> → (<expr>)
<expr> → 0
<expr> → 1
...
<expr> → 9
Variables: <expr>
Terminals: +, *, (, ), 0, 1, ..., 9
CFGs for Compilers
Context-free grammars are essential for understanding the meaning of computer programs.
code: (2 + 3) * 5
meaning: add 2 and 3, and then multiply by 5
They are used in compilers.
BNF
John Backus and Peter Naur
BNF: Backus-Naur Form
A way to describe grammars and define the syntax of programming languages (Algol), 1959-1963
A BNF grammar is a CFG, with notational changes:
Nonterminals are written as words enclosed in angle brackets: <exp> instead of E
Productions use ::= instead of →
The empty string is <empty> instead of ε
CFGs (due to Chomsky) came a few years earlier, but BNF was developed independently.
Example
<exp> ::= <exp> - <exp> | <exp> * <exp> | <exp> = <exp> | <exp> < <exp> | (<exp>) | a | b | c
This BNF generates a little language of expressions which includes:
a < b
( a - ( b * c ) )
( b * a ) = ( c < b ) - a
Example
<stmt> ::= <exp-stmt> | <while-stmt> | <compound-stmt> | ...
<exp-stmt> ::= <exp> ;
<while-stmt> ::= while ( <exp> ) <stmt>
<compound-stmt> ::= { <stmt-list> }
<stmt-list> ::= <stmt> <stmt-list> | <empty>
This BNF generates C-like statements, like
while (a<b) { c = c * a; a = a + a; }
This is just a toy example; the BNF grammar for a full language may include hundreds of productions.
HTML
A subset of HTML can be described as follows:
Doc is a sequence of elements
Element is Text, or a pair of matching tags and the document between them, or an unmatched tag followed by a document
Text is any string of characters literally interpreted (i.e. there are no tags, user-text)
Char is any single character legal in HTML tags
List is a sequence of zero or more list items
ListItem is the <LI> tag followed by a document followed by </LI>
Limited HTML Grammar
Doc → ε | Element Doc
Element → Text | <EM> Doc </EM> | <P> Doc | <OL> List </OL>
Text → ε | Char Text
Char → a | A | ...
List → ε | ListItem List
ListItem → <LI> Doc </LI>
Formal Definition
A Context-Free Grammar (CFG) is a 4-tuple (V, T, P, S) where
V is a finite set of variables or non-terminals
T is a finite set of terminals (V ∩ T = ∅)
P is a set of productions or substitution rules of the form A → α, where A is a symbol in V and α is a string over V ∪ T
S is a variable in V called the start variable
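The 4-tuple definition can be written down directly as a data structure. The following is a minimal Python sketch, not a standard API: the class name `CFG`, its field names, and the choice to store each production as a (head, body) pair are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    variables: frozenset      # V
    terminals: frozenset      # T
    productions: frozenset    # P: pairs (head, body), body is a tuple of symbols
    start: str                # S

    def __post_init__(self):
        # V and T must be disjoint
        if self.variables & self.terminals:
            raise ValueError("V and T must be disjoint")
        # S must be a variable
        if self.start not in self.variables:
            raise ValueError("start variable must be in V")
        symbols = self.variables | self.terminals
        for head, body in self.productions:
            # each production has the form A -> alpha with A in V, alpha over V u T
            if head not in self.variables:
                raise ValueError(f"head {head!r} is not a variable")
            if any(sym not in symbols for sym in body):
                raise ValueError(f"body {body!r} uses unknown symbols")

# The grammar A -> 0A1 | B, B -> # written as a 4-tuple.
G = CFG(
    variables=frozenset({"A", "B"}),
    terminals=frozenset({"0", "1", "#"}),
    productions=frozenset({("A", ("0", "A", "1")), ("A", ("B",)), ("B", ("#",))}),
    start="A",
)
```

An empty tuple `()` as a body would stand for an ε-production in this representation.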
Convention
Variables: first few uppercase letters and S
ex. A, B, C, D, E, S
S is start symbol unless otherwise specified
Terminals: digits and first few lowercase letters
ex. a, b, c, d, e, 0, 1, ..., 9
Symbols (variables or terminals): last few uppercase letters
ex. X, Y, Z
Strings of terminals: last few lowercase letters
ex. u, v, w, x, y, z
Strings of variables and terminals: Greek letters
ex. α, β, γ
Shorthand for Productions
When we have multiple productions with the same variable on the left, like
E → E + E
E → E * E
E → (E)
E → N
N → 0N
N → 1N
N → 0
N → 1
Variables: E, N
Terminals: +, *, (, ), 0, 1
Start variable: E
we can write this in shorthand as
E → E + E | E * E | (E) | N
N → 0N | 1N | 0 | 1
Derivation
A derivation is a top-down sequential application of productions:
E ⇒ E * E ⇒ (E) * E ⇒ (E) * N ⇒ (E + E) * N ⇒ (E + E) * 1 ⇒ (E + N) * 1 ⇒ (N + N) * 1 ⇒ (N + 1N) * 1 ⇒ (N + 10) * 1 ⇒ (1 + 10) * 1
α ⇒ β means β can be obtained from α with one production (one derivation step)
α ⇒* β means β can be obtained from α after zero or more productions
α ⇒^i β means β can be obtained from α in exactly i productions
Language of a CFG
If S ⇒* α and α contains variables and terminals, then α is called a sentential form of G. If α does not contain variables, it is called a sentence of G.
The language of a CFG G = (V, T, P, S) is the set of all sentences of G.
L = { w | w ∈ T* and S ⇒* w }
A language L is context-free if it is the language of some CFG.
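The derivation of (1 + 10) * 1 can be replayed step by step. This is a small Python sketch under the assumption that sentential forms are plain strings with one character per symbol; the names `PRODUCTIONS`, `step` and `STEPS` are illustrative only.

```python
# Productions of the example grammar E -> E+E | E*E | (E) | N, N -> 0N | 1N | 0 | 1.
PRODUCTIONS = {
    "E": {"E+E", "E*E", "(E)", "N"},
    "N": {"0N", "1N", "0", "1"},
}

def step(form, pos, body):
    """Apply the production form[pos] -> body once and return the new form."""
    head = form[pos]
    if body not in PRODUCTIONS.get(head, ()):
        raise ValueError(f"{head} -> {body} is not a production")
    return form[:pos] + body + form[pos + 1:]

# (position, body) pairs replaying the derivation from the slide.
STEPS = [(0, "E*E"), (0, "(E)"), (4, "N"), (1, "E+E"), (6, "1"),
         (3, "N"), (1, "N"), (3, "1N"), (4, "0"), (1, "1")]

form = "E"
forms = [form]
for pos, body in STEPS:
    form = step(form, pos, body)
    forms.append(form)

print(" => ".join(forms))
```

The replay uses ten steps, so E ⇒^10 (1+10)*1. Every intermediate string is a sentential form, and the last one, which contains no variables, is a sentence.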
Example 1
productions:
A → 0A1 | B
B → #
variables: A, B
terminals: 0, 1, #
start variable: A
Is the string 00#11 in L? How about 00#111, 00#0#1#11?
What is the language of this CFG?
L = { 0^n # 1^n : n ≥ 0 }
Example 2
S → SS | (S) | ε
Give derivations of (), (()())
S ⇒ (S) ⇒ ()   (rule 2, rule 3)
S ⇒ (S) ⇒ (SS) ⇒ ((S)S) ⇒ ((S)(S)) ⇒ (()(S)) ⇒ (()())   (rule 2, rule 1, rule 2, rule 2, rule 3, rule 3)
How about ())?
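The membership questions in Examples 1 and 2 can be checked with recognizers that mirror the productions. This is a hedged Python sketch: the function names are made up for illustration, and the recursion follows the two grammars as read from the slides (A → 0A1 | B, B → # and S → SS | (S) | ε).

```python
from functools import lru_cache

def in_example1(w):
    """True if w is derivable from A, where A -> 0A1 | B and B -> #."""
    if w == "#":                       # A -> B -> #
        return True
    # A -> 0A1: strip one 0 on the left and one 1 on the right
    return len(w) >= 3 and w[0] == "0" and w[-1] == "1" and in_example1(w[1:-1])

@lru_cache(maxsize=None)
def in_example2(w):
    """True if w is derivable from S, where S -> SS | (S) | epsilon."""
    if w == "":                        # S -> epsilon
        return True
    if w[0] == "(" and w[-1] == ")" and in_example2(w[1:-1]):   # S -> (S)
        return True
    # S -> SS, trying only splits into two nonempty parts
    return any(in_example2(w[:k]) and in_example2(w[k:]) for k in range(1, len(w)))

for w in ["00#11", "00#111", "00#0#1#11"]:
    print(w, in_example1(w))
for w in ["()", "(()())", "())"]:
    print(w, in_example2(w))
```

Restricting S → SS to nonempty parts keeps the recursion finite; a split with an empty part would only reproduce the same question.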
Examples 3 and 4
{ a^n b^{3n} | n ≥ 1 }
Each a on the left can be paired with three bs on the right.
That gives S → aSbbb | abbb
{ xy | x ∈ {a,b}*, y ∈ {c,d}*, and |x| = |y| }
Each symbol on the left (either a or b) can be paired with one on the right (either c or d).
That gives
S → XSY | ε
X → a | b
Y → c | d
Example 5
Design a CFG for the following language:
L = { 0^i 1^j | i ≤ j ≤ 2i, i = 0, 1, ... }, Σ = {0, 1}
Consider two extreme cases:
(a) if j = i, then L1 = { 0^i 1^j : i = j }, generated by S → ε, S → 0S1 (red-rule)
(b) if j = 2i, then L2 = { 0^i 1^j : 2i = j }, generated by S → ε, S → 0S11 (blue-rule)
If i ≤ j ≤ 2i, then randomly choose red-rule or blue-rule in the generation:
S → ε
S → 0S1
S → 0S11
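One way to see that every string 0^i 1^j with i ≤ j ≤ 2i is reachable in Example 5 is to fix how often each rule is used: the red-rule 2i - j times and the blue-rule j - i times. The Python sketch below builds that derivation; the function name `derive_example5` is illustrative.

```python
def derive_example5(i, j):
    """Derive 0^i 1^j from S using S -> 0S1 (red), S -> 0S11 (blue), S -> epsilon."""
    if not (0 <= i <= j <= 2 * i):
        raise ValueError("need i <= j <= 2i")
    form = "S"
    for _ in range(2 * i - j):           # red-rule adds one 0 and one 1
        form = form.replace("S", "0S1")
    for _ in range(j - i):               # blue-rule adds one 0 and two 1s
        form = form.replace("S", "0S11")
    return form.replace("S", "")         # S -> epsilon ends the derivation

print(derive_example5(3, 5))
```

The counts work out because (2i - j) + (j - i) = i zeros and (2i - j) + 2(j - i) = j ones are produced.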
Example 5 Proof
L = { 0^i 1^j : i ≤ j ≤ 2i, i = 0, 1, ... }, Σ = {0, 1}
G:
S → ε
S → 0S1 (red-rule)
S → 0S11 (blue-rule)
Need to verify L = L(G)
1) L(G) is a subset of L: each application of the red-rule adds one 0 and one 1, and each application of the blue-rule adds one 0 and two 1s. So in every derivation the number of 1s is at least the number of 0s and at most twice that number. So, L(G) is a subset of L.
2) L is a subset of L(G): For any w = 0^i 1^j with i ≤ j ≤ 2i, we use the red-rule (2i - j) times and then the blue-rule (j - i) times, i.e.,
S ⇒* 0^{2i-j} S 1^{2i-j} ⇒* 0^{2i-j} 0^{j-i} S 1^{2(j-i)} 1^{2i-j} ⇒ 0^i 1^j = w
Example 6
Design a CFG for the following language:
L = { a^i b^j c^k d^l | i, j, k, l = 0, 1, ...; i + k = j + l }
In other words, each a and c is matched by some b or d, and each b and d is matched by some a or c.
To match a and d, use S → aSd | ε
To match a and b, use A → aAb | ε
To match c and b, use B → bBc | ε
To match c and d, use C → cCd | ε
a and d are far apart: they must be produced first by letting S be the start symbol. Afterwards, S must transition into the other productions that match adjacent terminals, with S → ABC.
Example 6
L = { a^i b^j c^k d^l | i, j, k, l = 0, 1, ...; i + k = j + l }
Suppose S ⇒^n a^n S d^n ⇒ a^n ABC d^n ⇒* w.
To get a^i and d^l we need A ⇒* a^{i-n} b^{i-n} and C ⇒* c^{l-n} d^{l-n}.
To get b^j and c^k we need B ⇒* b^{j-(i-n)} c^{k-(l-n)}.
So w ∈ L ⟺ j-(i-n) = k-(l-n) ⟺ i+k = j+l.
[Figure: parse tree for w. S yields a^n ... d^n around A B C; A yields a^{i-n} b^{i-n}, B yields b^{j-(i-n)} c^{k-(l-n)}, C yields c^{l-n} d^{l-n}.]
Exercise: Designing CFGs
Design a CFG for the following languages:
Linear equations over integers, x, y, z, like: x + 5y - z = 9, 11x - y = 2
Numbers without leading zeros, e.g., 109, 0 but not 019
L1 = { a^n b^n c^m d^m | n ≥ 0, m ≥ 0 }
L2 = { a^n b^m c^m d^n | n ≥ 0, m ≥ 0 }
L3 = { 0^n 1^n | n ≥ 1 }
L4 = { a^i b^j c^k | i ≠ j or j ≠ k }
CFLs vs Regular Languages
Write a CFG for the language (0 + 1)*111
S → A111
A → ε | 0A | 1A
Can you do so for every regular language?
Every regular language is context-free
Proof: regular expressions, NFAs and DFAs describe the same languages, so it is enough to give a CFG for every regular expression.
From Regular to Context-Free
regular expression       CFG
∅                        grammar with no rules
ε                        S → ε
a (alphabet symbol)      S → a
E1 + E2                  S → S1 | S2
E1E2                     S → S1S2
E1*                      S → SS1 | ε
Here S1 and S2 are the start symbols of the grammars for E1 and E2. In all cases, S becomes the new start symbol.
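As a sanity check of the grammar S → A111, A → ε | 0A | 1A given for (0 + 1)*111, the sketch below compares a recognizer that mirrors the productions with Python's `re` module on all short binary strings. The helper names `in_A`, `in_S` and `agree_up_to` are assumptions for illustration, and the check is bounded, so it is evidence rather than a proof.

```python
import re
from itertools import product

def in_A(w):
    """A -> epsilon | 0A | 1A."""
    return w == "" or (w[0] in "01" and in_A(w[1:]))

def in_S(w):
    """S -> A111."""
    return w.endswith("111") and in_A(w[:-3])

# Python's re syntax writes the union (0 + 1) as the character class [01].
PATTERN = re.compile(r"[01]*111")

def agree_up_to(max_len):
    """Compare grammar and regular expression on all strings of length <= max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            w = "".join(chars)
            if in_S(w) != (PATTERN.fullmatch(w) is not None):
                return False
    return True

print(agree_up_to(8))
```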
CFLs vs Regular Languages
Is every context-free language regular? No! We already saw some examples:
A → 0A1 | B
B → #
L = { 0^n # 1^n : n ≥ 0 }
This language is context-free but not regular
Another CFL that is Not Regular
Language of palindromes
We can easily show using the pumping lemma that the language L = { w | w = w^R } is not regular. However, we can describe this language by the following context-free grammar over the alphabet {0,1}:
Inductive definition:
P → ε
P → 0
P → 1
P → 0P0
P → 1P1
More compactly: P → ε | 0 | 1 | 0P0 | 1P1
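The inductive definition of palindromes can be run as a generator. The following Python sketch builds every string of bounded length from the rules P → ε | 0 | 1 | 0P0 | 1P1 and compares the result with a brute-force check of w = w^R; the function names are illustrative.

```python
from itertools import product

def palindromes(max_len):
    """All strings of length <= max_len derivable from P -> eps | 0 | 1 | 0P0 | 1P1."""
    found = {""}                      # P -> epsilon
    if max_len >= 1:
        found |= {"0", "1"}           # P -> 0, P -> 1
    frontier = set(found)
    while frontier:
        # wrap every newly found string: P -> 0P0 | 1P1
        frontier = {c + w + c for w in frontier for c in "01"
                    if len(w) + 2 <= max_len} - found
        found |= frontier
    return found

def brute_force(max_len):
    """All binary strings of length <= max_len that equal their own reverse."""
    return {w for n in range(max_len + 1)
            for w in map("".join, product("01", repeat=n)) if w == w[::-1]}

print(sorted(palindromes(3), key=lambda w: (len(w), w)))
```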
Parse Trees
Derivations can also be represented using parse trees
E → E + E | E - E | (E) | V
V → x | y | z
E ⇒ E + E ⇒ V + E ⇒ x + E ⇒ x + (E) ⇒ x + (E - E) ⇒ x + (V - E) ⇒ x + (y - E) ⇒ x + (y - V) ⇒ x + (y - z)
[Figure: parse tree for this derivation]
The yield of the parse tree is x + (y - z)
Definition of a Parse Tree
A parse tree or derivation tree for a CFG G is an ordered tree with labels on the nodes such that
Every internal node is labeled by a variable
The root is labeled S
Nodes labeled by ε are leaves with no siblings
If a node is labeled A and has children X1, ..., Xk from left to right, then the rule A → X1 ... Xk is a production in G.
A subtree is a node of a tree with all its descendants and connecting edges.
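The definition can be made concrete with a small tree type. This Python sketch assumes the grammar E → E+E | E-E | (E) | V, V → x | y | z and assumes that the operator lost inside the parentheses in the extracted text is the "-" of the grammar. `Node`, `tree_yield` and `is_parse_tree` are illustrative names, and for simplicity the check only accepts trees whose leaves are all terminals.

```python
from dataclasses import dataclass, field

# Assumed grammar: E -> E+E | E-E | (E) | V, V -> x | y | z.
PRODUCTIONS = {
    "E": [("E", "+", "E"), ("E", "-", "E"), ("(", "E", ")"), ("V",)],
    "V": [("x",), ("y",), ("z",)],
}

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def tree_yield(node):
    """Concatenate the leaf labels from left to right."""
    if not node.children:
        return node.label
    return "".join(tree_yield(child) for child in node.children)

def is_parse_tree(node):
    """Check that each internal node A with children X1..Xk uses a production A -> X1..Xk."""
    if not node.children:
        # simplification: the grammar has no epsilon-productions, so leaves are terminals
        return node.label not in PRODUCTIONS
    body = tuple(child.label for child in node.children)
    return body in PRODUCTIONS.get(node.label, []) and all(map(is_parse_tree, node.children))

def var(symbol):
    """Subtree E -> V -> symbol."""
    return Node("E", [Node("V", [Node(symbol)])])

# Parse tree with yield x+(y-z).
inner = Node("E", [var("y"), Node("-"), var("z")])
tree = Node("E", [var("x"), Node("+"), Node("E", [Node("("), inner, Node(")")])])

print(tree_yield(tree), is_parse_tree(tree))
```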
A BNF Parse Tree
<exp> ::= <ltexp> = <exp> | <ltexp>
<ltexp> ::= <ltexp> < <subexp> | <subexp>
<subexp> ::= <subexp> - <mulexp> | <mulexp>
<mulexp> ::= <mulexp> * <rootexp> | <rootexp>
<rootexp> ::= (<exp>) | a | b | c
[Figure: parse tree for a - b * c. <exp> → <ltexp> → <subexp>, which splits into <subexp> - <mulexp>; the left <subexp> derives a via <mulexp> and <rootexp>; the right <mulexp> splits into <mulexp> * <rootexp>, deriving b and c.]
CFGs and Parse Trees
Theorem: Let G = (V, T, P, S) be a context-free grammar. Then S ⇒* α if and only if there is a parse tree in grammar G with yield α.
Proof: Prove a stronger version of the theorem:
For any A ∈ V, A ⇒* α ⟺ there is an A-tree (i.e. rooted at A) with yield α.
So if it is true for any A, it is true for S.
Parse Tree to Derivation
Prove: If G has a parse A-tree with yield α, then A ⇒* α.
Inductive proof on the number of interior vertices (i.v.) of the A-tree.
Basis: If there is only one i.v., it is A. A has children X1, X2, ..., Xn where yield α = X1 X2 ... Xn. So A → α is a production and A ⇒* α.
Induction: Suppose the result is true if |i.v.| < k, k > 1. Let α be the yield of the A-tree with k i.v.s. Let the sons of A be X1, X2, ..., Xn, so A → X1 X2 ... Xn.
For each Xi which is not a leaf (they exist because k > 1), Xi is a variable with yield αi. Since the Xi-tree has fewer than k i.v.s, by the inductive hypothesis, Xi ⇒* αi.
For any Xj which is a leaf, Xj = αj, so Xj ⇒* αj.
So A ⇒ X1 X2 ... Xn ⇒* α1 α2 ... αn = α.
So A ⇒* α.
Derivation to Parse Tree
Prove: If A ⇒* α, then G has a parse tree with yield α.
Inductive proof on the number of steps in the derivation of α.
Basis: Suppose A ⇒ α in 1 step. Then A → α = X1 X2 ... Xn is a production and there is an A-tree with children the Xi's and yield α.
Induction: Suppose that for every B ∈ V, B ⇒* β in less than k steps, k > 1, implies there is a parse tree with yield β, and suppose A ⇒* α = α1 α2 ... αn in k steps, with the first step A ⇒ X1 X2 ... Xn, so that Xi ⇒* αi if Xi is a variable or Xi = αi if Xi is a terminal.
If Xi is a variable, it derives αi in less than k steps, so there is a parse Xi-tree with yield αi.
Construct the A-tree with children Xi, replacing each Xi that is a terminal by αi and each Xi that is a variable by the Xi-tree. Clearly, this A-tree is a parse tree with yield α.
Leftmost & Rightmost Derivations
Leftmost derivation: always derives from the leftmost variable first:
E ⇒ E + E ⇒ V + E ⇒ x + E ⇒ x + (E) ⇒ x + (E - E) ⇒ x + (V - E) ⇒ x + (y - E) ⇒ x + (y - V) ⇒ x + (y - z)
Rightmost derivation: always derives from the rightmost variable first.
[Figure: parse tree with yield x + (y - z)]
Many LM and RM Derivations
Let L be a CFL and w ∈ L.
A leftmost derivation of w corresponds to exactly one parse tree and vice versa.
A parse tree of w corresponds to exactly one rightmost derivation and vice versa.
w may have one or more derivations.
w may have one or more leftmost derivations (i.e. w may have one or more parse trees).
w may have one or more rightmost derivations (i.e. w may have one or more parse trees).
Ambiguity
The parse tree represents the intended meaning.
A grammar is ambiguous if some string has more than one parse tree (i.e. more than one leftmost derivation, or equivalently more than one rightmost derivation).
Example:
E → E + E | E - E | E * E | (E) | V
V → x | y | z
[Figure: two parse trees, both with yield x + y * z. In the first, E → E + E at the root: first multiply y and z, and then add this to x. In the second, E → E * E at the root: first add x and y, and then multiply this by z.]
Disambiguation Example
Some ambiguous grammars can be disambiguated by enforcing precedence and associativity rules.
E → E+E | E-E | E*E | E/E | E^E | x | y | z | (E)
precedence, highest first: ^ (right to left), then * and / (left to right), then + and - (left to right)
P → x | y | z | (E)   (start with most basic indivisible elements)
F → P^F | P   (not F → F^P because ^ is right to left)
T → T*F | T/F | F   (in each step, refer to next higher precedence level)
E → E+T | E-T | T
In x*y^z+x/(y-z)
T stands for term: x*y^z, x/(y-z), y, z
F stands for factor: x, y^z, x, (y-z)
P stands for power: x, y, z, (y-z)
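Ambiguity can be observed by counting parse trees. The Python sketch below counts the parse trees of a string under two grammars: the ambiguous expression grammar restricted to + and * (the operator symbols of the example are partly lost in the extracted text, so this restriction is an assumption) and the layered grammar with E, T, F, P. The function `count_parse_trees` is an illustrative helper, not a standard algorithm name; it assumes one character per terminal, no ε-productions and no cycles of unit productions.

```python
from functools import lru_cache

AMBIGUOUS = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("V",)],
    "V": [("x",), ("y",), ("z",)],
}

UNAMBIGUOUS = {
    "E": [("E", "+", "T"), ("E", "-", "T"), ("T",)],
    "T": [("T", "*", "F"), ("T", "/", "F"), ("F",)],
    "F": [("P", "^", "F"), ("P",)],
    "P": [("x",), ("y",), ("z",), ("(", "E", ")")],
}

def count_parse_trees(grammar, start, text):
    """Count parse trees of text rooted at start (one character per terminal)."""
    tokens = tuple(text)

    @lru_cache(maxsize=None)
    def count_symbol(sym, i, j):
        # number of parse trees with root sym and yield tokens[i:j]
        if sym not in grammar:                       # terminal
            return 1 if j - i == 1 and tokens[i] == sym else 0
        return sum(count_body(body, i, j) for body in grammar[sym])

    @lru_cache(maxsize=None)
    def count_body(body, i, j):
        # ways the symbols of body derive tokens[i:j], each covering a nonempty piece
        if len(body) == 1:
            return count_symbol(body[0], i, j)
        total = 0
        for k in range(i + 1, j - len(body) + 2):    # leave one token per remaining symbol
            left = count_symbol(body[0], i, k)
            if left:
                total += left * count_body(body[1:], k, j)
        return total

    return count_symbol(start, 0, len(tokens))

print(count_parse_trees(AMBIGUOUS, "E", "x+y*z"))
print(count_parse_trees(UNAMBIGUOUS, "E", "x+y*z"))
```

A count above one for some string shows that a grammar is ambiguous; a count of one for the strings tried is only consistent with unambiguity and does not prove it.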
Inherently Ambiguous Languages
Can we always disambiguate a grammar? No, for two reasons:
1. There exist inherently ambiguous context-free languages L: every CFG for such a language L is ambiguous.
Ex. L = { a^n b^n c^m d^m | n ≥ 1, m ≥ 1 } ∪ { a^n b^m c^m d^n | n ≥ 1, m ≥ 1 }
Text has shown: a^n b^n c^n d^n, n ≥ 1 has more than one leftmost derivation.
2. There is no general procedure that can tell if a grammar is ambiguous.
However, grammars used in programming languages can typically be disambiguated.
Recursive Inference
A bottom-up process for the derivation of a string.
Example
E → T | E + T | E - T
T → F | T * F
F → (E) | V
V → x | y | z
[Figure: parse tree over the symbols x, y, z built bottom-up from this grammar]
Another Example
S → aB | bA
A → a | aS | bAA
B → b | bS | aBB
Are ab, baba, abbbaa in L? How about a, bba?
What is the language of this CFG?
Is the CFG ambiguous?
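To explore these questions experimentally, the sketch below enumerates every sentence of the grammar up to a length bound using leftmost derivations, and compares the result with the set of nonempty strings that contain as many a's as b's. The comparison is a bounded experiment, not a proof, and it does not settle the ambiguity question. The names `sentences` and `equal_counts` are illustrative.

```python
from collections import deque
from itertools import product

PRODUCTIONS = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}

def sentences(max_len):
    """All terminal strings of length <= max_len derivable from S.

    No production body is empty, so sentential forms never shrink and any
    form longer than max_len can be discarded.
    """
    seen = {"S"}
    queue = deque(["S"])
    result = set()
    while queue:
        form = queue.popleft()
        # position of the leftmost variable, if any
        pos = next((k for k, sym in enumerate(form) if sym in PRODUCTIONS), None)
        if pos is None:
            result.add(form)
            continue
        for body in PRODUCTIONS[form[pos]]:
            new = form[:pos] + body + form[pos + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result

def equal_counts(max_len):
    """Nonempty strings over {a, b} with as many a's as b's."""
    return {w for n in range(1, max_len + 1)
            for w in map("".join, product("ab", repeat=n))
            if w.count("a") == w.count("b")}

L = sentences(8)
print(L == equal_counts(8))
```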