
CS107-08 CFGs

Context-Free Grammar (CFG) is a model for describing languages using production rules that derive strings from variables. CFGs are applicable in both natural languages and programming languages, allowing for the definition of complex structures like sentences and expressions. The document also discusses Backus-Naur Form (BNF) as a notation for CFGs and explores examples and exercises related to designing CFGs for various languages.

Uploaded by

'Elijah Recto
Copyright
© Attribution Non-Commercial (BY-NC)

Context-Free Grammar

CS 107
Theory of Automata
Chapter 08

This is a different model for describing languages.

The language is specified by productions (substitution rules) that tell how strings can be obtained, e.g.

    A → 0A1
    A → B
    B → #

A, B are variables; 0, 1, # are terminals; A is the start variable.

Using these rules, we can derive strings like this:

    A ⇒ 0A1 ⇒ 00A11 ⇒ 000A111 ⇒ 000B111 ⇒ 000#111
CJD
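The derivation above can be replayed mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `derive` is hypothetical. It applies A → 0A1 a chosen number of times, then A → B, then B → #.

```python
def derive(n):
    """Return the sentential forms in a derivation of 0^n # 1^n from A."""
    current = "A"
    steps = [current]
    for _ in range(n):
        current = current.replace("A", "0A1", 1)  # apply A -> 0A1
        steps.append(current)
    current = current.replace("A", "B", 1)        # apply A -> B
    steps.append(current)
    current = current.replace("B", "#", 1)        # apply B -> #
    steps.append(current)
    return steps

print(" => ".join(derive(3)))
```

With n = 3 this reproduces the six sentential forms shown on the slide.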

Natural Language

Context-free grammars were first used for natural languages.

    a girl with a flower likes the boy

[Figure: parse tree of this sentence. The words are labeled ART NOUN PREP ART NOUN VERB ART NOUN; they are grouped into CMPLX-NOUN, PREP-PHRASE, CMPLX-VERB, NOUN-PHRASE and VERB-PHRASE nodes under the root SENTENCE.]

English and CFGs

We can describe (some fragments of) the English language by a context-free grammar:

    SENTENCE → NOUN-PHRASE VERB-PHRASE
    NOUN-PHRASE → CMPLX-NOUN
    NOUN-PHRASE → CMPLX-NOUN PREP-PHRASE
    VERB-PHRASE → CMPLX-VERB
    VERB-PHRASE → CMPLX-VERB PREP-PHRASE
    PREP-PHRASE → PREP CMPLX-NOUN
    CMPLX-NOUN → ARTICLE NOUN
    CMPLX-VERB → VERB NOUN-PHRASE
    CMPLX-VERB → VERB
    ARTICLE → a
    ARTICLE → the
    NOUN → boy
    NOUN → girl
    NOUN → flower
    VERB → likes
    VERB → touches
    VERB → sees
    PREP → with

variables: SENTENCE, NOUN-PHRASE, …
terminals: a, the, boy, girl, flower, likes, touches, sees, with
start variable: SENTENCE

Programming Languages

Context-free grammars are also used to describe (parts of) programming languages.

For instance, expressions like (2 + 3) * 5 or 3 + (8 + 2) * 7 can be described by the CFG

    <expr> → <expr> + <expr>
    <expr> → <expr> * <expr>
    <expr> → (<expr>)
    <expr> → 0
    <expr> → 1
    …
    <expr> → 9

Variables: <expr>
Terminals: +, *, (, ), 0, 1, …, 9

CFGs for Compilers

Context-free grammars are essential for understanding the meaning of computer programs.

    code: (2 + 3) * 5
    meaning: "add 2 and 3, and then multiply by 5"

They are used in compilers.

BNF

John Backus and Peter Naur
BNF: Backus-Naur Form
A way to describe grammars and define the syntax of programming languages (Algol), 1959-1963

A BNF grammar is a CFG, with notational changes:
    Nonterminals are written as words enclosed in angle brackets: <exp> instead of E
    Productions use ::= instead of →
    The empty string is <empty> instead of ε

CFGs (due to Chomsky) came a few years earlier, but BNF was developed independently.

Example

    <exp> ::= <exp> - <exp> | <exp> * <exp> | <exp> = <exp>
            | <exp> < <exp> | (<exp>) | a | b | c

This BNF generates a little language of expressions which includes:

    a < b
    ( a - ( b * c ) )
    ( b * a ) = ( c < b ) - a

Example

    <stmt> ::= <exp-stmt> | <while-stmt> | <compound-stmt> |...
    <exp-stmt> ::= <exp> ;
    <while-stmt> ::= while ( <exp> ) <stmt>
    <compound-stmt> ::= { <stmt-list> }
    <stmt-list> ::= <stmt> <stmt-list> | <empty>

This BNF generates C-like statements, like

    while (a<b) { c = c * a; a = a + a; }

This is just a toy example; the BNF grammar for a full language may include hundreds of productions.

HTML

A subset of HTML can be described as follows:
    Doc is a sequence of elements
    Element is Text, or a pair of matching tags and the document between them, or an unmatched tag followed by a document
    Text is any string of characters literally interpreted (i.e. there are no tags, user-text)
    Char is any single character legal in HTML tags
    List is a sequence of zero or more list items
    ListItem is the <LI> tag followed by a document followed by </LI>

Limited HTML Grammar

    Doc → ε | Element Doc
    Element → Text | <EM> Doc </EM> | <P> Doc | <OL> List </OL>
    Text → ε | Char Text
    Char → a | A | …
    List → ε | ListItem List
    ListItem → <LI> Doc </LI>

Formal Definition

A Context-Free Grammar (CFG) is a 4-tuple (V, T, P, S) where
    V is a finite set of variables or non-terminals
    T is a finite set of terminals (V ∩ T = ∅)
    P is a set of productions or substitution rules of the form A → α, where A is a symbol in V and α is a string over V ∪ T
    S is a variable in V called the start variable
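The 4-tuple definition maps naturally onto a small data structure. The Python sketch below is only an illustration: the class name `CFG`, its field names and the method `is_well_formed` are assumptions made here, not notation from the slides. It checks the three side conditions of the definition for the grammar A → 0A1 | B, B → #.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    variables: frozenset   # V
    terminals: frozenset   # T
    productions: tuple     # P: pairs (A, alpha), alpha a tuple of symbols
    start: str             # S

    def is_well_formed(self):
        # V and T must be disjoint.
        if self.variables & self.terminals:
            return False
        # The start variable must belong to V.
        if self.start not in self.variables:
            return False
        # Every production A -> alpha needs A in V and alpha over V u T.
        symbols = self.variables | self.terminals
        return all(
            head in self.variables and all(s in symbols for s in body)
            for head, body in self.productions
        )

G = CFG(
    variables=frozenset({"A", "B"}),
    terminals=frozenset({"0", "1", "#"}),
    productions=(("A", ("0", "A", "1")), ("A", ("B",)), ("B", ("#",))),
    start="A",
)
print(G.is_well_formed())
```

An empty body tuple `()` would stand for a production with ε on the right-hand side.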

Convention

Variables: first few uppercase letters and S
    ex. A, B, C, D, E, S
    S is the start symbol unless otherwise specified
Terminals: digits and first few lowercase letters
    ex. a, b, c, d, e, 0, 1, …, 9
Symbols (variables or terminals): last few uppercase letters
    ex. X, Y, Z
Strings of terminals: last few lowercase letters
    ex. u, v, w, x, y, z
Strings of variables and terminals: Greek letters
    ex. α, β, γ

Shorthand for Productions

When we have multiple productions with the same variable on the left, like

    E → E + E
    E → E * E
    E → (E)
    E → N
    N → 0N
    N → 1N
    N → 0
    N → 1

    Variables: E, N
    Terminals: +, *, (, ), 0, 1
    Start variable: E

we can write this in shorthand as

    E → E + E | E * E | (E) | N
    N → 0N | 1N | 0 | 1

Derivation

A derivation is a top-down sequential application of productions:

    E ⇒ E * E
      ⇒ (E) * E
      ⇒ (E) * N
      ⇒ (E + E) * N
      ⇒ (E + E) * 1
      ⇒ (E + N) * 1
      ⇒ (N + N) * 1
      ⇒ (N + 1N) * 1
      ⇒ (N + 10) * 1
      ⇒ (1 + 10) * 1

α ⇒ β means β can be obtained from α with one production.
α ⇒* β means β can be obtained from α after zero or more productions.
α ⇒^i β means β can be obtained from α in exactly i productions.

Language of a CFG

If S ⇒* α and α contains variables and terminals, then α is called a sentential form of G. If α does not contain variables, it is called a sentence of G.

The language of a CFG G = (V, T, P, S) is the set of all sentences of G:

    L = { w | w ∈ T* and S ⇒* w }

A language L is context-free if it is the language of some CFG.
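The definition of the language as the set of all sentences can be explored by brute force for short strings. The sketch below is an illustration under stated assumptions: the names `PRODUCTIONS` and `sentences` are hypothetical, every symbol is a single character, and no production has an empty body, so sentential forms never shrink and anything longer than the bound can be discarded. It expands only the leftmost variable, which is enough to reach every sentence.

```python
from collections import deque

# The grammar E -> E+E | E*E | (E) | N, N -> 0N | 1N | 0 | 1.
PRODUCTIONS = {
    "E": ["E+E", "E*E", "(E)", "N"],
    "N": ["0N", "1N", "0", "1"],
}

def sentences(productions, start, max_len):
    """Return all sentences of length <= max_len derivable from start."""
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None for a sentence.
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            found.add(form)
            continue
        for body in productions[form[idx]]:
            new_form = form[:idx] + body + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found

print(sorted(sentences(PRODUCTIONS, "E", 3), key=lambda w: (len(w), w)))
```

The length bound is what makes the search finite; the language itself is infinite.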

Example 1

productions:

    A → 0A1 | B
    B → #

variables: A, B
terminals: 0, 1, #
start variable: A

Is the string 00#11 in L? How about 00#111, 00#0#1#11?
What is the language of this CFG?

    L = { 0^n # 1^n : n ≥ 0 }

Example 2

    S → SS | (S) | ε

Give derivations of (), (()())

    S ⇒ (S)         (rule 2)
      ⇒ ()          (rule 3)

    S ⇒ (S)         (rule 2)
      ⇒ (SS)        (rule 1)
      ⇒ ((S)S)      (rule 2)
      ⇒ ((S)(S))    (rule 2)
      ⇒ (()(S))     (rule 3)
      ⇒ (()())      (rule 3)

How about ())?
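Membership questions like the ones in Examples 1 and 2 can be checked with two short Python functions. This is a sketch, not the slides' method: `derives_A` mirrors the productions A → 0A1 | B, B → # directly, while `balanced` swaps in the standard depth-counter test for balanced parentheses, which is the language generated by S → SS | (S) | ε. Both function names are hypothetical.

```python
def derives_A(w):
    """True if w can be derived from A using A -> 0A1 | B and B -> #."""
    if w == "#":                                   # A -> B -> #
        return True
    return (len(w) >= 2 and w[0] == "0" and w[-1] == "1"
            and derives_A(w[1:-1]))                # A -> 0A1

def balanced(w):
    """True if w is a balanced string of parentheses."""
    depth = 0
    for ch in w:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        else:
            return False       # symbol outside the alphabet
        if depth < 0:
            return False       # a ')' closed nothing
    return depth == 0

for w in ["00#11", "00#111", "00#0#1#11"]:
    print(w, derives_A(w))
for w in ["()", "(()())", "())"]:
    print(w, balanced(w))
```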

Examples 3 and 4

    { a^n b^3n | n ≥ 1 }

Each a on the left can be paired with three b's on the right.
That gives S → aSbbb | ε
(With ε as the base case the grammar also derives the empty string, i.e. n = 0; use S → aSbbb | abbb to require n ≥ 1.)

    { xy | x ∈ {a,b}*, y ∈ {c,d}*, and |x| = |y| }

Each symbol on the left (either a or b) can be paired with one on the right (either c or d).
That gives

    S → XSY | ε
    X → a | b
    Y → c | d

Example 5

Design a CFG for the following language:

    L = { 0^i 1^j | i ≤ j ≤ 2i, i = 0, 1, … },  Σ = {0, 1}

Consider two extreme cases:
(a) if j = i, then L1 = { 0^i 1^j : i = j }:

    S → ε
    S → 0S1      (red-rule)

(b) if j = 2i, then L2 = { 0^i 1^j : 2i = j }:

    S → ε
    S → 0S11     (blue-rule)

If i ≤ j ≤ 2i, then at each step of the generation choose freely between the red-rule and the blue-rule:

    S → ε
    S → 0S1
    S → 0S11
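The claim that the grammar S → ε | 0S1 | 0S11 generates exactly the strings 0^i 1^j with i ≤ j ≤ 2i can be sanity-checked for small i. The sketch below is an illustration only; the function names `generated` and `target` are hypothetical. It enumerates every choice of red-rule or blue-rule up to a bound on the number of 0s and compares the result with the set defined by the inequality.

```python
def generated(max_zeros):
    """Strings derived from S -> eps | 0S1 | 0S11 with at most max_zeros 0s."""
    result = set()

    def expand(zeros, ones):
        # Finish the derivation here with S -> eps.
        result.add("0" * zeros + "1" * ones)
        if zeros < max_zeros:
            expand(zeros + 1, ones + 1)   # red-rule:  S -> 0S1
            expand(zeros + 1, ones + 2)   # blue-rule: S -> 0S11

    expand(0, 0)
    return result

def target(max_zeros):
    """Strings 0^i 1^j with i <= j <= 2i and i <= max_zeros."""
    return {"0" * i + "1" * j
            for i in range(max_zeros + 1)
            for j in range(i, 2 * i + 1)}

print(generated(6) == target(6))
```

A finite check like this is not a proof, but it catches mistakes in a candidate grammar quickly.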

Example 5 Proof

    L = { 0^i 1^j : i ≤ j ≤ 2i, i = 0, 1, … },  Σ = {0, 1}

    G:  S → ε
        S → 0S1      (red-rule)
        S → 0S11     (blue-rule)

Need to verify L = L(G).

1) L(G) is a subset of L: Each application of the red-rule or the blue-rule adds one 0 on the left and either one or two 1s on the right, so in every derivation the number of 1s is at least the number of 0s and at most twice that number. So, L(G) is a subset of L.

2) L is a subset of L(G): For any w = 0^i 1^j with i ≤ j ≤ 2i, we use the red-rule (2i - j) times and then the blue-rule (j - i) times, i.e.,

    S ⇒* 0^(2i-j) S 1^(2i-j)
      ⇒* 0^(2i-j) 0^(j-i) S 1^(2(j-i)) 1^(2i-j)
      ⇒  0^i 1^j = w

Example 6

Design a CFG for the following language:

    L = { a^i b^j c^k d^l | i, j, k, l = 0, 1, …; i + k = j + l }

In other words, each a and c is matched by some b or d; and each b and d is matched by some a or c.

    To match a and d, use S → aSd | ε
    To match a and b, use A → aAb | ε
    To match c and b, use B → bBc | ε
    To match c and d, use C → cCd | ε

a and d are far apart: they must be produced first by letting S be the start symbol. Afterwards, S must transition into the other productions that match adjacent terminals with S → ABC.
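The grammar of Example 6 (S → aSd | ABC, A → aAb | ε, B → bBc | ε, C → cCd | ε) can be checked against the condition i + k = j + l for short strings. This is a hedged sketch with hypothetical function names: it uses the observation that a derivation is determined by how many times each of the four recursive rules is applied (n, m, p, q below), and compares all strings of length at most 2·limit.

```python
from itertools import product

def generated(limit):
    """Strings a^(n+m) b^(m+p) c^(p+q) d^(q+n) with n+m+p+q <= limit.

    n, m, p, q count the uses of S -> aSd, A -> aAb, B -> bBc, C -> cCd.
    """
    out = set()
    for n, m, p, q in product(range(limit + 1), repeat=4):
        if n + m + p + q <= limit:
            out.add("a" * (n + m) + "b" * (m + p)
                    + "c" * (p + q) + "d" * (q + n))
    return out

def target(limit):
    """Strings a^i b^j c^k d^l with i+k = j+l and length <= 2*limit."""
    out = set()
    for i, j, k, l in product(range(limit + 1), repeat=4):
        if i + k == j + l and i + k <= limit:
            out.add("a" * i + "b" * j + "c" * k + "d" * l)
    return out

print(generated(4) == target(4))
```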

Example 6

    L = { a^i b^j c^k d^l | i, j, k, l = 0, 1, …; i + k = j + l }

Suppose

    S ⇒^n a^n S d^n ⇒ a^n ABC d^n ⇒* w

To get a^i and d^l we need

    A ⇒* a^(i-n) b^(i-n)   and   C ⇒* c^(l-n) d^(l-n)

To get b^j and c^k we need

    B ⇒* b^(j-(i-n)) c^(k-(l-n))

So for w in L(G), j - (i - n) = k - (l - n), that is, i + k = j + l.

[Figure: derivation tree in which S yields a^n … d^n around ABC, with A yielding a^(i-n) b^(i-n), B yielding b^(j-(i-n)) c^(k-(l-n)), and C yielding c^(l-n) d^(l-n).]

Exercise: Designing CFGs

Design a CFG for the following languages:

    Linear equations over integers, x, y, z, like:
        x + 5y - z = 9
        11x - y = 2
    Numbers without leading zeros, e.g., 109, 0 but not 019
    L1 = { a^n b^n c^m d^m | n ≥ 0, m ≥ 0 }
    L2 = { a^n b^m c^m d^n | n ≥ 0, m ≥ 0 }
    L3 = { 0^n 1^n | n ≥ 1 }
    L4 = { a^i b^j c^k | i ≠ j or j ≠ k }

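CFLs vs Regular Languages

Write a CFG for the language (0 + 1)*111

    S → A111
    A → ε | 0A | 1A

Can you do so for every regular language?

Every regular language is context-free.

Proof: every regular language is described by a regular expression (regular expressions, NFAs and DFAs are equivalent), and every regular expression can be converted to a CFG.

From Regular to Context-Free

    regular expression      CFG
    ∅                       grammar with no rules
    ε                       S → ε
    a (alphabet symbol)     S → a
    E1 + E2                 S → S1 | S2
    E1E2                    S → S1S2
    E1*                     S → SS1 | ε

Here S1 and S2 are the start symbols of the grammars for E1 and E2. In all cases, S becomes the new start symbol.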
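The regular-expression-to-CFG table can be turned into a short recursive procedure. The sketch below makes several assumptions that are not in the slides: the regular expression is given as nested tuples such as `("star", ("sym", "a"))`, fresh variables are named S0, S1, …, and a production body is a list of symbols with the empty list standing for ε. The function name `regex_to_cfg` is hypothetical.

```python
from itertools import count

def regex_to_cfg(regex):
    """Return (productions, start) for a regex given as nested tuples."""
    productions = {}
    fresh = count()

    def build(node):
        var = f"S{next(fresh)}"           # new start symbol for this node
        kind = node[0]
        if kind == "empty":               # empty set: grammar with no rules
            productions[var] = []
        elif kind == "eps":               # S -> eps
            productions[var] = [[]]
        elif kind == "sym":               # S -> a
            productions[var] = [[node[1]]]
        elif kind == "union":             # S -> S1 | S2
            s1, s2 = build(node[1]), build(node[2])
            productions[var] = [[s1], [s2]]
        elif kind == "concat":            # S -> S1 S2
            s1, s2 = build(node[1]), build(node[2])
            productions[var] = [[s1, s2]]
        elif kind == "star":              # S -> S S1 | eps
            s1 = build(node[1])
            productions[var] = [[var, s1], []]
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return var

    start = build(regex)
    return productions, start

# (0 + 1)* as nested tuples
prods, start = regex_to_cfg(("star", ("union", ("sym", "0"), ("sym", "1"))))
for head, bodies in prods.items():
    print(head, "->", " | ".join(" ".join(b) or "eps" for b in bodies))
```

Each case of `build` corresponds to one row of the table, and the variable returned for the outermost node plays the role of the new start symbol.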

CFLs vs Regular Languages

Is every context-free language regular? No! We already saw some examples:

    A → 0A1 | B
    B → #

    L = { 0^n # 1^n : n ≥ 0 }

This language is context-free but not regular.

Another CFL that is Not Regular

Language of palindromes

We can easily show using the pumping lemma that the language L = { w | w = w^R } is not regular. However, we can describe this language by the following context-free grammar over the alphabet {0,1}:

Inductive definition:

    P → ε
    P → 0
    P → 1
    P → 0P0
    P → 1P1

More compactly:

    P → ε | 0 | 1 | 0P0 | 1P1
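The inductive definition of palindromes translates almost line by line into a recursive recognizer. The Python sketch below is an illustration (the function name `derives_P` is hypothetical); it follows the productions P → ε | 0 | 1 | 0P0 | 1P1 and is compared against the direct test w = w^R on all short binary strings.

```python
from itertools import product

def derives_P(w):
    """True if w is derivable from P -> eps | 0 | 1 | 0P0 | 1P1."""
    if w in ("", "0", "1"):               # base cases of the grammar
        return True
    return (len(w) >= 2 and w[0] == w[-1] and w[0] in ("0", "1")
            and derives_P(w[1:-1]))       # P -> 0P0 or P -> 1P1

# Compare with the definition w = w^R on all binary strings up to length 6.
agree = all(
    derives_P("".join(bits)) == ("".join(bits) == "".join(bits)[::-1])
    for n in range(7)
    for bits in product("01", repeat=n)
)
print(agree)
```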

Parse Trees

Derivations can also be represented using parse trees.

    E → E + E | E - E | (E) | V
    V → x | y | z

    E ⇒ E + E
      ⇒ V + E
      ⇒ x + E
      ⇒ x + (E)
      ⇒ x + (E - E)
      ⇒ x + (V - E)
      ⇒ x + (y - E)
      ⇒ x + (y - V)
      ⇒ x + (y - z)

[Figure: parse tree for this derivation, with root E and leaves x + ( y - z ) read from left to right.]

The yield of the parse tree is x + (y - z).

Definition of a Parse Tree

A parse tree or derivation tree for a CFG G is an ordered tree with labels on the nodes such that
    Every internal node is labeled by a variable
    The root is labeled S
    Nodes labeled by ε are leaves with no siblings
    If a node is labeled A and has children X1, …, Xk from left to right, then the rule A → X1 … Xk is a production in G.

A subtree is a node of a tree with all its descendants and connecting edges.
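The definition of a parse tree and its yield can be made concrete with nested tuples. In this sketch, which is an illustration and not taken from the slides, an internal node is a pair `(label, children)`, a leaf is a plain string, and the names `GRAMMAR`, `tree_yield` and `is_parse_tree` are hypothetical. The checker enforces only the last condition of the definition (each node and its children form a production); it does not check the root label.

```python
# The grammar E -> E+E | E-E | (E) | V, V -> x | y | z.
GRAMMAR = {
    "E": [("E", "+", "E"), ("E", "-", "E"), ("(", "E", ")"), ("V",)],
    "V": [("x",), ("y",), ("z",)],
}

def tree_yield(tree):
    """Concatenate the leaf labels from left to right."""
    if isinstance(tree, str):
        return tree
    _, children = tree
    return "".join(tree_yield(child) for child in children)

def is_parse_tree(tree):
    """Check that every internal node and its children form a production."""
    if isinstance(tree, str):
        return True
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    return (child_labels in GRAMMAR.get(label, [])
            and all(is_parse_tree(c) for c in children))

def var(name):
    """Helper for the subtree E -> V -> name."""
    return ("E", [("V", [name])])

tree = ("E", [var("x"), "+",
              ("E", ["(", ("E", [var("y"), "-", var("z")]), ")"])])
print(is_parse_tree(tree), tree_yield(tree))
```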

A BNF Parse Tree

    <exp> ::= <ltexp> = <exp> | <ltexp>
    <ltexp> ::= <ltexp> < <subexp> | <subexp>
    <subexp> ::= <subexp> - <mulexp> | <mulexp>
    <mulexp> ::= <mulexp> * <rootexp> | <rootexp>
    <rootexp> ::= (<exp>) | a | b | c

[Figure: parse tree for a - b * c. The root <exp> has the single child <ltexp>, which has the single child <subexp>. That node has children <subexp>, -, <mulexp>. The left <subexp> derives a through <mulexp> and <rootexp>; the right <mulexp> has children <mulexp>, *, <rootexp>, which derive b (through <rootexp>) and c.]

CFGs and Parse Trees

Theorem: Let G = (V, T, P, S) be a context-free grammar. Then S ⇒* α if and only if there is a parse tree in grammar G with yield α.

Proof: Prove a stronger version of the theorem:

    For any A ∈ V, A ⇒* α if and only if there is an A-tree (i.e. a parse tree rooted at A) with yield α.

So if it is true for any A, it is true for S.

Parse Tree to Derivation

Prove: If G has a parse A-tree with yield α, then A ⇒* α.

Inductive proof on the number of interior vertices (i.v.) of the A-tree.

Basis: If there is only one i.v., it is A. A has children X1, X2, …, Xn where yield α = X1X2…Xn. So A → α is a production and A ⇒* α.

Induction: Suppose the result is true if |i.v.| < k, k > 1. Let α be the yield of an A-tree with k i.v.'s. Let the sons of A be X1, X2, …, Xn, so A → X1X2…Xn.

For each Xi which is not a leaf (they exist because k > 1), Xi is a variable with yield αi. Since the Xi-tree has fewer than k i.v.'s, by the inductive hypothesis Xi ⇒* αi.

For any Xj which is a leaf, Xj = αj, so Xj ⇒* αj.

So A ⇒ X1X2…Xn ⇒* α1α2…αn = α. So A ⇒* α.

Derivation to Parse Tree

Prove: If A ⇒* α then G has a parse tree with yield α.

Inductive proof on the number of steps in the derivation of α.

Basis: Suppose A ⇒ α in 1 step. Then A → α = X1X2…Xn is a production, and there is an A-tree with children the Xi's and yield α.

Induction: Suppose that for every B ∈ V, whenever B ⇒* β in less than k steps, k > 1, there is a parse tree with yield β. Suppose A ⇒* α = α1α2…αn in k steps, with the first step A ⇒ X1X2…Xn, so that Xi ⇒* αi if Xi is a variable or Xi = αi if Xi is a terminal.

If Xi is a variable, it derives αi in less than k steps, so there is a parse Xi-tree with yield αi.

Construct the A-tree with children Xi, where each Xi that is a terminal is the leaf αi and each Xi that is a variable is replaced by its Xi-tree. Clearly, this A-tree is a parse tree with yield α.

Leftmost & Rightmost Derivations

Leftmost Derivation always derives from the leftmost variable first:

    E ⇒ E + E
      ⇒ V + E
      ⇒ x + E
      ⇒ x + (E)
      ⇒ x + (E - E)
      ⇒ x + (V - E)
      ⇒ x + (y - E)
      ⇒ x + (y - V)
      ⇒ x + (y - z)

[Figure: parse tree for this derivation, with root E and leaves x + ( y - z ) read from left to right.]

Rightmost Derivation always derives from the rightmost variable first.

Many LM and RM Derivations

Let L be a CFL and w ∈ L.

A leftmost derivation of w corresponds to exactly one parse tree and vice versa.

A parse tree of w corresponds to exactly one rightmost derivation and vice versa.

w may have one or more derivations.
w may have one or more leftmost derivations (i.e. w may have one or more parse trees).
w may have one or more rightmost derivations (i.e. w may have one or more parse trees).

Ambiguity

The parse tree represents the intended meaning.
A grammar is ambiguous if some string has more than one parse tree (i.e. more than one leftmost derivation, or equivalently more than one rightmost derivation).

Example:

    E → E + E | E - E | E × E | (E) | V
    V → x | y | z

[Figure: two parse trees, both with yield x + y × z. In the first the root applies E → E + E, so the meaning is "first multiply y and z, and then add this to x". In the second the root applies E → E × E, so the meaning is "first add x and y, and then multiply z to this".]

Disambiguation Example

Some ambiguous grammars can be disambiguated by enforcing precedence and associativity rules.

    E → E+E | E-E | E*E | E/E | E^E | x | y | z | (E)

precedence, highest first:

    ^       (right to left)
    *, /    (left to right)
    +, -    (left to right)

    P → x | y | z | (E)       (start with most basic indivisible elements)
    F → P^F | P               (not F → F^P because ^ is right to left)
    T → T*F | T/F | F         (in each step, refer to next higher precedence level)
    E → E+T | E-T | T

In x*y^z+x/(y-z):
    T stands for term: x*y^z, x/(y-z), y, z
    F stands for factor: x, y^z, x, (y-z)
    P stands for power: x, y, z, (y-z)
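The effect of the layered grammar E, T, F, P on precedence and associativity can be seen by parsing an expression and printing it fully parenthesized. The following is a minimal recursive-descent sketch, not the slides' method: the left-recursive rules E → E+T and T → T*F are implemented as loops (a standard substitution, since direct left recursion would not terminate in this style of parser), and the function name `parenthesize` is hypothetical. Input is assumed to contain no spaces.

```python
def parenthesize(text):
    """Parse text with the E/T/F/P grammar and return it fully parenthesized."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def take():
        nonlocal pos
        ch = peek()
        pos += 1
        return ch

    def parse_E():                 # E -> E+T | E-T | T   (left to right)
        left = parse_T()
        while peek() in ("+", "-"):
            op = take()
            left = f"({left}{op}{parse_T()})"
        return left

    def parse_T():                 # T -> T*F | T/F | F   (left to right)
        left = parse_F()
        while peek() in ("*", "/"):
            op = take()
            left = f"({left}{op}{parse_F()})"
        return left

    def parse_F():                 # F -> P^F | P         (right to left)
        base = parse_P()
        if peek() == "^":
            take()
            return f"({base}^{parse_F()})"
        return base

    def parse_P():                 # P -> x | y | z | (E)
        ch = take()
        if ch in ("x", "y", "z"):
            return ch
        if ch == "(":
            inner = parse_E()
            if take() != ")":
                raise ValueError("expected ')'")
            return inner
        raise ValueError(f"unexpected symbol {ch!r}")

    result = parse_E()
    if pos != len(text):
        raise ValueError("unexpected trailing input")
    return result

print(parenthesize("x*y^z+x/(y-z)"))
print(parenthesize("x-y-z"), parenthesize("x^y^z"))
```

Because each string has only one parse under this grammar, the parenthesized output is unique.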

Inherently Ambiguous Languages

Can we always disambiguate a grammar? No, for two reasons:

1. There exist inherently ambiguous context-free languages L: every CFG for such a language L is ambiguous.

    Ex. L = { a^n b^n c^m d^m | n ≥ 1, m ≥ 1 } ∪ { a^n b^m c^m d^n | n ≥ 1, m ≥ 1 }

    Text has shown: a^n b^n c^n d^n, n ≥ 1, has more than one leftmost derivation.

2. There is no general procedure that can tell if a grammar is ambiguous.

However, grammars used in programming languages can typically be disambiguated.

Recursive Inference

A bottom-up process for the derivation of a string.

Example

    E → T | E + T | E - T
    T → F | T × F
    F → (E) | V
    V → x | y | z

[Figure: bottom-up inference tree over the leaves x, y, z with internal nodes labeled V, F, T, E.]

Another Example

    S → aB | bA
    A → a | aS | bAA
    B → b | bS | aBB
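Membership in the language of a small grammar like this one can be tested mechanically. The sketch below is an illustration with hypothetical names (`PRODUCTIONS`, `derives`). It relies on a property of this particular grammar: every production body starts with a terminal and no body is empty, so expanding the leftmost variable always lets the next step consume one input symbol, and a sentential form longer than the remaining input can be rejected at once.

```python
from functools import lru_cache

PRODUCTIONS = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}

@lru_cache(maxsize=None)
def derives(form, w):
    """True if the sentential form derives the terminal string w."""
    if not form:
        return w == ""
    if len(form) > len(w):        # every symbol yields at least one terminal
        return False
    head, rest = form[0], form[1:]
    if head in PRODUCTIONS:       # expand the leftmost variable
        return any(derives(body + rest, w) for body in PRODUCTIONS[head])
    return w[0] == head and derives(rest, w[1:])   # match a terminal

for w in ["ab", "baba", "abbbaa", "a", "bba"]:
    print(w, derives("S", w))
```

Running the function on many strings is one way to form a conjecture about what the language is.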

Are ab, baba, abbbaa in L? How about a, bba?
What is the language of this CFG?
Is the CFG ambiguous?

End
