CHANDIGARH COLLEGE OF ENGINEERING AND
TECHNOLOGY
(DEGREE WING)
Government Institute under Chandigarh (UT) Administration,
Affiliated to Panjab University, Chandigarh
Sector-26, Chandigarh, PIN-160019
ASSIGNMENT 1
SUBJECT: COMPILER DESIGN
COMPUTER SCIENCE AND ENGINEERING
SUBMITTED BY:                      SUBMITTED TO:
Deepika Goyal (MCO22385)           Dr. Gulshan Goyal
Diksha (CO22325)                   CSE Department
LEXICAL ANALYSIS
1. Consider the following C code fragment:
while (count < 5) {
    total += count;
    count = count + 1;
}
For each lexeme, specify the corresponding token.
Lexeme   Token
while    <KEYWORD, "while">
(        <LPAREN, "(">
count    <IDENTIFIER, "count">
<        <RELOP, "<">
5        <NUMBER, 5>
)        <RPAREN, ")">
{        <LBRACE, "{">
total    <IDENTIFIER, "total">
+=       <ADD_ASSIGN, "+=">
count    <IDENTIFIER, "count">
;        <SEMICOLON, ";">
count    <IDENTIFIER, "count">
=        <ASSIGN, "=">
count    <IDENTIFIER, "count">
+        <ADDOP, "+">
1        <NUMBER, 1>
;        <SEMICOLON, ";">
}        <RBRACE, "}">
2. Evaluate how choosing an NFA versus a DFA affects lexer design and runtime. For each
(NFA and DFA), state one advantage and one drawback in token recognition.
Using an NFA for token recognition simplifies lexer construction since each regular expression
can be translated almost directly into an NFA with ε-transitions, making it easy to add or modify
token rules without extensive reworking. However, at runtime the NFA’s nondeterminism means
the lexer may need to track multiple active states or backtrack when paths fail, which can
significantly slow down scanning on large or complex inputs. In contrast, a DFA provides truly
deterministic scanning: after the expensive initial conversion from NFA to DFA, each input
character leads to exactly one state transition, enabling linear‐time lexing that is as fast as
possible. The drawback is that this conversion may produce a DFA with exponentially many
states, so the resulting transition table can grow too large to store efficiently, especially in
memory‐constrained environments.
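As a small illustration of the DFA's one-transition-per-character behaviour, the following C sketch (not part of the original question; the states and their numbering are chosen purely for illustration) recognizes identifiers of the form letter (letter | digit)* in a single linear pass:

#include <stdio.h>
#include <ctype.h>

/* Illustrative DFA for identifiers: letter (letter | digit)*            */
/* States: 0 = start, 1 = inside an identifier, 2 = dead (reject).       */
/* Each input character causes exactly one state transition, so scanning */
/* an n-character input takes O(n) time regardless of the regex that     */
/* the DFA was built from.                                                */
static int dfa_step(int state, char c) {
    switch (state) {
    case 0:  return isalpha((unsigned char)c) ? 1 : 2;
    case 1:  return isalnum((unsigned char)c) ? 1 : 2;
    default: return 2;
    }
}

static int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        state = dfa_step(state, *s);
        if (state == 2) return 0;          /* dead state: reject early */
    }
    return state == 1;                     /* accept only in state 1   */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("count"),   /* 1 */
                         is_identifier("x1"),      /* 1 */
                         is_identifier("1x"));     /* 0 */
    return 0;
}

An NFA-based matcher for the same pattern would instead have to track a set of active states (or backtrack) at every step, which is exactly where the runtime cost described above comes from.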
3. Given a C source file containing comments, whitespace, and preprocessor directives,
analyze how a lexical analyzer processes each of these elements and explain why isolating
them from the token stream is critical for the correctness of subsequent parsing and
semantic analysis.
When a lexical analyzer encounters comments, it uses predefined patterns—typically regular
expressions that match both single‐line (//…) and multi‐line (/*…*/) comment forms—to
recognize and discard the entire comment sequence without emitting any tokens; similarly,
whitespace (spaces, tabs, newlines) is identified through simple character‐class rules and
removed so that only meaningful lexemes remain. Preprocessor directives, on the other hand, are
usually recognized by a leading # and either forwarded as a single, undivided token to a
dedicated preprocessing phase (where directives such as #include, #define, and #ifdef are
handled) or stripped out entirely if preprocessing occurs prior to lexical analysis. This separation
is important because it ensures that the parser receives a clean stream of syntactically relevant
tokens—free from formatting noise or directive syntax—so that grammar rules and semantic
checks can be applied consistently and efficiently without needing to handle comment text or
whitespace, which have no effect on program structure, and without confusing directive syntax
with regular language constructs.
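A minimal C sketch of this behaviour (an illustrative simplification that assumes the whole source is already in one buffer) shows how whitespace and both comment forms can be consumed without producing any tokens:

#include <stdio.h>

/* Advance past whitespace and C-style comments without emitting tokens, */
/* so that only meaningful lexemes reach the parser.                      */
const char *skip_ignorable(const char *p) {
    for (;;) {
        if (*p == ' ' || *p == '\t' || *p == '\n') {
            p++;                                   /* whitespace            */
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p && *p != '\n') p++;          /* single-line comment   */
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;                                /* multi-line comment    */
            while (*p && !(p[0] == '*' && p[1] == '/')) p++;
            if (*p) p += 2;
        } else {
            return p;                              /* start of next lexeme  */
        }
    }
}

int main(void) {
    const char *src = "  /* init */  count = 0; // set\n";
    printf("next lexeme starts at: %s", skip_ignorable(src));
    return 0;
}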
4. Imagine you are building a compiler’s lexical analyzer for a new programming language.
Analyze the sequence of tasks the lexer must perform—from handling raw input to
producing a clean token stream—and explain why dividing these tasks into distinct phases
is essential for correct and efficient compilation.
A lexical analyzer begins by buffering the raw source text, handling line continuations or
inclusion markers so that it can look ahead efficiently without repeated disk access. In this phase,
the lexer ensures that the entire input is available in memory in a format that supports fast
character‐by‐character examination. Next, the lexer proceeds to pattern matching, where it uses
deterministic finite automata derived from regular expressions to recognize lexemes such as
identifiers, keywords, literals, and operators. By isolating lexeme recognition into its own phase,
the lexer can focus solely on identifying valid substrings without worrying about how they will
be represented or used later. Once a lexeme is matched, the lexer enters the token generation
phase, converting each lexeme into a token object that contains a token type and any needed
attribute (for example, the actual numeric value of a literal or a pointer to a symbol‐table entry).
Separating token generation ensures that downstream components receive a uniform, abstract
representation rather than raw character sequences. Finally, the lexer performs error handling: if
it encounters illegal characters, an unterminated string, or any malformed construct, it reports the
error immediately with location information. Keeping error detection at this stage prevents
invalid lexemes from propagating into parsing or semantic analysis, where they could cause
misleading or cascading failures. By dividing buffering, pattern matching, token creation, and
error handling into distinct phases, the lexer can be optimized and maintained more easily, and
later compiler stages can operate on a reliable, noise‐free token stream.
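A minimal sketch of the token object such a lexer might hand to the parser is given below; the type and field names are illustrative assumptions, not a prescribed layout:

#include <stdio.h>

/* Token produced by the token-generation phase: a type tag plus the      */
/* matched lexeme and its source location, so later phases never have to  */
/* touch raw characters.                                                   */
enum TokenType { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_SEMICOLON, TOK_ERROR };

struct Token {
    enum TokenType type;
    char lexeme[64];   /* the matched text            */
    int  line;         /* location for error messages */
    int  column;
};

static void report_error(const struct Token *t) {
    /* Error-handling phase: report immediately with location info so the */
    /* bad lexeme never propagates into parsing or semantic analysis.      */
    fprintf(stderr, "line %d, col %d: illegal lexeme '%s'\n",
            t->line, t->column, t->lexeme);
}

int main(void) {
    struct Token bad = { TOK_ERROR, "@", 3, 12 };
    report_error(&bad);
    return 0;
}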
5. Given an NFA in which state q₀ has ε‐transitions to q₁ and q₂, and q₁ in turn has an ε‐
transition to q₃, analyze the reachable states and determine the ε‐closure of { q₀ }. Explain
your reasoning.
Starting from q₀, you can follow an ε‐transition to q₁ and also to q₂, so q₁ and q₂ are included.
From q₁, there is another ε‐transition to q₃, so q₃ is also reachable without consuming any input.
Therefore, the ε‐closure of { q₀ } consists of all the states { q₀, q₁, q₂, q₃ }.
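The same computation can be expressed as a small depth-first search; the following C sketch hard-codes the ε-transitions given in the question and prints the closure:

#include <stdio.h>

/* Epsilon-closure by depth-first search over the NFA in the question:    */
/* q0 --eps--> q1, q0 --eps--> q2, q1 --eps--> q3.                        */
#define NSTATES 4

int eps[NSTATES][NSTATES] = {      /* eps[i][j] = 1 if qi --eps--> qj */
    {0, 1, 1, 0},                  /* q0 -> q1, q2 */
    {0, 0, 0, 1},                  /* q1 -> q3     */
    {0, 0, 0, 0},                  /* q2           */
    {0, 0, 0, 0},                  /* q3           */
};

void closure(int state, int visited[NSTATES]) {
    if (visited[state]) return;
    visited[state] = 1;                       /* the state itself belongs to the closure */
    for (int j = 0; j < NSTATES; j++)
        if (eps[state][j]) closure(j, visited);
}

int main(void) {
    int visited[NSTATES] = {0};
    closure(0, visited);                      /* epsilon-closure of {q0} */
    printf("eps-closure({q0}) =");
    for (int i = 0; i < NSTATES; i++)
        if (visited[i]) printf(" q%d", i);
    printf("\n");                             /* prints: q0 q1 q2 q3 */
    return 0;
}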
6. Given the requirement to recognize identifiers (letters followed by letters or digits) and
integer literals in source code, analyze how you would structure a Lex specification. In
particular, define auxiliary patterns for letter and digit, and then write the transition
rules that produce ID(...) for identifiers and NUM(...) for integer literals, while
skipping whitespace and reporting any other single character as unknown.
/* Auxiliary definitions */
letter [A-Za-z]
digit [0-9]
%%
    /* Transition rules */
{letter}({letter}|{digit})* { printf("ID(%s)\n", yytext); }
{digit}+ { printf("NUM(%s)\n", yytext); }
[ \t\n]+ { /* skip whitespace */ }
. { printf("UNK(%s)\n", yytext); }
%%
/* No auxiliary procedures are needed */
The specification defines two named patterns: letter matches any alphabetic character, and
digit matches any numeral. Between the %% markers are the transition rules. The first rule
matches identifiers—one letter followed by zero or more letters or digits—and prints them as
ID(...). The second rule matches one or more digits and prints them as NUM(...). Whitespace is
skipped, and any other single character is reported as UNK(...). After the final %%, the
auxiliary-procedure section is left empty because no helper functions are needed.
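If the specification is to be tried out, one option (assuming GNU flex is available) is to add %option noyywrap to the definitions section and place a small driver in the otherwise empty user-code section, then build with, for example, flex scanner.l followed by cc lex.yy.c -o scanner:

/* Hypothetical driver for the user-code section of the Lex file above */
int main(void) {
    yylex();          /* scan standard input until end of file */
    return 0;
}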
7. Analyze why a compiler’s lexical analyzer must build and maintain a symbol table as it
processes source code. Illustrate your explanation with a concrete example showing how
identifier information is used later.
A symbol table is essential during lexical analysis because it serves as the compiler’s repository
for each identifier’s attributes—such as its name, type, and scope—allowing later phases to
enforce language rules and generate correct code. For instance, when the lexer encounters the
declaration int count = 0;, it adds an entry for count keyed by its lexeme, storing that its
type is int and tracking its scope (e.g., global or local). Later, during semantic analysis, the
compiler consults the symbol table so that if it sees count = "text";, it can immediately
detect a type mismatch: the table tells the compiler that count was declared as an integer, so
assigning a string literal is invalid. Without this centralized mapping of identifier properties, the
compiler would have no reliable way to check that uses of count are consistent with its
declaration, leading potentially to undetected errors or incorrect code generation.
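A compact sketch of such a table (array-based, with illustrative field names; a real compiler would typically hash on the lexeme and record far more attributes) makes the flow from declaration to later lookup concrete:

#include <stdio.h>
#include <string.h>

/* Minimal symbol-table sketch: an entry for "count" is inserted when the */
/* declaration int count = 0; is processed, and semantic analysis later    */
/* looks the entry up to reject count = "text";.                           */
struct Symbol {
    char name[32];
    char type[16];   /* e.g. "int"              */
    int  scope;      /* 0 = global, 1 = local   */
};

static struct Symbol table[100];
static int nsyms = 0;

void insert(const char *name, const char *type, int scope) {
    strcpy(table[nsyms].name, name);
    strcpy(table[nsyms].type, type);
    table[nsyms].scope = scope;
    nsyms++;
}

struct Symbol *lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0) return &table[i];
    return NULL;
}

int main(void) {
    insert("count", "int", 0);                      /* from: int count = 0;     */
    struct Symbol *s = lookup("count");             /* during semantic analysis */
    if (s == NULL)
        printf("undeclared identifier\n");
    else if (strcmp(s->type, "int") != 0)
        printf("type error: count is not an int\n");
    else
        printf("count declared as %s\n", s->type);  /* prints: ... as int       */
    return 0;
}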
8. Imagine you are designing a lexer for a programming language. Analyze how regular
expressions are employed to define token patterns and explain why this approach is both
expressive and efficient in the context of lexical analysis.
Regular expressions specify each token’s valid character sequences—for example, [A-Za-z_]
[A-Za-z0-9_]* for identifiers—so the lexer can match input substrings against these
patterns. Internally, each regex is compiled into a finite automaton (typically a DFA), allowing
the lexer to scan characters and recognize the longest matching lexeme in linear time. Once a
match is found, the lexer emits a token of the corresponding type (e.g., ID("count") or
NUM("123")) and continues scanning. By using regexes, token definitions remain concise,
expressive, and efficiently executable.
9. Analyze how a lexer processes the statement
sum = x + 10;
step by step, identifying the boundaries between tokens and explaining why each lexeme is
recognized where it is.
The lexer begins reading sum = x + 10; from the first character. It first encounters s, u, and
m in sequence, recognizing that this contiguous sequence matches the identifier pattern (letter
followed by letters or digits). Upon seeing the space after sum, the lexer knows the identifier has
ended, so it emits an ID("sum") token and discards the whitespace. Next, it sees the =
character, which by itself matches the assignment‐operator pattern, so it immediately emits an
ASSIGN("=") token. After skipping the following space, the lexer reads x—a single letter that
again matches the identifier pattern—so it emits ID("x") when the next space appears. That
space is ignored, and then the + character is encountered; since + is defined as an addition
operator, the lexer emits an ADDOP("+") token. Skipping another space, the lexer reads the
characters 1 and 0 consecutively; recognizing that these together form a valid numeric literal
(digit‐only sequence), it emits NUM("10") as soon as the semicolon appears. Finally, the lexer
sees ; and emits SEMICOLON(";"). Each time the lexer detects a switch from letters (or
digits) to a non‐alphanumeric symbol—or vice versa—it knows a lexeme boundary has been
reached, allowing it to segment the input into the correct sequence of tokens.
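The complete token stream handed to the parser is therefore:

ID("sum")  ASSIGN("=")  ID("x")  ADDOP("+")  NUM("10")  SEMICOLON(";")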
INTERMEDIATE CODE GENERATOR
1. What does a compiler's intermediate code generation mean?
Intermediate code generation is the compiler phase that converts high-level source code into an
intermediate representation (IR) that does not depend on the architecture of the target machine.
This IR connects the front end of the compiler (lexical, syntactic, and semantic analysis) with the
back end (code optimization and target code generation). Usually simpler than the source code
but more abstract than machine code, the intermediate code facilitates analysis and optimization.
A statement such as x = a + b, for instance, could be transformed into the three-address code
(TAC) form t1 = a + b; x = t1. By separating the source language from the target machine, this
step guarantees platform portability, streamlines optimization (e.g., eliminating redundant
calculations), and makes it easier to generate target code for diverse architectures.
2. What are the different kinds of intermediate code?
In compiler design, intermediate codes serve as a bridge between source code and machine code,
with different forms serving specific purposes. The main types include Postfix Notation (Reverse
Polish Notation), which places operators after operands (e.g., a b + for a + b), enabling efficient
stack-based evaluation without parentheses. Syntax Trees represent the program's hierarchical
structure, with nodes for operations and leaves for operands, aiding in semantic analysis and
optimization. Three-Address Code appears in multiple forms: Quadruples (operator, operand1,
operand2, result) explicitly store temporary results, making code generation straightforward;
Triples (operator, operand1, operand2) reference results by position, saving space but
complicating optimization; and Indirect Triples, which use pointers to separate instruction order
from references, offering flexibility during code optimization. These representations balance
readability, optimization potential, and ease of translation to target code, with the choice
depending on compiler requirements.
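As a small illustration (using one common notation; the temporaries t1 and t2 are chosen arbitrarily), the single statement a = b + c * d could appear in these forms as follows:

Postfix notation:   a b c d * + =
Quadruples:         (1) (*, c, d, t1)   (2) (+, b, t1, t2)   (3) (=, t2, -, a)
Triples:            (0) (*, c, d)       (1) (+, b, (0))      (2) (=, a, (1))
Indirect triples:   a pointer table, e.g. [35: (0), 36: (1), 37: (2)], listing the triples in execution order.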
3. Why do compilers use intermediate code?
Intermediate code is used by compilers to streamline the compilation process and improve
efficiency in a number of ways. By producing a machine-independent representation of the
source code, it first offers portability, enabling the use of the same intermediate code for several
target architectures (such as x86 and ARM) without requiring a front-end rewrite. Second, it
makes optimization easier by offering a more uniform and straightforward structure for
transformations that are more difficult to carry out directly on source or machine code, such as
constant folding, loop optimization, or dead code removal. For instance, if the expression a + b is
computed repeatedly, the intermediate code t1 = a + b can be generated once and t1 reused
wherever the value is needed. Third, it simplifies the process of generating target code by
decomposing the intricate source-to-machine translation into smaller, more manageable phases,
which makes it easier to map intermediate instructions onto particular machine instructions.
Furthermore, intermediate code facilitates fault detection and debugging during the compilation
process, enhancing the compiler's overall robustness.
4. How do TAC and DAG vary from one another?
Three-Address Code (TAC) and Directed Acyclic Graph (DAG) are two different intermediate
representations in compiler design that serve different functions. TAC is a linear series of
instructions with a maximum of three operands per instruction (t1 = a + b, for example). It is
utilized in later compilation stages such as code generation and optimization, looks like low-level
code, and is simple to generate. Nevertheless, duplicate calculations (such as calculating a + b
more than once) can be present in TAC. A DAG, on the other hand, represents expressions
graphically, with nodes standing for variables or operations and edges indicating dependencies
(for example, a + node with children a and b). During optimization, DAGs are mostly used to
eliminate common subexpressions and redundant computations: if a + b appears twice, the DAG
reuses the same node, which reduces computation. TAC and DAGs are complementary tools in
the compilation pipeline, since TAC concentrates on a simple, sequential form for translation
while DAGs concentrate on structural analysis for optimization.
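As a concrete illustration, for x = (a + b) * (a + b) a naive TAC computes the sum twice:

t1 = a + b
t2 = a + b
t3 = t1 * t2
x  = t3

whereas the DAG contains a single + node with children a and b that both operands of the * node share, so TAC regenerated from the DAG needs only:

t1 = a + b
t2 = t1 * t1
x  = t2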
5. In intermediate code, what is a quadruple representation?
A quadruple is a kind of intermediate code representation used in compiler design that expresses
operations in a form with four fields: operator, argument1, argument2, and result. It is simple to
convert into machine code because each quadruple represents a single operation. For instance,
the quadruple (+, a, b, t1) represents the statement t1 = a + b, where t1 is the result, a and b are
the arguments, and + is the operator. Because each field precisely identifies the operation's
components, quadruples are explicit and simple, which makes code generation and optimization
easier. A compiler can, for example, use quadruples to carry out transformations like constant
folding during optimization (e.g., substituting t1 = 2 + 3 with t1 = 5). Furthermore, quadruples
support complex expressions by dividing them into smaller, more manageable instructions,
remaining compatible with the three-address code structure used by many compilers.
6. Convert the following arithmetic expression into Three Address Code
x = (a * b) + (c / d) - (e % f)
The given expression is decomposed into a sequence of simple operations, each computing a
single operator and storing its result in a temporary. The Three Address Code (TAC) for the
expression is generated as follows:
A. Multiplication Step: t1 = a * b
Here, a and b are multiplied, and the result is stored in temporary variable t1.
B. Division Step: t2 = c / d
The division of c by d is computed, and the result is stored in t2.
C. Modulus Step: t3 = e % f
The modulus operation (e % f) is performed, and the result is stored in t3.
D. Addition Step: t4 = t1 + t2
The intermediate results t1 (from multiplication) and t2 (from division) are added,
and the result is stored in t4.
E. Final Subtraction & Assignment: x = t4 - t3
The final result is obtained by subtracting t3 (modulus result) from t4 (sum of
multiplication and division), and the value is assigned to x.
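Putting the steps together, the complete TAC sequence is:

t1 = a * b
t2 = c / d
t3 = e % f
t4 = t1 + t2
x  = t4 - t3

and one possible quadruple layout for the same code is:

(1) (*, a, b, t1)
(2) (/, c, d, t2)
(3) (%, e, f, t3)
(4) (+, t1, t2, t4)
(5) (-, t4, t3, x)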
7. What issue is resolved by indirect triple representation?
One significant drawback of standard triples in intermediate code representation is addressed by
indirect triples. Standard triples store each instruction as (operator, argument1, argument2) and
reference results by their index (e.g., (+, a, b) at index 0). However, because shifting the triple
order changes the indices and necessitates updating all references, this positional dependency
causes problems when making code changes such as optimization or instruction reordering.
Indirect triples address this issue by adding a pointer table that associates listing positions with
the actual triples. If the pointer table holds entries such as [0: (+, a, b), 1: (*, 0, c)], rearranging
the instructions only requires updating the pointer table, not the references inside the triples.
This makes the compilation process more reliable and efficient by simplifying the handling of
intermediate code and offering flexibility in code transformations (such as loop optimization or
common-subexpression elimination).
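A small illustration for t1 = a + b; t2 = t1 * c (the entry numbers are arbitrary):

Triples                     Pointer table (indirect triples)
(0) (+, a, b)               35: (0)
(1) (*, (0), c)             36: (1)

If an optimization pass reorders the code, only the pointer-table entries 35 and 36 move; the triples themselves, and references such as (0) inside triple (1), remain unchanged.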
8. What is the main benefit of employing triples as opposed to quadruples?
Space savings is the main advantage of utilizing triples rather than quadruples for representing
intermediate code. Triples are made up of three fields: operator, argument1, and argument2 (for
example, (+, a, b)). The triple's position in the list (for example, index 0) refers to the result.
Quadruples, in contrast, need four fields—operator, argument1, argument2, and result (e.g.,
(+, a, b, t1))—so triples eliminate the separate result field and the explicitly named temporary
variables. For instance, the statement t1 = a + b; t2 = t1 * c requires two quadruples with
explicit temporaries, (+, a, b, t1) and (*, t1, c, t2), whereas in triples it becomes (+, a, b) at
index 0 and (*, 0, c) at index 1, saving the space used for temporary names. Although it may
need extra handling for reordering, this condensed representation still supports code generation
and optimization while using less memory in the compiler, particularly for large programs.
9. Use a three-address code to distinguish between assignment instructions and
assignment statements.
In three-address code (TAC), assignment statements and assignment instructions differ in the
kind of operation they represent. An assignment statement assigns a variable the outcome of a
binary operation between two operands: the form x = y op z, where op is a binary operator such
as +, -, or *, appears in TAC as, for example, x = y + z, a single statement that computes y + z
and stores the result in x. An assignment instruction, on the other hand, applies a unary
operation—such as negation or bitwise NOT—to a single operand: the form x = op y appears in
TAC as, for example, x = -y. The difference lies in the number of operands: one operand (y) for
instructions and two operands (y and z) for statements. This distinction helps the compiler
manage operator precedence correctly and produce suitable machine code.
10. Describe how x = &y, x = *y, and *x = y vary from one another.
The operations x = &y, x = *y, and *x = y represent different pointer-related actions in languages
such as C, each with a separate meaning when it comes to intermediate code generation in a
compiler:
A. x = &y: The operation gives x the address of y. Since & is the address-of
operator in this case, x turns into a pointer that contains y's memory address. For
instance, x will be set to 1000 if y is at address 1000. This might be expressed as
x = addr y in three-address code.
B. x = *y: This operation dereferences the pointer y and assigns to x the value stored at
the address that y points to. For example, x becomes 42 if y contains
address 1000 and the value at 1000 equals 42. This is expressed as x = *y in
three-address code.
C. *x = y: In this operation, the value of y is stored at the address that x (a pointer)
points to. For instance, the memory at location 2000 is updated to 50 if x points
to that address and y is 50. *x = y is the representation of this in three-address
code.