COMPILATION: PROCESS
SOFTWARE
DEFINITION - Software is a general term for the various kinds
of programs used to operate computers and related devices. (The term hardware describes the physical aspects of computers and related devices.) Software can be thought of as the variable part of a computer and hardware the invariable part. Software is often divided into application software (programs that do work users are directly interested in) and system software (which includes operating systems and any program that supports application software). The term middleware is sometimes used to describe programming that mediates between application and system software or between two different kinds of application software (for example, sending a remote work request from an application in a computer that has one kind of operating system to an application in a computer with a different operating system). An additional and difficult-to-classify category of software is the utility, which is a small useful program with limited capability. Some utilities come with operating systems. Like applications, utilities tend to be separately installable and capable of being used independently from the rest of the operating system.
TYPES OF SOFTWARE
Practical computer systems divide software systems into three major classes: system software, programming software and application software, although the distinction is arbitrary and often blurred.
[A] System software
System software helps run the computer hardware and computer system. It includes a combination of the following: device drivers, operating systems, servers, utilities and windowing systems.
The purpose of systems software is to unburden the applications programmer from the often complex details of the particular computer being used, including such accessories as communications devices, printers, device readers, displays and keyboards, and also to partition the computer's resources, such as memory and processor time, in a safe and stable manner. Examples are Windows XP, Linux, and Mac OS X.
[B] Programming software
Programming software usually provides tools to assist a programmer in writing computer programs and software using different programming languages in a more convenient way. The tools include compilers, debuggers, interpreters, linkers and text editors.
An integrated development environment (IDE) is a single application that attempts to manage all these functions.
[C] Application software
Application software allows end users to accomplish one or more specific (not directly computer development related) tasks. Typical applications include industrial automation, business software, video games, quantum chemistry and solid state physics software, telecommunications (i.e., the Internet and everything that flows on it), databases, educational software, medical software, military software, molecular modelling software, image editing, spreadsheets, simulation software, word processing and decision-making software.
Application software has been developed for, and has had an impact on, a wide variety of fields.
COMPILER
A compiler is a computer program (or set of programs) that transforms source code written in a computer language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for wanting to transform source code is to create an executable program. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). A program that translates from a low-level language to a higher-level one is a decompiler. A program that translates between high-level languages is usually called a language translator, source-to-source translator, or language converter. A language rewriter is usually a program that translates the form of expressions without a change of language. A compiler is likely to perform many or all of the following operations: lexical analysis, preprocessing, parsing, semantic analysis, code generation, and code optimization. Program faults caused by incorrect compiler behaviour can be very difficult to track down and work around, and compiler implementers invest a lot of time ensuring the correctness of their software. The term compiler-compiler is sometimes used to refer to a parser generator, a tool often used to help create the lexer and parser.
The process of compilation is quite complex. We can view the compilation process as consisting of a series of sub-processes called phases. Each phase takes as input one representation of the source program and produces as output another representation. Two important aspects of the process of compilation are: (a) generating code to implement the meaning of the source program in the execution domain, and (b) providing diagnostics (error-checking features) to detect violations of PL rules in the source program.
STAGES FROM SOURCE TO EXECUTABLE
1. Compilation: source code ==> relocatable object code (binaries)
2. Linking: many relocatable binaries (modules plus libraries) ==> one relocatable binary (with all external references satisfied)
3. Loading: relocatable binary ==> absolute binary (with all code and data references bound to the addresses occupied in memory)
4. Execution: control is transferred to the first instruction of the program
At compile time (CT), absolute addresses of variables and statement labels are not known. In static languages (such as FORTRAN), absolute addresses are bound at load time (LT). In block-structured languages, bindings can change at run time (RT).
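As a small illustration of stages 1 and 2 above, consider a program split across two C source files. The file names and the cc commands shown in the comments are only an assumed, typical workflow (a GCC-style toolchain); the point is that main.o contains an unresolved external reference to sum until the link step satisfies it.

    /* sum.c -- compiled separately into a relocatable object file:
     *     cc -c sum.c              (produces sum.o)
     */
    int sum(int a, int b) {
        return a + b;
    }

    /* main.c -- refers to sum(), an external reference until link time:
     *     cc -c main.c             (produces main.o)
     *     cc main.o sum.o -o prog  (links the two objects into one program)
     */
    #include <stdio.h>

    int sum(int a, int b);    /* declaration only; the definition lives in sum.o */

    int main(void) {
        printf("%d\n", sum(3, 2));
        return 0;
    }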
PHASES OF COMPILATION
1. Lexical and syntactic analysis: the source text is scanned into tokens, which are grouped into syntactic structures, typically represented by a parse tree. The parser may be replaced by a syntax-directed editor, which directly generates a parse tree as a product of editing.
2. Semantic analysis: intermediate code is generated for each syntactic structure. Type checking is performed in this phase. Complicated features such as generic declarations and operator overloading (as in Ada and C++) are also processed.
3. Machine-independent optimization: intermediate code is optimized to improve efficiency.
4. Code generation: intermediate code is translated to relocatable object code for the target machine.
5. Machine-dependent optimization: the machine code is optimized.
MEMORY ALLOCATION
Memory binding/allocation is an association between the memory address attribute of a data item and the address of a memory area. Memory binding can be static or dynamic in nature.
STATIC MEMORY ALLOCATION
Static memory allocation refers to the process of allocating memory at compile time, before the associated program is executed, unlike dynamic memory allocation or automatic memory allocation where memory is allocated as required at run time. An application of this technique involves a program module (e.g. function or subroutine) declaring static data locally, such that these data are inaccessible to other modules unless references to them are passed as parameters or returned. A single copy of static data is retained and accessible through many calls to the function in which it is declared. Static memory allocation therefore has the advantage of modularising data within a program design in the situation where these data must be retained through the runtime of the program. The use of static variables within a class in object-oriented programming enables a single copy of such data to be shared between all the objects of that class. Object constants known at compile time, like string literals, are usually allocated statically. In object-oriented programming, the virtual method tables of classes are usually allocated statically. A statically defined value can also be global in its scope, ensuring the same immutable value is used throughout a run for consistency.
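A minimal C sketch of statically allocated data retained across calls; the function next_id and its counter are purely illustrative names.

    #include <stdio.h>

    /* The counter is allocated statically: it exists for the whole run of the
     * program and keeps its value between calls, but is visible only inside
     * next_id(). */
    int next_id(void) {
        static int counter = 0;    /* allocated and initialized before execution begins */
        return ++counter;
    }

    int main(void) {
        printf("%d\n", next_id());   /* 1 */
        printf("%d\n", next_id());   /* 2 -- the single static copy was retained */
        printf("%d\n", next_id());   /* 3 */
        return 0;
    }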
LEXICAL ANALYSIS
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner. A lexer often exists as a single function which is called by a parser or another function.
Lexical grammar
The specification of a programming language will often include a set of rules which defines the lexer. These rules are usually called regular expressions and they define the set of possible character sequences that are used to form tokens or lexemes. White space (i.e. characters that are ignored) is also defined in the regular expressions.
Token
A token is a string of characters, categorized according to the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.). The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes them according to a symbol type. A token can look like anything that is useful for processing an input text stream or text file. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each '(' is matched with a ')'. Consider this expression in the C programming language:
sum = 3 + 2;
Lexeme    Token type
sum       Identifier
=         Assignment operator
3         Number
+         Addition operator
2         Number
;         End of statement
Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error. Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.
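The following is a rough, hand-crafted scanner in C (not one generated by lex) for statements like the one above; the token names printed are illustrative only.

    #include <ctype.h>
    #include <stdio.h>

    /* Scan a statement of the form: identifier = number + number ;
     * printing one token per lexeme found. */
    void scan(const char *p) {
        while (*p) {
            if (isspace((unsigned char)*p)) { p++; }
            else if (isalpha((unsigned char)*p)) {          /* identifier */
                printf("IDENTIFIER: ");
                while (isalnum((unsigned char)*p)) putchar(*p++);
                putchar('\n');
            } else if (isdigit((unsigned char)*p)) {        /* number */
                printf("NUMBER: ");
                while (isdigit((unsigned char)*p)) putchar(*p++);
                putchar('\n');
            } else if (*p == '=') { printf("ASSIGNMENT OPERATOR: =\n"); p++; }
            else if (*p == '+') { printf("ADDITION OPERATOR: +\n"); p++; }
            else if (*p == ';') { printf("END OF STATEMENT: ;\n"); p++; }
            else { printf("ERROR: invalid character '%c'\n", *p); p++; }
        }
    }

    int main(void) {
        scan("sum = 3 + 2;");    /* prints the six tokens of the table above */
        return 0;
    }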
Tokenizer
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input. Take, for example, the following string.
The quick brown fox jumps over the lazy dog
Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters. A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to represent tokenized input:
<sentence> <word>the</word> <word>quick</word> <word>brown</word> <word>fox</word> <word>jumps</word> <word>over</word> <word>the</word> <word>lazy</word> <word>dog</word> </sentence>
A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser.
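As a sketch of the same idea in C, the standard strtok function can demarcate the word tokens of the sentence above (a mutable buffer is required because strtok writes into the string):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char text[] = "The quick brown fox jumps over the lazy dog";
        int count = 0;

        /* strtok splits the buffer at spaces, yielding one word token per call */
        for (char *word = strtok(text, " "); word != NULL; word = strtok(NULL, " ")) {
            printf("<word>%s</word>\n", word);
            count++;
        }
        printf("%d word tokens\n", count);    /* prints 9 */
        return 0;
    }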
SYNTAX ANALYSIS
Parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given (more or less) formal grammar. Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. The term parsing comes from Latin pars meaning part.
Parser
A parser is one of the components in an interpreter or compiler which checks for correct syntax and builds a data structure (often some kind of parse tree, abstract syntax tree or other hierarchical structure) implicit in the input tokens. The parser often uses a separate lexical analyser to create tokens from the sequence of input characters. Parsers may be programmed by hand or may be (semi-)automatically generated (in some programming languages) by a tool (such as Yacc) from a grammar.
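A small hand-written recursive-descent sketch in C, for the toy grammar expr -> number { '+' number }, showing a parser checking syntax while consuming the input; it is meant only to illustrate the idea, not how a Yacc-generated parser works internally.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Grammar:  expr -> number { '+' number }
     * For brevity the "tokens" here are read directly from the input string. */
    static const char *p;

    static int number(void) {
        if (!isdigit((unsigned char)*p)) {
            fprintf(stderr, "syntax error: number expected at '%s'\n", p);
            exit(1);
        }
        int value = 0;
        while (isdigit((unsigned char)*p))
            value = value * 10 + (*p++ - '0');
        return value;
    }

    static int expr(void) {
        int value = number();
        while (*p == '+') {        /* each '+' must be followed by another number */
            p++;
            value += number();
        }
        return value;
    }

    int main(void) {
        p = "3+2+10";
        printf("%d\n", expr());    /* prints 15; an input such as "3++" is rejected */
        return 0;
    }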
Grammar:
    S -> Ax
    A -> a
    A -> b

Parse tree for the input "ax":
      S
     / \
    A   x
    |
    a
Bottom up example
A bottom-up parser tries to go backwards, performing the reverse derivation sequence ax => Ax => S. Intuitively, a top-down parser tries to expand nonterminals into right-hand sides and a bottom-up parser tries to replace (reduce) right-hand sides with nonterminals. The first action of the bottom-up parser would be to replace a with A, yielding Ax. Then it would replace Ax with S. Once it arrives at a sentential form with exactly S, it has reached the goal and stops, indicating success. Just as with top-down parsing, a brute-force approach will work. Try every replacement until you run out of right-hand sides to
replace or you reach a sentential form consisting of exactly S. While not obvious here, not every replacement is valid and this approach may try all the invalid ones before attempting the correct reduction. Backtracking is extremely inefficient, but as you would expect lookahead proves useful in reducing the number of "wrong turns."
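A toy C sketch of the reduction sequence just described, assuming the grammar above (S -> Ax, A -> a); it hard-codes the two reductions rather than implementing a general shift-reduce parser.

    #include <stdio.h>
    #include <string.h>

    /* Repeatedly replace a right-hand side with its nonterminal
     * until the sentential form is exactly S. */
    int main(void) {
        char form[16] = "ax";                    /* starting sentential form */
        printf("%s\n", form);                    /* ax */

        if (form[0] == 'a') form[0] = 'A';       /* reduce a  -> A, giving Ax */
        printf("%s\n", form);                    /* Ax */

        if (strcmp(form, "Ax") == 0) strcpy(form, "S");   /* reduce Ax -> S */
        printf("%s\n", form);                    /* S: goal reached, success */
        return 0;
    }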
The term three-address code is still used even if some instructions use more or fewer than two operands. The key features of three-address code are that every instruction implements exactly one fundamental operation, and that the source and destination may refer to any available register. A refinement of three-address code is static single assignment form (SSA).
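As an illustration, here is a C statement rewritten so that each line performs exactly one fundamental operation, which is the shape a three-address intermediate form takes; t1 and t2 stand for compiler-introduced temporaries, and the function is only a vehicle for the example.

    /* The statement  a = b * c + d;  expressed three-address style. */
    int three_address_shape(int b, int c, int d) {
        int t1 = b * c;     /* t1 = b * c : one operation, two sources, one destination */
        int t2 = t1 + d;    /* t2 = t1 + d */
        int a  = t2;        /* a  = t2     */
        return a;
    }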
CODE OPTIMIZATION
Although the word "optimization" shares the same root as "optimal," it is rare for the process of optimization to produce a truly optimal system. The optimized system will typically only be optimal in one application or for one audience. One might reduce the amount of time that a program takes to perform some task at the price of making it consume more memory. In an application where memory space is at a premium, one might deliberately choose a slower algorithm in order to use less memory. Often there is no "one size fits all" design which works well in all cases, so engineers make trade-offs to optimize the attributes of greatest interest. Additionally, the effort required to make a piece of software completely optimal (incapable of any further improvement) is almost always more than is reasonable for the benefits that would be accrued; so the process of optimization may be halted before a completely optimal solution has been reached. Fortunately, it is often the case that the greatest improvements come early in the process.
Levels of optimization
Optimization can occur at a number of "levels":
Design level
At the highest level, the design may be optimized to make best use of the available resources. The implementation of this design will benefit from a good choice of efficient algorithms and the implementation of these algorithms will benefit from writing good quality code. The architectural design of a system overwhelmingly
affects its performance. The choice of algorithm affects efficiency more than any other item of the design and, since the choice of algorithm usually is the first thing that must be decided, arguments against early or "premature" optimization may be hard to justify. In some cases, however, optimization relies on using more elaborate algorithms, making use of 'special cases' and special 'tricks' and performing complex trade-offs. A 'fully optimized' program might be more difficult to comprehend and hence may contain more faults than unoptimized versions (although it is doubtful that this has ever been proven to be the case, and the claim therefore remains anecdotal but is nevertheless frequently cited).
Avoiding poor quality coding can also improve performance, by avoiding obvious 'slowdowns'. After that, however, some optimizations are possible that actually decrease maintainability. Some, but not all, optimizations can nowadays be performed by optimizing compilers.
Compile level
Use of an optimizing compiler tends to ensure that the executable program is optimized at least as much as the compiler can predict.
Assembly level
At the lowest level, writing code using an assembly language designed for a particular hardware platform will normally produce the most efficient code, since the programmer can take advantage of the full repertoire of machine instructions. The operating systems of most machines have traditionally been written in assembler code for this reason. With more modern optimizing compilers and the greater complexity of recent CPUs, it is more difficult to write code that is optimized better than what the compiler itself generates, and few projects need resort to this 'ultimate' optimization step. However, a large amount of code written today is still compiled with the intent to run on the greatest percentage of machines
possible. As a consequence, programmers and compilers don't always take advantage of the more efficient instructions provided by newer CPUs or quirks of older models. Additionally, assembly code tuned for a particular processor without using such instructions might still be suboptimal on a different processor, expecting a different tuning of the code.
Run time
Just-in-time compilers and assembly programmers may be able to perform run-time optimization exceeding the capability of static compilers by dynamically adjusting parameters according to the actual input or other factors.
Platform-dependent and platform-independent optimizations
Code optimization can also be broadly categorized into platform-dependent and platform-independent techniques. While the latter are effective on most or all platforms, platform-dependent techniques use specific properties of one platform, or rely on parameters depending on the single platform or even on the single processor. It may therefore be necessary to write or produce different versions of the same code for different processors. For instance, in the case of compile-level optimization, platform-independent techniques are generic techniques (such as loop unrolling, reduction in function calls, memory-efficient routines, reduction in conditions, etc.) that impact most CPU architectures in a similar way; a loop-unrolling sketch appears below. Generally, these serve to reduce the total instruction path length required to complete the program and/or reduce total memory usage during the process. On the other hand, platform-dependent techniques involve instruction scheduling, instruction-level parallelism, data-level parallelism and cache optimization techniques (i.e. parameters that differ among various platforms), and the optimal instruction scheduling might be different even on different processors of the same architecture.
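As a sketch of one of the platform-independent techniques named above, the loop below is unrolled by a factor of four; for simplicity it assumes the array length is a multiple of four, which a real implementation would handle with a cleanup loop.

    /* Original loop: one addition plus one loop test per element. */
    int sum_simple(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* Unrolled by 4: fewer branches and loop-counter updates per element
     * processed. Assumes n is a multiple of 4. */
    int sum_unrolled(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i += 4)
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        return sum;
    }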
Different algorithms
Computational tasks can be performed in several different ways with varying efficiency. For example, consider the following C code snippet whose intention is to obtain the sum of all integers from 1 to N:

    int i, sum = 0;
    for (i = 1; i <= N; i++)
        sum += i;
    printf("sum: %d\n", sum);

This code can (assuming no arithmetic overflow) be rewritten using a mathematical formula like:

    int sum = (N * (N + 1)) >> 1;   /* >> 1 is a bit shift right by 1, which is
                                       equivalent to dividing by 2 when N is
                                       non-negative */
    printf("sum: %d\n", sum);

The optimization, sometimes performed automatically by an optimizing compiler, is to select a method (algorithm) that is more computationally efficient, while retaining the same functionality. See Algorithmic efficiency for a discussion of some of these techniques. However, a significant improvement in performance can often be achieved by removing extraneous functionality.
CODE GENERATION
A compiler's code generator converts some internal representation of source code into a form (e.g., machine code) that can be readily executed by a machine (often a computer). Sophisticated compilers typically perform multiple passes over various intermediate forms. This multi-stage process is used because many algorithms for code optimization are easier to apply one at a time, or because the input to one optimization relies on the processing performed by another optimization. This organization also facilitates the creation of a single compiler that can target multiple architectures, as only the last of the code generation stages (the backend) needs to change from target to target. (For more information on compiler design, see Compiler.) The input to the code generator typically consists of a parse tree or an abstract syntax tree. The tree is converted into a linear sequence of instructions, usually in an intermediate language such as three-address code. Further stages of compilation may or may not be referred to as "code generation", depending on whether they involve a significant change in the representation of the program. (For example, a peephole optimization pass would not likely be called "code generation", although a code generator might incorporate a peephole optimization pass.)
Instruction selection is typically carried out by doing a recursive postorder traversal of the abstract syntax tree, matching particular tree configurations against templates; for example, the tree W := ADD(X, MUL(Y, Z)) might be transformed into a linear sequence of instructions by recursively generating the sequences for t1 := X and t2 := MUL(Y, Z), and then emitting the instruction ADD W, t1, t2. In a compiler that uses an intermediate language, there may be two instruction selection stages: one to convert the parse tree into intermediate code, and a second phase much later to convert the intermediate code into instructions in the ISA of the target machine. This second phase does not require a tree traversal; it can be done linearly, and typically involves a simple replacement of intermediate-language operations with their corresponding opcodes. However, if the compiler is actually a language translator (for example, one that converts Eiffel to C),
then the second code-generation phase may involve building a tree from the linear intermediate code.
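A rough C sketch of the recursive postorder instruction-selection idea described above: code for the subtrees is generated first, the temporaries holding their results are collected, and then one instruction is emitted for the current node. The Node structure and the textual "instructions" printed are assumptions made for the example, not any particular compiler's representation.

    #include <stdio.h>

    /* A tiny expression-tree node: a leaf holds a variable name,
     * an interior node holds an operator ("ADD", "MUL") and two children. */
    typedef struct Node {
        const char *op;              /* NULL for a leaf */
        const char *name;            /* variable name, used only by leaves */
        struct Node *left, *right;
    } Node;

    static int temp_count = 0;

    /* Postorder traversal: generate the children first, then emit one
     * instruction for this node; returns the temporary holding the result. */
    static int gen(const Node *n) {
        if (n->op == NULL) {                         /* leaf: load the variable */
            int t = ++temp_count;
            printf("t%d := %s\n", t, n->name);
            return t;
        }
        int t1 = gen(n->left);
        int t2 = gen(n->right);
        int t = ++temp_count;
        printf("t%d := %s t%d, t%d\n", t, n->op, t1, t2);
        return t;
    }

    int main(void) {
        /* W := ADD(X, MUL(Y, Z)), as in the example above */
        Node x = {NULL, "X", NULL, NULL}, y = {NULL, "Y", NULL, NULL}, z = {NULL, "Z", NULL, NULL};
        Node mul = {"MUL", NULL, &y, &z};
        Node add = {"ADD", NULL, &x, &mul};
        printf("W := t%d\n", gen(&add));
        return 0;
    }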
BOOKKEEPING
A compiler needs to collect information about all data objects that appear in the source program; for example, a compiler needs to know whether a variable holds an integer or a real number, what size an array has, how many arguments a function expects, and so forth. The information about data objects is collected by the initial phases of the compiler, lexical and syntactic analysis, and entered into the symbol table. For example, when a lexical analyzer sees an identifier SUM, say, it may enter the name SUM into the symbol table if it is not already there, and produce as output a token whose value component is an index to this entry in the symbol table.
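A minimal sketch in C of the kind of symbol-table entry and lookup-or-insert routine described above; the field names, fixed-size table and linear search are illustrative simplifications (a real compiler would typically hash).

    #include <stdio.h>
    #include <string.h>

    /* One entry per name: the lexeme plus attributes collected by the
     * early phases (here just a type tag and a size). */
    struct symbol {
        char name[32];
        char type[16];     /* e.g. "int", "real", "array" */
        int  size;
    };

    static struct symbol table[256];
    static int n_symbols = 0;

    /* Return the index of name, inserting it if it is not already present --
     * what the lexical analyzer does when it sees an identifier. */
    int lookup_or_insert(const char *name) {
        for (int i = 0; i < n_symbols; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;
        strncpy(table[n_symbols].name, name, sizeof table[n_symbols].name - 1);
        return n_symbols++;
    }

    int main(void) {
        int first  = lookup_or_insert("SUM");   /* first occurrence: inserted */
        int second = lookup_or_insert("SUM");   /* second occurrence: found   */
        printf("token value component: %d and %d\n", first, second);   /* same index */
        return 0;
    }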
ERROR HANDLING
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors can be encountered by virtually all phases of a compiler. During compilation, a compiler will find errors such as lexical, syntax, semantic, and logical errors. Exception handling is a programming language construct or computer hardware mechanism designed to handle the occurrence of exceptions, special conditions that change the normal flow of program execution. Programming languages differ considerably in their support for exception handling as distinct from error checking. In some programming languages there are functions which cannot be safely called on invalid input data ... or functions which return values which cannot be distinguished from exceptions. For example, in C the atoi (ASCII to integer conversion) function may return 0 (zero) for any input that cannot be parsed into a valid value (a sketch contrasting this with explicit error checking appears at the end of this section). In such languages the programmer must either perform error checking (possibly through some auxiliary global variable such as C's errno) or input validation (perhaps using regular expressions). The degree to which such explicit validation and error checking is necessary is in contrast to the exception handling support provided by any given programming environment. Hardware exception
handling differs somewhat from the support provided by software tools, but similar concepts and terminology are prevalent. In general, an exception is handled (resolved) by saving the current state of execution in a predefined place and switching the execution to a specific subroutine known as an exception handler. Depending on the situation, the handler may later resume the execution at the original location using the saved information. For example, a page fault will usually allow the program to be resumed, while a division by zero might not be resolvable transparently. From the processing point of view, hardware interrupts are similar to resumable exceptions, though they are typically unrelated to the user's program flow. From the point of view of the author of a routine, raising an exception is a useful way to signal that a routine could not execute normally, for example when an input argument is invalid (e.g. a zero denominator in division) or when a resource it relies on is unavailable (like a missing file or a hard disk error). In systems without exceptions, routines would need to return some special error code; however, this is sometimes complicated by the semipredicate problem, in which users of the routine need to write extra code to distinguish normal return values from erroneous ones. In runtime engine environments such as Java or .NET, there exist tools that attach to the runtime engine and, every time that an exception of interest occurs, record debugging information that existed in memory at the time the exception was thrown (call stack and heap values). These tools are called automated exception handling or error interception tools and provide 'root-cause' information for exceptions. Contemporary applications face many design challenges when considering exception handling strategies. Particularly in modern enterprise-level applications, exceptions must often cross process boundaries and machine boundaries. Part of designing a solid exception handling strategy is recognizing when a process has failed to the point where it cannot be economically handled by the software portion of the process.
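To illustrate the atoi point made earlier in this section, here is a short C sketch contrasting atoi, whose return value of 0 cannot be distinguished from a failed parse, with strtol, whose end pointer and errno make the failure explicit.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const char *input = "abc";                /* not a valid integer */

        /* atoi: returns 0 here, indistinguishable from a legitimate "0" input */
        printf("atoi: %d\n", atoi(input));

        /* strtol: the end pointer and errno make the failure explicit */
        char *end;
        errno = 0;
        long value = strtol(input, &end, 10);
        if (end == input)
            printf("strtol: no digits could be parsed\n");
        else if (errno == ERANGE)
            printf("strtol: value out of range\n");
        else
            printf("strtol: %ld\n", value);
        return 0;
    }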