Genetic Algorithm Report
SEMINAR REPORT
ON
GENETIC ALGORITHM
Abstract
Genetic algorithms provide heuristic solutions to combinatorial-optimization problems and have been applied successfully in many areas. A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems; it searches very large spaces by modelling the role of genetic material in living organisms. Genetic algorithms are categorized as global search heuristics and are part of evolutionary computing, a rapidly growing area of artificial intelligence. They use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover. A small population of individual exemplars can effectively search a large space because the individuals contain schemata, useful substructures that can potentially be combined to make fitter individuals. Formal studies of competing schemata show that the best policy for replicating them is to increase them exponentially according to their relative fitness, which turns out to be the policy used by genetic algorithms. Fitness is determined by examining a large number of individual fitness cases. This process can be very efficient if the fitness cases also evolve by their own GAs.
1. Introduction
1.1 A Biology Lesson
Every organism has a set of rules, a blueprint so to speak, describing how that organism is built up from the tiny building blocks of life. These rules are encoded in the genes of an organism, which in turn are connected together into long strings called chromosomes. Each gene represents a specific trait of the organism, like eye colour or hair colour, and has several different settings. For example, the settings for a hair colour gene may be blonde, black or auburn. These genes and their settings are usually referred to as an organism's genotype. The physical expression of the genotype - the organism itself - is called the phenotype. When two organisms mate they share their genes. The resultant offspring may end up having half the genes from one parent and half from the other. This process is called recombination. Very occasionally a gene may be mutated. Normally this mutated gene will not affect the development of the phenotype but very occasionally it will be expressed in the organism as a completely new trait.
Genetic Algorithms draw on the evolutionary ideas of natural selection and genetics. The basic concept of Genetic Algorithms is to simulate the processes in natural systems that are necessary for evolution, specifically those that follow the principle of survival of the fittest first laid down by Charles Darwin. As such they represent an intelligent exploitation of a random search within a defined search space to solve a problem. First pioneered by John Holland in the 1960s, Genetic Algorithms have been widely studied, experimented with and applied in many fields of engineering. Not only do Genetic Algorithms provide an alternative method of solving problems, they have also been reported to outperform traditional methods on many of them. Many real-world problems involve finding optimal parameters, which may prove difficult for traditional methods but well suited to Genetic Algorithms. However, because of their strong performance in optimisation, Genetic Algorithms have often been regarded as nothing more than function optimisers. In fact, there are many ways to view genetic algorithms. Perhaps most users come to Genetic Algorithms looking for a problem solver, but this is a restrictive view.
Fig 1.1 The graph represents a search space; the goal is to travel from the gray cell (Start) to the green cell in the smallest number of steps.
2. Genetic Algorithm
1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem).
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
3. [New population] Create a new population by repeating the following steps until the new population is complete.
[Selection] Select two parent chromosomes from the population according to their fitness (the better the fitness, the bigger the chance of being selected). There are several methods for selecting the chromosomes.
[Crossover] In genetic algorithms, crossover is a genetic operator used to vary the programming of a chromosome or chromosomes from one generation to the next. It is analogous to reproduction and biological crossover, upon which genetic algorithms are based. Crossover is the process of taking more than one parent solution and producing a child solution from them. With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring is an exact copy of the parents.
Fig 2.1
Figure 2.1 shows the crossover between parent 1 and parent 2. As we can see, the children take one section of the chromosome from each parent. The point at which the chromosome is broken depends on the randomly selected crossover point. This particular method is called single point crossover because only one crossover point exists.
[Mutation] In genetic algorithms, mutation is a genetic operator used to maintain genetic diversity from one generation of a population of chromosomes to the next. It is analogous to biological mutation. Mutation alters one or more gene values in a chromosome from its initial state. Mutation may change a solution entirely from its previous state, so a Genetic Algorithm can arrive at a better solution by using it. With a mutation probability, mutate the new offspring.
Fig 2.2
After selection and crossover, you now have a new population full of individuals. Some are directly copied, and others are produced by crossover. In order to ensure that the individuals are not all exactly the same, you allow for a small chance of mutation.
[Accepting] Place the new offspring in the new population.
4. [Replace] Use the newly generated population for a further run of the algorithm.
5. [Test] If the end condition is satisfied, stop, and return the best solution in the current population.
6. [Loop] Go to step 2.
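The outline above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: chromosomes are assumed to be fixed-length bit strings, selection is assumed to be fitness-proportionate, the end condition is assumed to be a fixed number of generations, and the function name `run_ga`, its parameters and their default values are all hypothetical.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, crossover_rate=0.7,
           mutation_rate=0.01, generations=50, rng=None):
    """Follow steps 1-6 of the outline on bit-string chromosomes (n_genes >= 2)."""
    rng = rng or random.Random()
    # [Start] random population of pop_size chromosomes
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # [Test]/[Loop] assumed end condition: a fixed number of generations
    for _ in range(generations):
        # [Fitness] evaluate every chromosome; fitness must be non-negative.
        # The small constant keeps the weights valid when all scores are 0.
        weights = [fitness(c) + 1e-9 for c in population]
        new_population = []
        while len(new_population) < pop_size:
            # [Selection] fitter chromosomes are more likely to be chosen
            p1, p2 = rng.choices(population, weights=weights, k=2)
            # [Crossover] single point, applied with probability crossover_rate
            if rng.random() < crossover_rate:
                point = rng.randint(1, n_genes - 1)
                child = p1[:point] + p2[point:]
            else:
                child = p1[:]  # no crossover: exact copy of a parent
            # [Mutation] flip each gene with probability mutation_rate
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            # [Accepting] place the offspring in the new population
            new_population.append(child)
        # [Replace] the new population becomes the current one
        population = new_population
    return max(population, key=fitness)

# Toy fitness (an assumption for illustration): the number of 1 bits.
best = run_ga(sum, n_genes=16, rng=random.Random(0))
print(best, sum(best))
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing different crossover and mutation probabilities.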
evaluation are the keys to success in Genetic Algorithm applications. The appeal of Genetic Algorithms comes from their simplicity and elegance as robust search algorithms, as well as from their power to discover good solutions rapidly for difficult high-dimensional problems. They are useful when the search space is large, complex or poorly understood; when domain knowledge is scarce or expert knowledge is difficult to encode to narrow the search space; when no mathematical analysis is available; or when traditional search methods fail. Genetic Algorithms have been used for problem-solving and for modeling. They are applied to many scientific and engineering problems, and in business and entertainment, including the traveling salesman problem.
4. Applications
4.1. Automotive Design
Using Genetic Algorithms to both design composite materials and aerodynamic shapes for race cars and regular means of transportation (including aviation) can return combinations of best materials and best engineering to provide faster, lighter, more fuel efficient and safer vehicles for all the things we use vehicles for.
Rather than spending years in laboratories working with polymers, wind tunnels and balsa wood shapes, the processes can be done much more quickly and efficiently by computer modeling, using Genetic Algorithm searches to return a range of options that human designers can then put together however they please.
Getting the most out of a range of materials to optimize the structural and operational design of buildings, factories, machines, etc. is a rapidly expanding application of Genetic Algorithms. These are being created for such uses as optimizing the design of heat exchangers, robot gripping arms, satellite booms, building trusses, flywheels, turbines, and just about any other computer-assisted engineering design application.
There is work to combine Genetic Algorithms optimizing particular aspects of engineering problems to work together, and some of these can not only solve design problems, but also project them forward to analyze weaknesses and possible point failures in the future so these can be avoided.
4.3 Robotics
Robotics involves human designers and engineers trying out all sorts of things in order to create useful machines that can do work for humans. Each robot's design is dependent on the job or jobs it is intended to do, so there are many different designs out there.
Genetic Algorithms can be programmed to search for a range of optimal designs and components for each specific use, or to return results for entirely new types of robots that can perform multiple tasks and have more general application.
Robotics designed by Genetic Algorithms just might get us those nifty multi-purpose, learning robots we've been expecting any year now since we watched the Jetsons as kids, who will cook our meals, do our laundry and even clean the bathroom for us!
Do you find yourself frustrated by slow LAN performance, inconsistent internet access, a FAX machine that only sends faxes sometimes, or the number of 'ghost' phone calls on your land line every month? Well, Genetic Algorithms are being developed that will allow for dynamic and anticipatory routing of circuits for telecommunications networks. These could take notice of your system's instability and anticipate your re-routing needs. With more than one Genetic Algorithm circuit search running at a time, soon your interpersonal communications problems may really be all in your head rather than in your telecommunications system. Other Genetic Algorithms are being developed to optimize the placement and routing of cell towers for best coverage and ease of switching, so your cell phone and BlackBerry will be thankful for Genetic Algorithms too.
New applications of Genetic Algorithms to the "Traveling Salesman Problem" (TSP) can be used to plan the most efficient routes and schedules for travel planners, traffic routers and even shipping companies: the shortest routes for traveling, the timing to avoid traffic tie-ups and rush hours, and the most efficient use of transport for shipping, even including pickup loads and deliveries along the way. The program can model all this in the background while the human agents do other things, improving productivity as well! Chances are increasing steadily that when you get that trip plan packet from the travel agency, a Genetic Algorithm contributed more to it than the agent did.
is critical to the well-being of patients. Analysis of gene expression data leads to cancer identification and classification, which will facilitate proper treatment selection and drug development. Gene expression data sets for ovarian, prostate, and lung cancer were analyzed in this research. An integrated gene-search algorithm for genetic expression data analysis was proposed. This integrated algorithm involves a genetic algorithm and correlation-based heuristics for data preprocessing (on partitioned data sets) and data mining (decision tree and support vector machine algorithms) for making predictions. Knowledge derived by the proposed algorithm has high classification accuracy with the ability to identify the most significant genes.
Cancer develops mainly in epithelial cells, connecting/muscle tissue (sarcomas), and white blood cells. It results from successive mutations in a normal cell that damage the DNA and impair the cell replication mechanism. There are a number of carcinogens, such as tobacco smoke, radiation, certain microbes, synthetic chemicals, polluted water, and air, that may accelerate the mutations. Thus, there is a need to identify the mutated genes that contribute to a cancerous state. One of the methods for cancer identification is through the analysis of genetic data. The human genome contains approximately 10 million single nucleotide polymorphisms. These single nucleotide polymorphisms are responsible for the variation that exists between human beings. Due to the high cost, genetic data (containing as many as 15,000 genes per patient) is normally collected on a limited number of patients (100 to 300 patients). There is a need to select the most informative genes from such wide data sets. Removal of uninformative genes decreases noise, confusion, and complexity, and increases the chances for identification of the most important genes, classification of diseases, and prediction of various outcomes, e.g., cancer type.
A genetic algorithm is a search algorithm based on the concept of natural genetics. A genetic algorithm is initiated with a set of solutions (chromosomes) called the population. Each solution in the population is evaluated based on its fitness. Solutions chosen to form new chromosomes (offspring) are selected according to their fitness, i.e., the more suitable the solution, the higher the likelihood it will reproduce. This is repeated until some condition (for example, the number of populations or the quality of the best solution) is satisfied. A genetic algorithm searches the solution space without following crisp constraints and takes into account potentially all feasible solution regions. This provides a chance of searching previously unexplored regions, and there is a high possibility of achieving an overall optimal or near-optimal solution, making the genetic algorithm a global search algorithm.
6.2 Integrated algorithm
The integrated gene-search algorithm consists of two phases. The iterative Phase I includes data partitioning, application of the Decision Tree algorithm (or other data-mining algorithms) to the partitioned data set, the genetic algorithm, and the correlation-based heuristics for gene reduction. The set of significant genes is utilized in Phase II for validation of the quality of the genes. A data-mining (i.e., classification) algorithm takes a training expression data set as input and predicts whether a test sample is normal or cancerous. Thus, data-mining algorithms are applied to the training and testing data sets and their results are evaluated to determine the most significant gene set.
In Phase I, the cancer training gene data set is initially partitioned into several subsets with approximately 1000 genes in each subset (Fig. 6.1). The partitioning of the data sets can be performed arbitrarily or randomly. The Decision Tree algorithm is applied to each partitioned data set to determine the classification accuracy. The genes selected from all the partitioned data sets (most significant as well as medium significant genes) overestimate the actual significant gene set. They are merged to form a single gene set (Fig. 6.1). If the current gene set is larger than the user-defined threshold (e.g., 1000 genes), then the gene set is repartitioned to form the next iteration of the data-mining and GA-CFS (Genetic Algorithm-Correlation Based Feature Selection) algorithms. Phase I is repeated until the number of significant genes is less than the threshold. To further reduce the number of genes, the GA-CFS algorithm can be re-applied to the reduced gene data sets.
In Phase II, data-mining algorithms such as the Decision Tree and Support Vector Machine algorithms are then applied to the training data set for only the significant genes (Fig. 6.1). The classification accuracy obtained from this reduced gene data set is not smaller than the maximum classification accuracy from the previous partitioned data sets. This step validates that the proposed gene selection algorithm preserves the information/knowledge.
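The Phase I iteration described above (partition, select within each subset, merge, repeat while the merged set exceeds the threshold) can be sketched as follows. This is only an outline of the control flow under stated assumptions: `phase_one` and `select_genes` are hypothetical names, and `select_genes` is a stand-in for the data-mining and GA-CFS step, whose internals are not shown here.

```python
import random

def phase_one(genes, select_genes, partition_size=1000, threshold=1000, rng=None):
    """Repeat partition -> per-subset selection -> merge while above threshold."""
    rng = rng or random.Random()
    current = list(genes)
    while len(current) > threshold:
        rng.shuffle(current)  # random partitioning of the current gene set
        subsets = [current[i:i + partition_size]
                   for i in range(0, len(current), partition_size)]
        merged = []
        for subset in subsets:
            # Stand-in for data mining + GA-CFS on one partition (assumption)
            merged.extend(select_genes(subset))
        if len(merged) >= len(current):
            break  # the selector no longer reduces the set; avoid looping forever
        current = merged
    return current

# Toy selector (purely illustrative): keep the first half of each subset.
significant = phase_one(range(15000), lambda subset: subset[:len(subset) // 2])
print(len(significant))
```

The early exit when a pass fails to shrink the gene set is a design choice added for safety in this sketch; the text itself only states the threshold condition.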
Fig. 6.1 Flowchart of the integrated gene-search algorithm. Phase I: the complete data set for cancer is partitioned into data sets of 1000 genes (00001 to 01000, 01001 to 02000, and so on); data mining and GA-CFS are applied to each data set; if more than 1000 genes remain (YES) the step is repeated, otherwise (NO) data mining is applied to the most significant genes.
6.3 Conclusion
The integrated gene-search algorithm (Genetic Algorithm-Correlation Based Feature Selection algorithm with data mining) was proposed and successfully applied to the training and test genetic expression data sets of ovarian, prostate, and lung cancers. This uniformly applicable algorithm not only provided high classification accuracy but also determined a set of the most significant genes for each of the three cancers. These gene sets require further investigation for their medical relevance, as the prediction power attained from these gene sets is statistically equivalent to that reported in the literature. The integrated gene-search algorithm is capable of identifying significant genes by partitioning the data set with a correlation-based heuristic. The overestimate of the actual significant gene set using this algorithm allows the investigation of potentially useful genes or their combinations. This leads to multiple models and supports the underlying hypothesis that genetic expression data sets can be used in diagnosis of various cancers.
5.2 How They Are Used
There are a variety of problems that can be solved with genetic algorithms. Genetic Algorithms are particularly well suited to optimization problems. K-SATISFIABILITY problems, for example, can be solved with a genetic algorithm (though other means exist). For anyone not familiar with K-SATISFIABILITY problems, I'll give a short explanation. SATISFIABILITY (or satisfaction) problems attempt to assign values to the variables of a boolean formula in such a way that it evaluates to true. So if my SATISFIABILITY problem consists of two variables, A and B, and one clause, A OR B, then one solution would be A = true, B = true. Clauses are the components of the boolean formula; in the example I gave, the formula consists of only one clause. A larger SATISFIABILITY problem may consist of hundreds of variables and thousands of clauses and cannot be solved on paper in a reasonable amount of time. Here is an example of a larger SAT problem:
(A OR B OR C) AND (A OR !B OR !C) AND (!A OR B OR !C)
This formula consists of three variables (A, B, C) and three clauses. A solution to this problem would be A = true, B = true, C = false. Notice that there are many different assignments of these variables that satisfy the formula. If there were more clauses this might not be the case.
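For a formula this small, the claim can be checked by simply trying every assignment. The snippet below is a brute-force check, not a genetic algorithm; the function name `formula` is chosen here for illustration.

```python
from itertools import product

def formula(a, b, c):
    """(A OR B OR C) AND (A OR !B OR !C) AND (!A OR B OR !C)"""
    return (a or b or c) and (a or not b or not c) and (not a or b or not c)

# Try all 2**3 assignments and keep the ones that satisfy the formula.
satisfying = [values for values in product([True, False], repeat=3)
              if formula(*values)]
print(len(satisfying), "of 8 assignments satisfy the formula")
```

Each clause rules out exactly one of the eight assignments, so five assignments remain, including A = true, B = true, C = false. Brute force stops being practical quickly, since the number of assignments doubles with every added variable.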
To solve a SATISFIABILITY problem with a genetic algorithm you start off with a population of randomly generated solutions, each solution consisting of a random assignment of true or false to each variable. This population is generation zero. In this context the fitness function is defined as the number of satisfied (or unsatisfied) clauses in the boolean formula.
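A possible encoding of generation zero and of the fitness function is sketched below. The clause representation (lists of variable index and required value), the constant `CLAUSES` and the function names are assumptions made for illustration; fitness here counts satisfied clauses.

```python
import random

# Each clause is a list of (variable index, required value) literals,
# with A = 0, B = 1, C = 2. This encoding is an assumption of the sketch.
CLAUSES = [
    [(0, True), (1, True), (2, True)],    # A OR B OR C
    [(0, True), (1, False), (2, False)],  # A OR !B OR !C
    [(0, False), (1, True), (2, False)],  # !A OR B OR !C
]

def fitness(assignment, clauses=CLAUSES):
    """Number of clauses with at least one literal matched by the assignment."""
    return sum(any(assignment[i] == wanted for i, wanted in clause)
               for clause in clauses)

def random_population(size, n_vars, rng=None):
    """Generation zero: random true/false assignments for every variable."""
    rng = rng or random.Random()
    return [[rng.choice([True, False]) for _ in range(n_vars)]
            for _ in range(size)]

generation_zero = random_population(size=6, n_vars=3, rng=random.Random(1))
for individual in generation_zero:
    print(individual, fitness(individual))
```

An individual whose fitness equals the number of clauses satisfies the whole formula.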
Using the fitness function, a fitness value is determined for each individual in generation zero. It might be the case that one of these individuals satisfies the formula, in which case you're done. Otherwise, in order to get from generation zero to generation one, we must choose a portion of the population to reproduce, for example, those having a fitness above the average.
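The "fitness above the average" rule can be written down directly. The function name `select_above_average` and the fallback for a population with uniform fitness are assumptions of this sketch.

```python
def select_above_average(population, fitness):
    """Keep the individuals whose fitness is strictly above the population mean."""
    scores = [fitness(individual) for individual in population]
    average = sum(scores) / len(scores)
    chosen = [individual for individual, score in zip(population, scores)
              if score > average]
    # If every individual scores the same, nobody is above average;
    # fall back to the whole population so reproduction can continue.
    return chosen or list(population)

# Toy example with bit lists, using the number of 1s as the fitness.
parents = select_above_average([[1, 1, 1], [0, 0, 0], [1, 0, 0]], sum)
print(parents)
```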
Once we've made our selection we perform the crossover by producing a new individual with a portion of its assignments coming from each parent (the size of the portion may be determined randomly). For example, for individuals X and Y with X(A,B,C) = {True, False, False} and Y(A,B,C) = {False, True, True}, a possible child would be Child(A,B,C) = {False, False, False}.
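The child in this example is what a single-point crossover produces when A is taken from Y and the remaining assignments from X. A short sketch, with the function name `single_point_crossover` chosen for illustration:

```python
import random

def single_point_crossover(first, second, point=None, rng=None):
    """Take first[:point] from one parent and second[point:] from the other."""
    rng = rng or random.Random()
    if point is None:
        # Choose the split at random so each parent contributes at least one value
        point = rng.randint(1, len(first) - 1)
    return first[:point] + second[point:]

X = [True, False, False]
Y = [False, True, True]
# A comes from Y; B and C come from X.
child = single_point_crossover(Y, X, point=1)
print(child)
```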
After we've generated a new population we then randomly mutate each individual at a very low probability. At probabilities above 5%, in many cases a solution will not be found in a reasonable amount of time. A mutation takes an assignment and flips it. So for the individual X(A,B,C) = {True, False, False}, if a mutation event occurs on the variable B, it will become X(A,B,C) = {True, True, False}. Without this mutation the algorithm does not approach a solution.
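Mutation as described here is a bit flip applied with a small probability. In the sketch below the default rate of 1% and the function names `mutate` and `flip` are illustrative assumptions.

```python
import random

def mutate(individual, rate=0.01, rng=None):
    """Flip each assignment independently with probability `rate`."""
    rng = rng or random.Random()
    return [(not value) if rng.random() < rate else value
            for value in individual]

def flip(individual, index):
    """Deterministic single mutation event on one variable."""
    mutated = list(individual)
    mutated[index] = not mutated[index]
    return mutated

X = [True, False, False]
print(flip(X, 1))            # a mutation event on variable B
print(mutate(X, rate=0.05))  # random mutation at the 5% rate mentioned above
```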
At generation zero of a large problem, there is very little chance that a solution already exists in the population. After each passing generation, however, the average fitness tends to increase and it becomes more likely that an individual satisfies the formula.
5.3 Problems with Genetic Algorithms
After each generation the individuals of a population begin to approach the solution. In the context of a SATISFIABILITY problem this means they satisfy more and more clauses. There is, however, no guarantee that they will ever satisfy all of them. This is because individuals that have a fitness near the maximum may actually be very different from the solution. For example, say a SATISFIABILITY problem has the solution 000011000, where each character in the bit string represents a variable, the 0s represent false, and the 1s represent true. The string 111100111 might satisfy 90% of the clauses. If this is the case, the children produced by this individual will look similar to it and the likelihood of it being mutated into the solution is essentially zero. The following graph illustrates this problem:
Local Max Problem
From the graph you can see that there are two peaks, one reaching 100, the other 75. The higher one represents the solution to the problem, while the other is called a local maximum. A genetic algorithm may reach the peak of a local maximum and become stuck because all similar solutions have a lower fitness, while the actual solution is dissimilar to the current state.
5.4 Possible Solutions
A possible way to fix this problem would be to reset the search: generate a new set of random solutions, as the algorithm did at generation zero, and proceed from there. This is called a random-reset. Hopefully after the reset the search will approach the solution rather than a local max.
Another, similar solution would be to mutate each individual in the current population at a much higher rate, possibly 100%. This would produce a population that is very different from the one that existed at the local maximum.
These solutions would fix the problem in a case where there were only a few local maximums, but for some problems it might be the case that there are numerous local maximums. For these problems, genetic algorithms with random-reset might find solutions that have very high fitness, but never the solution.
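One way to express random-reset in code is to wrap the generational step in a loop that starts over from a fresh random population once the best fitness stops improving. The sketch below makes several assumptions that are not in the text: the stall test (`patience` generations without improvement), the function name `search_with_random_reset`, and the caller-supplied `new_population` and `evolve` functions standing in for generation zero and for one round of selection, crossover and mutation.

```python
def search_with_random_reset(new_population, evolve, fitness, target,
                             patience=20, max_generations=1000):
    """Run a GA step function, restarting from random solutions when it stalls."""
    population = new_population()
    best = None                    # best individual seen over all runs
    run_best = float("-inf")       # best fitness seen since the last reset
    stalled = 0
    for _ in range(max_generations):
        candidate = max(population, key=fitness)
        score = fitness(candidate)
        if best is None or score > fitness(best):
            best = candidate
        if score >= target:
            break                  # a full solution was found
        if score > run_best:
            run_best, stalled = score, 0
        else:
            stalled += 1
        if stalled >= patience:
            # Random-reset: discard the population stuck at a local maximum
            population = new_population()
            run_best, stalled = float("-inf"), 0
        else:
            population = evolve(population)
    return best
```

As the text notes, this only helps when local maxima are few: the function still returns the best individual it has seen, which may have very high fitness without being the solution.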