Everything you need to code a genetic algorithm
A genetic algorithm aims to solve a problem using a process analogous to biological evolution, though keep in mind it is an analogy, not a faithful model. Also, be careful not to confuse genetic algorithms with genetic programming.
The key difference is that genetic programming randomly generates actual programs to solve a problem, then breeds the ones that solve it best. With a genetic algorithm, rather than randomly generating a program, we randomly generate and breed instances of a genetic mapping.
A genetic mapping is a relationship, defined by the programmer, between a gene sequence (usually a string of 1s and 0s) and a phenotype, some process for solving the problem. For example, say we are trying to use a genetic algorithm to learn how to navigate a specific maze, given an entrance point and an exit point.
The "genes" making up our agent will be a long string of numbers, 0-4, each representing the direction the agent should take given a certain combination of walls around it (and 0 meaning stand still). The first number in the sequence will say what the agent should do if it is surrounded by walls, the next will say what it should do if it is surrounded by walls except an opening to the east, the next representing if there is an opening to the north, etc. until the agent has a reaction for every possible combination of walls.
If we represent each wall as a 1 or a 0, and we have four directions, that means we have 2^4 - 1 = 15 genes for our agent, each representing the direction it will try to go for a given combination of walls. First, we'll randomly generate a population of agents, say 100. This means we'll randomly generate 100 strings, 15 characters long, of the numbers 0-4. Our first individuals will probably look something like this:
043214201214303
444112340123042
...
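To make this concrete, here is a minimal Python sketch of generating the initial population (the constant names are my own, not from the original):

import random

GENOME_LENGTH = 15     # one gene per wall combination
POPULATION_SIZE = 100  # agents per generation
MOVES = "01234"        # 0 = stand still, 1-4 = the four directions

def random_agent():
    # A genome is a string of 15 digits, each drawn from 0-4.
    return "".join(random.choice(MOVES) for _ in range(GENOME_LENGTH))

population = [random_agent() for _ in range(POPULATION_SIZE)]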
Now, we need a fitness function, a way of rating each agent on how well it performs. Again, there is no formal process for doing this; it is largely up to the programmer and depends on the specific problem you are trying to solve. First, we want to rate agents that make it through the maze higher. Second, we should award more points to the agents that make it through the maze in the fewest moves. Our fitness function would look something like this, where M is the number of moves the agent made and S is a 1 or a 0 indicating whether the agent succeeded in going through the maze:
F(M, S) = 1/M * S
However, we have one significant problem. What do we do if no agents in our initial population make it through the maze at all, so they all score 0? We could wait for mutation to produce some agents that make it through, but that likely means burning a lot of cycles waiting to randomly generate a solution. We could change our function so that agents are given higher fitness based on how close to the finish they got, but this could reward agents that end up near the finish without being on a path that actually reaches it.
This problem emphasizes why choosing a good fitness function is extremely important if your genetic algorithm is ever going to solve the problem. A better solution is to reward agents for the number of unique cells in the maze they visit. This way an agent gets higher fitness for exploring, and agents that just traverse the same area of the maze over and over will quickly die off. Here is the new fitness function, given U, the number of unique cells visited:
F(M, U, S) = 1/M * S + (1-1/U)
Notice that rather than simply multiplying by U, we add (1-1/U). Why? First, we don't want to just multiply by U, because then agents that did a lot of exploring but didn't make it through the maze would still get a fitness score of 0. Second, we don't want to simply add U, because this could give agents that explored but didn't succeed a higher fitness score than agents that actually made it through the maze.
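As a Python sketch (moves, unique_cells, and succeeded are my names for M, U, and S; the code assumes both counts are at least 1, which holds if the starting cell counts as visited):

def fitness(moves, unique_cells, succeeded):
    # F(M, U, S) = 1/M * S + (1 - 1/U)
    # The first term rewards fast, successful agents; the second
    # rewards exploration but never exceeds 1 on its own.
    return (1.0 / moves) * succeeded + (1.0 - 1.0 / unique_cells)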
Now we can run all of our agents through the maze and score them. Then we have to figure out how to handle natural selection -- how do we determine which individuals are fit to breed? The most obvious solution is to just choose the top X individuals, say the better half of the population (as it turns out, if you want each generation to have the same population size, you should breed half of the individuals -- why is explained further on). It turns out this is a really bad idea.
Why? When coding a GA, your goal is to reach convergence -- a point at which the population has become mostly stable and mostly homogeneous, with most agents able to solve the problem. If you only select the top X individuals, you will encounter premature convergence -- the population becomes homogeneous very quickly, probably just copies of the first agent to get a decent fitness score, which is often not the best solution the GA can come up with.
A better solution (and a more realistic one) is to use Roulette Selection, in which each agent has a probability of breeding proportional to its fitness score. This way the most successful agents still breed, but there is still a chance for the lower end of the population. This works better because it avoids prematurely handing the first working solution total control, so that a more optimal solution can be found. Additionally, one may employ Rank Selection if there is a large disparity between fitness scores.
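A roulette wheel can be sketched in a few lines of Python (fitnesses is a list of scores parallel to population):

def roulette_select(population, fitnesses):
    # Each agent occupies a slice of the wheel proportional to its fitness.
    pick = random.uniform(0, sum(fitnesses))
    running = 0.0
    for agent, score in zip(population, fitnesses):
        running += score
        if running >= pick:
            return agent
    return population[-1]  # guard against floating-point round-off

In modern Python, random.choices(population, weights=fitnesses)[0] does the same thing in one call.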
So, with Roulette Selection we figure we're in the clear and ready to get breeding. Wrong. We will most likely still fail. It has been mathematically proven that when working with binary genotypes (granted, ours is not binary, since it contains 2s, 3s, and 4s, but empirically this has been shown to hold in general), convergence will never happen unless we also employ Elitism in our selection process [1].
The problem with straight Roulette Selection is that since it works on random chance, there is always the possibility we'll throw away the most fit members of the population. Elitism simply gives a guaranteed seat in the next population to the top X individuals, usually a small number proportional to the population, say 4 in a population of 100.
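Elitism is just a sort and a slice. A minimal sketch, with elite_count = 4 as in the example:

def elites(population, fitnesses, elite_count=4):
    # Pair each agent with its score, order best-first, keep the top few.
    ranked = sorted(zip(fitnesses, population), key=lambda pair: pair[0], reverse=True)
    return [agent for _, agent in ranked[:elite_count]]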
OK, now we can begin breeding. The most common form of breeding used in GAs is genetic crossover. This method has the peculiar property that it always generates more than one child. To keep things simple, we will not have omnidirectional sex, and will only let children be the result of two parents. Once we've used the selection process to determine which individuals we want to breed, we pick a random number, in our case from 1 to 15, as our cutoff point. This is where we will "chop" the gene sequence of each individual, like so:
Random cutoff = 4
A. 0432 B. 14201214303
C. 4441 D. 12340123042
We have now chopped each of the original 15-character strings into two smaller strings of lengths 4 and 11. To perform crossover, we combine A and D to make our first new child, and combine C and B to make the second. Tada! We now have offspring.
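In Python, one-point crossover takes only a few lines. Note the cut point here runs from 1 to 14, since cutting at the very end would just copy the parents:

def crossover(parent_a, parent_b):
    # Cut both parents at the same random point and swap the tails,
    # so two parents always produce exactly two children.
    cut = random.randint(1, len(parent_a) - 1)
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])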
But the children aren't fully formed yet. If we used only the children we got from crossover, we would once again run into the nasty problem of premature convergence. Finding an optimal solution through crossover alone assumes that our initial population was big enough and varied enough for crossover to be able to reach that solution. Since the initial population is generated randomly, this is not a given.
To overcome this problem, we introduce another biological analogy: mutation. Mutation can be performed in several ways. The simplest, when dealing with binary genes, is to pick a random bit and flip it. If you're dealing with float values for genes, you may use bias mutation, in which you add a random number in a small range to the gene. In our case we're dealing with integer values, so we need to come up with our own method: we'll simply pick one random character in the string and assign it a new random number from 0-4.
Not all children should be mutated -- mutation hurts an offspring more often than it helps. Usually the programmer picks a constant chance of an offspring being mutated, say 10%. This way mutation still occurs, but not so often that it destroys good solutions.
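A sketch of this mutation scheme, with the 10% chance from above:

MUTATION_RATE = 0.10  # chance that a given child is mutated at all

def maybe_mutate(agent):
    # With probability MUTATION_RATE, overwrite one random gene
    # with a fresh random move from 0-4.
    if random.random() < MUTATION_RATE:
        i = random.randrange(len(agent))
        agent = agent[:i] + random.choice(MOVES) + agent[i + 1:]
    return agent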
Now we simply repeat the process over and over until we observe that the fitness function has converged. Typically the programmer defines a fitness value that is considered satisfactory and has the program quit when it is reached, or prints out the fitness of the best individual each generation and stops the program by hand once that value has peaked.
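Tying the pieces together, one generation might look like the sketch below. Here run_maze(agent) is a hypothetical simulator that returns M, U, and S for one agent; everything else was defined above:

def evolve(population):
    scores = [fitness(*run_maze(agent)) for agent in population]
    # Elitism: the best few agents survive unchanged.
    new_population = elites(population, scores)
    # Fill out the next generation by roulette selection,
    # crossover, and occasional mutation.
    while len(new_population) < len(population):
        parent_a = roulette_select(population, scores)
        parent_b = roulette_select(population, scores)
        for child in crossover(parent_a, parent_b):
            new_population.append(maybe_mutate(child))
    return new_population[:len(population)]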
Mathematical foundations, and how to improve your code
GAs are typically looked down upon in the machine learning community because, unlike methods such as support vector machines and hidden Markov models, they don't have a rigorous mathematical basis. Convergence is not guaranteed. Many academics look at GAs as "what you try when you haven't studied the problem enough." Mathematically proving when GAs will or will not work is difficult because so much of the work is arbitrary -- there is no set-in-stone algorithm for choosing the mapping for a specific problem, and there are several different types of selection, breeding, and mutation.
That said, there have been some attempts to give GAs a mathematical foundation with Schema Theory. From my lecture notes: "A schema is simply a string similar to our gene sequence, like ***10**. It has been proven that schemas with short fixed regions tend to increase exponentially, if instances of those schemas have above-average fitness." [2]
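For the curious, the result being paraphrased is Holland's schema theorem, which in its usual form reads:

E[m(H, t+1)] >= m(H, t) * f(H)/f_avg * [1 - p_c * d(H)/(L-1) - o(H) * p_m]

where m(H, t) is the number of instances of schema H in generation t, f(H) is their average fitness, f_avg is the population average, d(H) is the defining length (the distance between the first and last fixed symbols), o(H) is the order (the number of fixed symbols), L is the string length, and p_c and p_m are the crossover and mutation rates.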
Um, okay, you say, but what does that mean for me? One implication is that you should construct a mapping that maximizes the adjacency property: your GA will perform better if related genes sit next to each other. This doesn't really apply to the maze example, but in a more complicated environment a sequence of genes may represent several different behaviors. In that case it is better to keep the genes controlling a specific behavior next to each other; your chances of reaching convergence and an optimal solution increase.
Where to go from here
Genetic algorithms can be mixed with other forms of machine learning. For example, the genes could represent the weights in a neural net, or the number of hidden states in a series of hidden Markov models. Coming up with good mappings is considered somewhat of a black art, so experiment!
There is also a wealth of literature on genetic algorithms and genetic programming. A quick trip to the library or a Google search can have you drowning in useful information. Machine learning journals also frequently carry examples of researchers applying GAs to different problems.
If you're mathematically gifted, take a crack at expanding on Schema Theory or coming up with a different justification entirely. The more the situations in which genetic algorithms work and don't work are firmly understood, the less time programmers will waste and the more problems will be solved.
[1] http://www.quantlet.com/mdstat/scripts/csa/html/node53.html
[2] Lecture notes from my Machine Learning course at Kalamazoo College, taught by Nathan Sprague. (Also used throughout this article.)
Other sources: the textbook Artificial Intelligence: A Modern Approach by Russell and Norvig. Website: http://aima.cs.berkeley.edu/