Hill climbing, GP, GA, EP, ES

[Edit]
See Maarten’s comments for more info
[/Edit]

There have been allot of discussions about what sort of algorithm the EvoLisa app uses.
Quite a few have claimed that it is a hill climber.

So I think it’s time to set everything straight here.

This is the definition of Hill climbing from wikipedia:

Hill climbing attempts to maximize (or minimize) a function f(x), where x are discrete states. These states are typically represented by vertices in a graph, where edges in the graph encode nearness or similarity of a graph. Hill climbing will follow the graph from vertex to vertex, always locally increasing (or decreasing) the value of f, until a local maximum (or local minimum) xm is reached. Hill climbing can also operate on a continuous space: in that case, the algorithm is called gradient ascent (or gradient descent if the function is minimized).*.

If we look at the EvoLisa app; there are no vertices, there is no graph, there are no edges that encodes nearness.
There is no continious space.

There is an individual with its own set of DNA, and that individual can breed offspring that can be quite far from the parent.
e.g. Each polygon in a drawing could change shape and size, each polygon could change color.
(The likeliness that it happens might be extremely small, but it is not impossible)
There are no paths that can be followed or backtracked like in a graph.

So if we stick to the wikipedia definition of hill climbing, then we can safely say that EvoLisa does not use a hill climber.

So what about my claims about Genetic Programming?

The wikipedia definition of GP is:

GP evolves computer programs, traditionally represented in memory as tree structures. Trees can be easily evaluated in a recursive manner. Every tree node has an operator function and every terminal node has an operand, making mathematical expressions easy to evolve and evaluate. Thus traditionally GP favors the use of programming languages that naturally embody tree structures (for example, Lisp; other functional programming languages are also suitable).

Non-tree representations have been suggested and successfully implemented, such as linear genetic programming which suits the more traditional imperative languages [see, for example, Banzhaf et al. (1998)]. The commercial GP software Discipulus,[6] uses AIM, automatic induction of binary machine code[7] to achieve better performance.[8] MicroGP[9] uses a representation similar to linear GP to generate programs that fully exploit the syntax of a given assembly language

OK, the first statement says GP evolves computer programs, that is sort of fuzzy, a program can be allot more than computational expressions.
But the next highlighted part implies that GP should deal with computational AST’s only.

IMO, for loops, variable declarations and other statements could very well be used in GP .
An AST that contains statements for drawing polygons onto a canvas when evaluated is also a program.
I could very well create a parser for a language that generates that very same AST and everyone using it would agree that it is a language and that they are writing programs when using it.

But let’s ignore what I think, lets stick with the definition here.

If we accept that GP only deals with computational programs and not imperative execution or declarations, then EvoLisa is not GP.

So what would be the correct definition of the algorithm then?

The closest definition would be:
http://en.wikipedia.org/wiki/Evolutionary_programming
or
http://en.wikipedia.org/wiki/Evolution_strategy

EvoLisa fits the 1+1 Evolutionary Strategy.
But it also fits the μ + μ EP since it relies on an AST with mutable values.

Which one of those two it is depends on what definition you set for “a program”
When is an AST an AST and not just data?

e.g. are these statements data or code? “DrawPolygon ( … )” or “print ‘foo'” .

//Roger

12 thoughts on “Hill climbing, GP, GA, EP, ES”

  1. More importantly, what it does is more significant than what it is called. If there is agreement of the underlying behaviour, there is little reason to argue over which bucket to put it in.

    That doesn’t stop people. There were huge arguments over whether or not Doom was 2D, 2.5D or 3D, none of the newsgroup combatants disagreed as to the capabilities of the engine, just what to call it.

    All in all very silly.

  2. Technically, what you do is called a ‘stochastic hillclimber’. The 1+1 ES also falls in this category. The 1+1 ES is very old (early 1960’s), and is mainly studied for theoretical reasons.

    There’s no particular stigma attached to using a hillclimber in Evolutionary Computation, though you might, with minimal changes, get a better, more ‘genetic’, algorithm.

    In practice, the most simple ES used in the field is the 1+6 ES, i.e., generate 6 offspring, keep the best of the 7 (including the parent). This also allows co-adaptation of mutation rates (increase mutation rate if more than one in five generations leads to an improvement, decrease mutation if less than one in five generations leads to an improvement).

    ES/EP typically operate in the space of real-valued parameter optimization (your algorithm has some of that). Genetic Algorithms typically operate on discrete valued ‘data’ points, often (but not necessarily) encoded in bitstrings (your algorithm has some of that too). Genetic Programming works with programs, of whatever form (your algorithm has some resemblance to that as well). These are indeed just names, what is important here is the level of accuracy you achieve, and the level of compression.

    What I’m really curious about is how much compression you get. Say you run Mona-Lisa, zip the coefficients of the polygons, how big is the resulting file?

    Just a few thoughts…

    -Maarten-

  3. The definition of ’stochastic hillclimber’ states that it picks one of the positive vertices at random.

    Not that it creates a _new_ random positive vertex.

    I could very well be wrong on this one, but AFAIK, hill climbers, even stochastic ones, operate on graphs.

  4. You don’t need to use an explicit graph in order to be ‘accused’ of traversing it. Search is usually defined on graphs in order to reason about search in the form of a ‘search tree’.

    In your case, the graph is implicit, and of high branching factor. You can however define a graph on this space, and reason that you implicitly traverse this graph. If you have a solution (parent), you can in principle enumerate all possible children, and reason that your method of generating one of them at random is an optimization of an equivalent graph-based algorithm.

    You can define a search algorithm in one of two ways. One by defining a graph and traverse the vertices, another by defining a representation and operators (mutation/crossover) on these representations. This latter process implicitly defines a neighbourhood function on an implicit graph. This approach is usually taken in problems with a large branching factor. Mathematically, the two approaches are equivalent. In practice, a graph-based algorithm is not always feasible.

    This is an ‘in principle’ thing, not practical in any way. But then again, these in principle arguments are valid.

    There’s however little point to argue this (even though I just did :). The work you show is extremely interesting and, due to how the algorithm is set up, extremely suited for a variety of algorithms: stochastic hillclimbers, simulated annealing, evolutionary strategies, genetic algorithms and genetic programming.

    Most of the art of optimization lies in the definition of a representation, a fitness function, and the operators acting on the representation. Your method is cool because it shows how a very simple representation can be used to approximate a picture. There’s probably ample opportunity for devising operators that traverse this space of polygons more efficiently, but the proof of principle stands.

    I’m only aware of work that tries to evolve Iterated Fractal Systems (IFS) for approximating pictures. These did not give good results. If your method can lead to high compression of pictures in feasible time, there’s something quite interesting going on here.

    If I had the time, I would love to try out some set of other algorithms on this problem domain. It seems to be possible.

    Whether your algorithm use a stochastic hillclimber or a population of points is beside the point, really. You have a representation, a fitness function, and a whole set of possible operators on these representations. And you have results.

    This possibly opens up a whole new application area of image compression, for which I congratulate you.

    -Maarten-

  5. Although I haven’t studied the code in depth, I have to agree with Maarten. I think “stochastic hill climbing algorithm” is probably the most accurate way to describe what you’ve done.

    That’s not to take away from your achievement: you’ve stumbled upon a fantastic way to visualize these types of optimization algorithms (especially in movie form).

    I’d love to see you apply more advanced genetic programming techniques. For instance, make each polygon an individual in the population (i.e. the population size would be 50 instead of 2). Encode the polygon’s attributes as a string of 1’s and 0’s, their DNA. Let the polygons breed; allow the population size to naturally fluctuate and equilibrate. Incorporate exchange of DNA via recombination. Have a mutation rate when generating offspring. Allow only the most fit organisms to survive (n.b. your definition of fitness will have to change to be applicable on a single polygon level). You can try playing with different organism and environment constraints. In the end you might have some wicked movies.

  6. @Maarten

    “In your case, the graph is implicit, and of high branching factor. You can however define a graph on this space”

    I agree that it is possible to define a graph on this space, and that it is mathematically the same.

    Anyway, if implicit graphs are allowed in hill climbers, then what would the difference between a 1+X ES/EP and a hill climber be?

    Or even a X+Y ES/EP.

    Those would in that case just be a pool of hill climbers, climbing the fitness landscape at different positions.

    Under what conditons would a hill climber not be an ES or vice versa ?

  7. @Maarten

    Another thing that I have a hard time understanding here;
    Under what conditions would a 1+X ES/EP be better than a 1+1 ES/EP?

    I get that the mutation rate optimization might be good in theory.
    But in practice, it is the fitness function that will be eating CPU cycles.
    And evaluating x children and then discarding all but one of them will never ( ?) be faster than just pick the first positive change you find.

    Also the mutation rate optimization, doesn’t that just make the individ(uals) freeze when they reach local maxima?

    if you get stuck in local maxima, and then start decreasing the mutation rate, then you will most likely never get out of it.
    ( I did try it and the results was horrible compared to the 1+1 version )

  8. As to your point about an 1+1 ES being a stochastic hillclimber. Yes, that is the case. The 1+1 ES is about the simplest thing labeled evolutionary, and interestingly enough, it *is* a stochastic hillclimber. Other 1+1 algorithms are simulated annealing and in general markov chain monte carlo methods.

    The 1+1 ES algorithm was used simply because it needed to be implemented using a coin and pencil and paper. The guys inventing it didn’t have access to a computer (we’re talking 1961).

    After looking through the code (actually, the C++ port found in a CUDA forum), I can see why you’re not getting good results with the more involved ES style algorithm (1+lambda). These algorithms are developed for real-valued parameter optimization: given a vector of fixed size, find the optimal values. The heuristic I gave works in that area, because there is only a single mutation parameter, the standard deviation. The idea is that by doing \lambda tries in the current neighborhood, you collect some statistics for the mutation rate, and this allows you to set it better. This assumes a uniform method of doing mutation. Quite different from yours. I didn’t realize this until I started looking into the method deeper.

    Some underlying assumptions for the 1 in 5 heuristic is that the search space is continuous and that small steps lead to small differences in fitness. This in not necessarily the case with your operators, as adding a brand new polygon can be a huge change. The 1/5 rule is not about probability of doing mutations, it’s about the magnitude of the mutations. In your case it would not be about the probability of adding a polygon or not, but about how you would change the coordinates of existing polygons.

    The 1+lambda form are simply local optimizers (and I would still classify these as stochastic hillclimbers).

    Your representation is however discrete and has variable length. You also have many mutation operators, not just one. That makes this whole ES thing less applicable. What you’re working with looks more like what the field of Genetic Programming is working with (and we’re back to were you started, It’s GP:). The difference however is that with GP (my field), we’re used to work with large populations (500 is considered low), and we use some form of crossover. In your case crossover could simply be a case of taking two arrays of polygons, pick a cutoff in each array, and swap the tails of the arrays.

    Then use a population of 500 individuals, perform steady state tournament selection (pick k individuals at random, select the best. Pick another k individuals at random, select the best. Crossover the two individuals to create a single new individual. Then mutate this individual using your regular mutation parameter. Then pick another k individual, and replace the worst with this new individual. Set k to 5).

    No guarantees that this will work better than your 1+1 algorithm, but it will make it a more global search. It should be easy to implement with what you’ve got at this point.

    What the population should give you is a balance between exploration and exploitation, while the hillclimbers are about exploitation purely. If the crossover operator makes sense in the domain, you might get some improvement using this technique. If not, maybe 1+1 is with this representation/operator combination the best approach.

    -Maarten-

  9. Roger, and now to actually answer your questions instead of giving unsolicited advice:

    > Anyway, if implicit graphs are allowed in hill
    > climbers, then what would the difference between a > 1+X ES/EP and a hill climber be?

    No difference whatsoever

    > Or even a X+Y ES/EP.

    An X+Y ES/EP is different from a simple hillclimber. In that case you do not have a set of parallel hillclimbers anymore, but a population of ‘individuals’ competing for resources. Your resources are X slots. Whenever an individual (climber) is capable of producing offspring that is better than other climbers, the offspring will take over the slot of some unfortunate other climber. If that offspring (as well as the original parent) produces good offspring as well, it will take over even more slots in the population. Finally taking over the entire population. All these hillclimbers that couldn’t improve that efficiently will go the way of the Dodo (quite literally). These hills are no longer climbed.

    However, if you want to state it in good old-fashioned search terms, all this population based search with fancy genetic analogies is not much different from a search technique called ‘beam search’.

    -Maarten-

  10. I started this thread in the CUDA forums because I am highly fascinated by this entire topic. I hope I got some people interested. Here is the URL.

    http://forums.nvidia.com/index.php?showtopic=83775

    Apologies for hotlinking to the images on your blog. Maybe that wasn’t the kindest thing to do.

    I am currently working on a simple CUDA based polygon renderer (convex polygons only, non self intersecting). I am doing this for monochrome images currently. By rendering right in the CUDA kernel
    I can evaluate the error function instantly and abort the tile based rendering as soon as I exceed the threshold for the “best” image so far.

    Christian

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s