Scaling Clustered Evolution: 1 + 1 = 4

This is a follow-up on:

Yesterday I got the chance to test run the cluster version of EvoLisa on a 64-bit, 4-core machine.
(Thanks to my colleague Ulf Axelsson for the help on this one.)

The results from the cluster version are quite interesting.
One might expect a maximum of 200% performance from using two nodes instead of one.
However, this is not the case: we are getting 400% performance.

We double the CPU capacity and get four times as much work done.

How is this possible?

This really confused me for a while.
But the reason is quite obvious once you figure it out.

Let’s assume the following:

* We use one core.
* We are running 1000 generations with 50 polygons over 10 seconds.
* 1 out of 10 mutations are positive.

This gives us approx 100 positive mutations over 10 seconds.

If we add one more core we would get:

* We use two cores in parallel.
* We are running a total of 2000 generations, with 50 polygons per node over 10 seconds.
* 1 out of 10 mutations are positive.

This would give us approx 200 positive mutations over 10 seconds. 
Thus, this would give the expected 200% performance.


But that is not what actually happens: we are NOT rendering 50 polygons per core in this case.
Each core is only rendering 25 polygons.

During those 10 seconds, each core is actually able to run 2000 generations instead of 1000, giving a total of 4000 generations over 10 seconds.
Which in turn results in approx 400 positive mutations during the same time span.

We have doubled the CPU capacity and halved the rendering work that needs to be done per node.
Thus, we get a 2 × 2 performance boost.
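The arithmetic above can be sketched as a tiny model. A sketch only: the numbers are the illustrative ones from this post, and the assumption that render time per generation scales linearly with polygon count is mine, not a measurement:

```python
def positive_mutations(cores, total_polygons=50,
                       single_core_generations=1000, hit_rate=0.1):
    """Positive mutations achieved in a fixed time window.

    Assumes render time per generation is proportional to the number
    of polygons a node draws, so halving the polygons per node doubles
    the generations that node completes in the same time.
    """
    polygons_per_node = total_polygons / cores
    speedup_per_node = total_polygons / polygons_per_node
    generations_per_node = single_core_generations * speedup_per_node
    return cores * generations_per_node * hit_rate

print(positive_mutations(cores=1))  # ~100 positive mutations
print(positive_mutations(cores=2))  # ~400 positive mutations
```

Doubling the cores doubles the generations, and halving the per-node polygon count doubles them again; that is where the 2 × 2 comes from.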

Pretty slick :-)

The 4-core machine was able to render the Mona Lisa image with the same quality as the last image in the evolution series in 1 min 34 sec!


21 thoughts on “Scaling Clustered Evolution: 1 + 1 = 4”

  1. I was wondering: why don’t you use a roulette-like mechanism to determine the next evolutionary step? You could assign probabilities to changing color, moving, adding & deleting, so it would have the same effect as how it’s done in the code you published, just without the empty steps. This saves some overhead: since every step changes something, you don’t need any tracking of whether a change occurred and a redraw is needed; you just redraw every time.

  2. Is it just me, or could you also double the render speed with one core by rendering 25 polygons, generating a background image, then rendering the other 25 polygons, and repeating?

    Also, you could play with the numbers a bit more. For example, even with dual core you could let each core use 40 polygons as background and 10 as foreground, then repeat with 10 other polygons. This might be even faster: there is no need to randomize all 50 polygons each run, and the potential overhead should also be lower (at least in theory; all I’m running here are thought experiments :D)
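The background-image trick from this comment can be sketched in miniature. This is purely illustrative code under toy assumptions (a “polygon” is just a number and fitness is the distance of a brightness sum from a target), not the EvoLisa renderer:

```python
import random

random.seed(1)

TARGET = 100.0  # toy "image": we just try to hit a target brightness sum

def fitness(total):
    return -abs(total - TARGET)

def evolve(polygons, background, generations):
    """Hill-climb one partition. 'background' is the frozen contribution
    of already-baked polygons, so each generation only re-renders
    this half of the set."""
    best = fitness(background + sum(polygons))
    for _ in range(generations):
        i = random.randrange(len(polygons))
        old = polygons[i]
        polygons[i] = old + random.uniform(-1.0, 1.0)
        f = fitness(background + sum(polygons))
        if f >= best:
            best = f
        else:
            polygons[i] = old  # revert a negative mutation

# phase 1: evolve the first 25 polygons against an empty background
first = [1.0] * 25
evolve(first, background=0.0, generations=2000)
baked = sum(first)              # bake phase one into a background image

# phase 2: evolve the other 25 polygons on top of the baked background
second = [1.0] * 25
evolve(second, background=baked, generations=2000)
error = abs(baked + sum(second) - TARGET)
```

Each phase only touches 25 of the 50 “polygons” per generation, which is the source of the claimed doubling on a single core.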

  3. Also, this brings me to another idea I’ve been thinking about before: randomize the same polygon a number of times in a row. Especially when working with a background image, this could be a lot faster than randomizing one of the fifty each time. In particular, when a polygon just increased the fitness level, the chance that a (small) change to the same polygon will increase the fitness level again is higher.

    Thus, you could for example randomize one polygon 10 times, and if it improved the fitness level, let it randomize another 10 times. It might be a good idea to stop after a while anyway, as you don’t want to stay stuck with the same polygon forever.

    This trick can also be repeated on a lower level. For example, if a small movement of the polygon just increased the fitness, try to repeat that step again for an even better match; or when a colour change just had a positive effect, make another (smaller) colour change. Having exactly the right colour for a vector is a dangerous thing, by the way: I’ve seen situations where a single vector was half red, half black (thus, dark red), and it couldn’t be moved away from the red square or black square it was placed on.

    Have you ever thought about limiting the colour palette to colours found in the original image? Or changing the alpha transparency of vectors? Or changing the order in which the vectors appear? Not sure if any of that is already being done, but it might be interesting to try, to give the ‘creature’ more buttons to play with :D
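The “keep hammering the same polygon while it helps” idea from this comment can be sketched as follows. This is a hypothetical hill-climbing helper over a toy genome of numbers, not the actual EvoLisa mutation code:

```python
import random

random.seed(42)

def focused_mutate(genome, fitness, index, step=1.0, budget=10):
    """Repeatedly mutate one gene; every improvement buys another
    'budget' attempts, with a hard cap so we never stay stuck on
    the same gene forever (as the comment suggests)."""
    best = fitness(genome)
    attempts, total, cap = budget, 0, 10 * budget
    while attempts > 0 and total < cap:
        attempts -= 1
        total += 1
        old = genome[index]
        genome[index] = old + random.uniform(-step, step)
        f = fitness(genome)
        if f > best:
            best = f
            attempts = budget    # success: grant more attempts
        else:
            genome[index] = old  # revert a negative mutation
    return best

# toy fitness: negative distance from a target genome
target = [3.0, -2.0, 1.0, 0.5, 4.0]
fit = lambda g: -sum(abs(a - b) for a, b in zip(g, target))
genome = [0.0] * 5
before = fit(genome)
after = focused_mutate(genome, fit, index=0)
```

Since failed mutations are reverted, the fitness can only improve, which matches the “directed evolution” flavour discussed below.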

  4. @ Matthijs & Dorus

    Yes you are actually right, I didn’t think of that, but what you say makes sense.

    It should be possible to run 25 polygons for 5 sec and then another 25 polygons, using the first set as a background, for 5 more sec on the same machine, and double the perf on a single core, since the number of generations executed would double.

    I have to try this :-)

  5. @ Pascal

    The dirty tracking has changed in my MPI version, so the mutate method keeps track of this and loops until at least one mutation has occurred.

    So the consumer never needs to deal with this.

  6. @ Dorus

    Regarding mutating the same polygon multiple times if it results in a fitness increase;

    That also makes sense, I think this might be how directed evolution works.

    This should be applicable on color changes and pretty much every form of mutation.

    e.g. if a polygon gets lighter and that results in a fitness increase, then one could try making it even lighter and see if that helps.

    I guess this might stray away a bit from the evolutionary approach and make it more brute force’ish.
    (It wouldn’t be completely blind evolution any more)

    But it should have a positive effect.

  7. @ Matthijs & Dorus

    Regarding the single core split up again:

    If this works on a single core.
    Then it should also be possible to apply it to each core in this clustered version.

    And in fact, it should be possible to apply the same _again_ on each of those partitions.

    And again and again.
    So we just get a huge recursive tree of changes.

    Too bad my brain is unable to cope with this :-)
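The recursive partitioning described here can be pictured as a tree. A structural sketch only; the real MPI version would also need to bake and exchange background images at every level:

```python
def split(polygons, min_size=5):
    """Recursively halve a polygon set into a tree of partitions,
    stopping once a partition holds at most min_size polygons."""
    if len(polygons) <= min_size:
        return polygons
    mid = len(polygons) // 2
    return [split(polygons[:mid], min_size), split(polygons[mid:], min_size)]

# the 50 polygons of the post, split into the "huge recursive tree"
tree = split(list(range(50)))
```

Each leaf would evolve independently against a background baked from all the other leaves, and each internal node merges its children back into a background image.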

  8. First of all: congratulations on a very neat idea!

    When you say that you were able to render the Mona Lisa image with the same quality as the last image in the evolution series (1 min 34 sec), I assume you do not mean that 904000 mutations were performed…

    How many mutations per second (per core) do you achieve with your latest code?

    (I have started to implement a similar version using Python [which is bound to be significantly slower, even though I use externally compiled libraries] and achieve currently approximately 60 mutations per second.)

  9. @André Roberge

    No, the numbers in the text are only examples to show the concept.
    (The time to complete the Mona Lisa image on a 4-core machine is real, and so is the 4-times perf, but the generations per second etc. are example values to illustrate the concept only.)

    The original Mona Lisa app was far from optimized, and the mutation rates and factors were quite bad.

    So those have been finetuned.

    And since the generations are now partitioned into smaller parts that run in parallel, it’s hard to compare them to the generations in the single-core original.

    e.g. when using 4 cores, it’s more like 4 different organisms co-evolving that together create the entire image.

    And all of those 4 have independent generations.

  10. About evolution vs brute force. You’d be surprised how many organisms push their evolution in the right direction. Parts of their DNA have been sealed shut, and when the slightest mutation touches them, it’s corrected right away by the cell; other parts of the DNA are built in ways that make them more likely to change than to stay the same (well, OK, that might be a bit exaggerated). For example, in dogs the sense of smell is changing rapidly, while humans are losing their sense of smell and our brains are allowed to evolve rapidly. Now with “rapidly” you have to think in the time frame of millennia, not decades, but anyway :D

    This might sound like cheating a bit, but hey, it’s survival of the fittest, so, only the biggest cheater wins :D

    Back to your repeating-the-trick-on-one-CPU thought. Yes, that’s possible; it’s what I meant with using 10 out of 50 instead of 25 out of 50 polygons. You can even use 1 of the 50, but chances are this is not going to give a very interesting evolution path, as one polygon doesn’t give a lot of room for a better match.
    Also, the biggest reason you’re probably speeding up 4 times is that each core only has to draw half the polygons, so it can draw the image twice as fast, using the background image for half the polygons and drawing the other half itself. Now if you repeat the splitting, you come to a turning point where making the background image takes more time than it saves. When you are making 1000 evolutions on one background image, the time to create the background image isn’t that important; when you split it up and try to make a new background image every 10 evolutions, it only has 10 evolutions to pay itself back. Anyway, the only real way to find out if any of this is correct is of course to program it and play with the different values ;)
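The turning-point argument in this comment can be put into a toy cost model. All the constants here are made up for illustration; only the shape of the curve matters:

```python
def total_generations(polygons=50, partitions=1, budget=10_000.0,
                      split_overhead=1_000.0):
    """Generations a single core completes in a fixed time budget when
    the polygon set is split into partitions rendered against baked
    backgrounds. Render cost per generation is proportional to the
    polygons per partition; every extra partition pays a fixed
    background-baking overhead."""
    per_partition = polygons / partitions
    usable = budget - split_overhead * (partitions - 1)
    return max(0.0, usable / per_partition)

for p in (1, 2, 4, 8):
    print(p, total_generations(partitions=p))
# 1 -> 200, 2 -> 360, 4 -> 560, 8 -> 480:
# splitting helps until the baking overhead starts to dominate
```

With these made-up constants the optimum sits around four partitions; with a cheaper baking step it would move further out, which is exactly the “play with the different values” point above.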

  11. What about using CUDA/Close To Metal/GPGPU/BrookGPU/Stream processing?

    Given that you are using the same algorithm to process different data, it qualifies for SIMD.
    Actually I think that all genetic algorithms qualify for this paradigm.

  12. There is actually a competition going on at the NVIDIA forums where some guys are doing this in CUDA.

    And apparently their approach will kick my MPI version’s butt :-)

    but I haven’t tried it myself yet.

  13. About my CUDA version… I said “when it’s finished”. ;) So far it isn’t. So you’re still king of the mountain.

  14. So, any plans to release updated versions of the program for download? I still keep playing with the old version, as it’s very interesting to watch it work. It would be wonderful to have the new, much faster version!

  15. At you can find a file with Dan Bystrom’s improvements, plus an additional improvement (using uints instead of doubles when the fitness drops below 3/4 of uint.MaxValue). I added an additional feature which could be implemented: it allows an interlaced subset of the pixels to be tested for fitness. I also included an open-source C program I found on the Internet that generates fast random numbers, but that would not improve the program’s speed much due to other bottlenecks in the system, so I did not implement it.

    Right now, rendering is the major bottleneck in the code after the updates are applied (around 350 renders occur per second on my computer, whereas 3,500 mutates or 1,750 fitness evaluations can occur in the same time period; I tested that by commenting out code temporarily and counting to ten while the algorithm was running, with a margin of error around 10%, which is negligible in this case). You may want to try using System.Windows.Shapes.Polygon instead of System.Graphics for the polygons, since then each polygon could be re-rendered individually at will.

    By the way, I think that using uints instead of doubles was the 60% improvement that Dan was talking about, although the improvement would be more along the lines of 3X if the renderer were faster.
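The interlaced-fitness feature mentioned in this comment might look something like the following sketch. This is a hypothetical helper over flat lists of RGB tuples, not the actual patch being described:

```python
def fitness_interlaced(candidate, target, stride=2, offset=0):
    """Sum of squared channel differences over an interlaced subset of
    the pixels (every stride-th pixel, starting at offset).
    Pixels are flat (r, g, b) tuples; smaller is better."""
    total = 0
    for i in range(offset, len(candidate), stride):
        for c, t in zip(candidate[i], target[i]):
            d = c - t
            total += d * d
    return total

target = [(10, 20, 30), (40, 50, 60), (70, 80, 90), (0, 0, 0)]
cand   = [(10, 20, 30), (41, 50, 60), (70, 80, 90), (0, 0, 10)]

full = fitness_interlaced(cand, target, stride=1)  # all pixels -> 101
half = fitness_interlaced(cand, target, stride=2)  # pixels 0 and 2 -> 0
```

Testing only every other pixel roughly halves the fitness cost at the price of occasionally misjudging a mutation, which is the trade-off the feature offers.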

  16. Here is an ongoing discussion on improving your program, possibly by 25X or more. Also, have you ever thought of having the background be a polygon and having its color mutate? That way the alpha could also be preserved from PNGs (albeit with a 33% slowdown in the fitness calculation due to all four channels [RGBA] being calculated instead of just RGB). It would also make it unnecessary to have many polygons just to create a non-black background.
