When I posted about the Mona Lisa evolution demo I got a request to make it possible to run it in parallel.
At first I thought: “How hard can it be?”
But it turned out to be more than hard; I’m starting to think that it is impossible.
The first approach that everyone points to is:
1) Run separate islands in parallel.
The idea is to run a separate island per node in a cluster and let that island have its own population of evolving individuals.
The hope is that adding one node will solve the problem twice as fast, and adding another one three times as fast.
But let’s say that it takes about 10 to 15 minutes to solve the problem on a single machine; then it is very likely that it will take 10 to 15 minutes on the other machines too.
So you will not gain much by adding nodes to this solution, since each node will solve the problem in about 10 minutes at best.
Even if you throw hundreds of nodes at it, the best node will still average in at about 10 minutes.
This is probably the worst solution.
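The flat scaling of isolated islands can be seen in a toy simulation. This is a hypothetical model, not the real EvoLisa code: it assumes a fixed 1-in-10 chance per generation that a mutation improves fitness (roughly the rate mentioned later), and counts accepted mutations as "fitness":

```python
import random

GENERATIONS = 1000
P_GOOD = 0.1   # assumed chance that any one mutation improves the image
NODES = 10

def run_island(seed):
    # fitness here is simply the count of accepted (improving) mutations
    rng = random.Random(seed)
    return sum(1 for _ in range(GENERATIONS) if rng.random() < P_GOOD)

single = run_island(1)
best_of_cluster = max(run_island(seed) for seed in range(2, 2 + NODES))
print(single, best_of_cluster)  # the best island is only marginally ahead
```

After the same number of generations every island sits near the same fitness level, so the best of ten islands beats a single machine only by statistical noise, not by anything close to a factor of ten.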
The next solution looks fine in theory:
2) Run islands in parallel and report back to the root node after each generation.
This is a similar solution to the first one, but instead of running in isolation, the islands report back to the root node after each generation and let the root decide which individuals to keep or discard.
Let’s say that one out of ten mutations is positive; then each node we add increases the expected number of positive mutations per generation by one tenth.
So if we have 10 nodes, then we are very likely to get a positive mutation each generation.
This should solve it, right?
NOPE! (sort of)
The problem with this solution is that it is extremely chatty; the performance gain you get from it is eaten by the overhead of the communication.
There is a lot of data that needs to be passed around each generation.
This could work well if the fitness function takes a lot more time than the communication, but in my case it doesn’t.
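A back-of-the-envelope model makes the trade-off concrete. The numbers are hypothetical, and it assumes the root exchanges data with each node serially every generation:

```python
def effective_speedup(nodes, t_fitness_ms, t_comm_ms):
    # one global generation: all nodes evaluate in parallel (t_fitness_ms),
    # then the root talks to each node in turn (nodes * t_comm_ms);
    # the cluster tries `nodes` mutations in that time, a single machine
    # tries one mutation per t_fitness_ms
    return nodes * t_fitness_ms / (t_fitness_ms + nodes * t_comm_ms)

print(effective_speedup(10, 500, 2))  # slow fitness: close to 10x
print(effective_speedup(10, 5, 2))    # fast fitness: only 2x
print(effective_speedup(10, 5, 10))   # comm dominates: slower than one machine
```

The pattern matches the observation above: per-generation reporting only pays off when one fitness evaluation is much more expensive than one round of communication.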
3) Partition the workload and use deterministic evolution per node.
I had some big hopes for this one.
The idea was to avoid sending as much data between nodes by using deterministic evolution per node (init the random generator with the same seed).
So each node evolves the _same_ image, but only renders and calculates fitness for a small part of it, and then reports the partitioned fitness back to the root node, which decides if the image should be kept or not.
This way, only the fitness value and an accept message would be passed between nodes and the root.
The communication overhead is fairly small here, so that is all nice.
So if I have two nodes, one would think that rendering each partition would take only half the time it takes with a single node.
However, this turned out to be wrong too: since a lot of the polygons in the image cover the area of more than one node, those polygons have to be (partially) rendered by multiple nodes.
So instead of getting a 100% performance boost from this, the actual rendering gain was only about 35%.
35% faster rendering minus the overhead of communication gives this approach a total performance boost of 25–30% per added node.
Not that impressive at all, but it does work.
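The seed trick in approach 3 can be sketched like this. It is a minimal model, with a random float standing in for a full mutated polygon DNA:

```python
import random

SHARED_SEED = 1234
NODES = 4
GENERATIONS = 5

def candidate_stream(generations):
    # every node seeds its mutation RNG identically, so each one derives
    # the exact same sequence of candidates without any DNA on the wire
    rng = random.Random(SHARED_SEED)
    return [rng.random() for _ in range(generations)]

streams = [candidate_stream(GENERATIONS) for _ in range(NODES)]
assert all(s == streams[0] for s in streams)
# each node then renders and scores only its own partition and sends one
# partial-fitness float back; the root sums them and broadcasts a single
# accept/reject bit
```

Since the streams are identical on every node, only a float and a bit ever need to travel per candidate, which is why the communication overhead stays small here.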
There are also variations on the above approaches, but all of the ones I’ve tried fail on either the communication overhead or a bad understanding of probability theory.
One such example is “Don’t communicate that often; run a batch and select the best one from that batch”, which falls under the probability theory part ;-)
How about making it really genetic?
Each node has a pool of the most fit individuals. One of these is used to generate the next individual that replaces another one in the pool. Of course, the most fit individuals have more chance to procreate and less chance to die. From time to time individuals migrate (or get copied) to another node.
I think you could also do interesting experiments with this:
* How do isolated nodes evolve?
* Give each node a slightly different goal and see what happens.
Instead of trying to solve the entire problem with multiple PCs, why not split the problem into parts and let each PC solve its own part?
Example: let’s say we have 4 PCs
– we divide the entire Mona Lisa image into 4 equally sized parts (e.g. top-left, top-right, bottom-left, bottom-right)
– we hand over a specific part to a PC and let it generate/evolve the best solution for that specific part
– when the solution of a single PC is good enough, or when major progress has been made, the PC reports its results back to the master, which can then compose/update the solution for the entire image
Of course the end result will contain more polygons than when using a single PC, but in theory you should get much faster results.
You could even go further and let the root machine evolve the final solution. Its fitness function should be to reduce the polygon count while keeping the fitness of the solution the same.
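The split described above could look like this. A sketch only: `quadrants` is a hypothetical helper, and odd dimensions are handled by giving the right/bottom parts the remainder:

```python
def quadrants(width, height):
    # divide the target image into 4 (as near as possible) equal parts,
    # each one an (x, y, w, h) rectangle handed to one PC as its own
    # independent evolution target
    w2, h2 = width // 2, height // 2
    return {
        "top-left":     (0,  0,  w2,         h2),
        "top-right":    (w2, 0,  width - w2, h2),
        "bottom-left":  (0,  h2, w2,         height - h2),
        "bottom-right": (w2, h2, width - w2, height - h2),
    }
```

The rectangles tile the image exactly, so the master can paste each PC’s best solution straight into its region when composing the full picture.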
I thought of a variation on Approach 2: Each node does 10,000 generations, and then after that reports back to the root node the fitness and DNA of that final generation. The fittest from all the nodes is then redistributed by the root node to all the other nodes for the next iteration.
This doesn’t result in a faster arrival at an optimum-ish result, but it does kind of parallelize it a little bit. Kind of. Ish.
That solution will suffer from the same problems as my #1.
It will be slightly faster since it distributes the best individual of x generations, but it will still only result in a few percent performance gain.
My next attempt will be to recreate the entire engine so that I get a tree structure with polygons that have a small random offset from the node above it.
This will make it possible to get branches in the tree that render the face or the background.
And if that works, genetic recombination will be a good option.
But before I have that tree structure, genetic recombination is useless.
Genetic recombination is only good once you have genes that solve/represent a specific task.
This approach gives a linear speed increase, which is very nice.
But it will break the original concept.
The original concept allows polygons to overlap the entire area.
If each worker node is only allowed to evolve a specific region, then all the polygons will be confined to those regions.
This will result in some nasty looking edges between those regions.
Those edges can of course be minimized by letting the root machine evolve the final image as you say.
But the more worker nodes you throw at it, the harder the root machine will have to work in order to remove those edges.
And this will most likely also fail the original criterion of max 50 polygons.
So, it works but breaks the rules.
And merry x-mas to all!
Merry X-mas to you too!
Another option would be to use the GPU instead of CPU.
When using the GPU instead of the CPU, the application should be tens, if not hundreds, of times faster (in theory).
NVIDIA has CUDA, which is a C-like language,
but I’m sure ATI/AMD also has something like this.
Of course it requires a CUDA-capable GFX card, but most newer NVIDIA cards do support it.
But the cluster computers will most likely not have an NVIDIA card in them :-)
It seems like the ratio of communication overhead to fitness compute time is too large for this algorithm to parallelize well in the ways you’ve mentioned.
If the algorithm is written in C or C++ and there is only very rare File I/O (i.e. to save the jpg), I would expect this to run really fast on a single (dedicated) processor.
The fitness compute can probably be sped up. Again, I haven’t looked at your code, but it seems possible to only compare the parts of the image that have changed, instead of a global pixel-by-pixel comparison. Also, you might try sampling a relatively small number of pixels instead of comparing each one (i.e. see if 100 random pixels improved as opposed to all of them).
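The sampling idea could look like this. A sketch with flat pixel lists, where `sampled_error` is a hypothetical stand-in for the real fitness function:

```python
import random

def full_error(candidate, target):
    # the original global comparison: every pixel, squared difference
    return sum((c - t) ** 2 for c, t in zip(candidate, target))

def sampled_error(candidate, target, rng, samples=100):
    # estimate the same error from a small random sample of pixels,
    # scaled back up to the full image size
    n = len(target)
    idx = [rng.randrange(n) for _ in range(samples)]
    total = sum((candidate[i] - target[i]) ** 2 for i in idx)
    return total * n / samples

target = [i % 256 for i in range(10_000)]
near = [p + 1 for p in target]   # every pixel off by 1
far = [p + 3 for p in target]    # every pixel off by 3
rng = random.Random(0)
# sampling still ranks the candidates correctly at 1% of the cost
assert sampled_error(near, target, rng) < sampled_error(far, target, rng)
```

The estimate is noisy on real images, so a cheap sampled comparison could act as a pre-filter, with the full comparison run only for candidates that look promising.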
The faster the fitness evaluation gets, the larger the communicational overhead will be compared to it.
My current local version is reasonably fast, it does the same job in 7-8 minutes that the original code did in 3 hours.
My problem ATM is that I’m unaware of any other way than those I describe to run this in parallel w/o breaking the original concept.
(50 polygons, and polygons may cover the entire area of the image… w/o any nasty edges between areas that have been partitioned)
I don’t agree that if it takes 1 machine 10 minutes to brute force its way through enough combinations to find a successful mutation, that it would also take 10 minutes for 2 machines to find such a good combination – unless they were searching through the same combination set!
If you were sure that they searched exclusive combination sets, it should take roughly 5 minutes instead (not exactly though, a little bit more, but nearly)…the more overlap you have between the searched combo sets, the more the time will start to grow, from 5 mins until it reaches 10 mins when there is 100% overlap.
….You didn’t give your islands the same random seed, did you? ;-)
Seriously though, the risk for overlap (that the different islands/nodes in the cluster come up with and validate the same mutations) should be fairly small given the pretty big space of possible mutations. Which means that the approach should work fine. But if you do not get good returns on this approach it seems to me that it would probably pay to see how you could reduce the overlap (ensure each island gets unique mutations).
It doesn’t matter if you agree or not.
It’s probability theory, and it does not agree with you ;-)
And I think you misread what I wrote,
I did not say find “a successful mutation”.
Scenario 1 was about running multiple generations in sequence.
(That either runs isolated or reports back after x generations)
And that works poorly simply because the probability to find successful mutations on each island is the same.
So after 1000 generations, it is very likely for each island to have gotten to almost the same fitness level.
Of course it is agreed that if you ask each node to run for 10 minutes *regardless* of when it finds a mutation, and you will only use the result of one node after 10 minutes (the best one), then no perf gain will be seen: it takes 10 mins to find the mutation you want to use and all the work of the other nodes was simply done in vain – but this is blindingly obvious and I did not think this alternative was even included in the discussion. :-) And I must admit I do not see how this interpretation ought to be implied from what you wrote to describe alternative 1).
I assumed alternative 1) represented the rather more useful approach of letting the nodes report back *as soon as they have an improved mutation* – in contrast to alternative 2) where the nodes (needlessly) report back after each generation.
You do agree that if the nodes report back as soon as they find an improved mutation (which on average takes 10 mins for a single machine) then for 2 machines we get (just a bit over) 5 mins and for 4 machines we get roughly 2 mins 30 secs before one of the machine reports back a happy mutation?
And of course, if you force the nodes to run for 10 minutes and each node on average doesn’t find just one, but 10 or 100, selected mutations in that time, it will of course still give no perf gain. But again, I assumed this would not be an alternative under discussion.
“My problem ATM is that I’m unaware of any other way than those I describe to run this in parallell w/o breaking the original concept.”
Since having the nodes report back as soon as they found a selected mutation was what I originally suggested…you *did* try that, right? ;-)
(Again, forcing the nodes to run for a set time obviously won’t work, and having them report back every generation is definitely needlessly chatty)
The problem is that about 1 out of 10 mutations are positive in EvoLisa even when the image has progressed quite far.
If you have two nodes, then you will find an improvement in 1 out of 5 _global_ generations on average.
(where a global generation is when all the servers have run one generation; this is of course not a very exact value, nodes might have different speeds etc.)
if you have 5 nodes, you will find an improvement in 1 out of 2 global generations.
So even if you only report back when you find a successful mutation, the effect will be the same as if you report back after each generation once you have enough nodes.
That is, a new successful mutation will be reported every 1 or 2 generations and the new DNA will have to be distributed to all the nodes.
And that doesn’t work, since the overhead of passing the DNA around to each node is greater than just running a new generation on a single machine.
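For reference, the “at least one node improves this generation” chance follows directly from the 1-in-10 rate (the 1-out-of-5 and 1-out-of-2 figures above are the expected-count approximation, which is close for few nodes):

```python
def p_any_improvement(nodes, p_good=0.1):
    # chance that at least one of `nodes` independent mutations is positive
    return 1 - (1 - p_good) ** nodes

for n in (1, 2, 5, 10):
    print(n, round(p_any_improvement(n), 3))
# 2 nodes: ~0.19 (about 1 in 5), 5 nodes: ~0.41, 10 nodes: ~0.65
```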
“So even if you only report back when you find a successful mutation, the effect will be the same as if you report back after each generation once you have enough nodes.”
No, the effect will be way more chatty – something along the lines of 10 times more chatty. You will have 10 nodes reporting back per generation instead of one. This means that you will hit diminishing returns due to network overhead much sooner – and that is in fact your real issue, rather than anything that has to do with probability theory.
For a node not to report back when it finds an improvement would be very negligent of it. To report back when it hasn’t found an improvement would be completely unnecessary and very chatty. This is obvious and we don’t have to involve any probability theory.
Take alternative 1) again, with the (sensible) modification of nodes reporting back as they find improvements. You agree that one node takes 10 mins, 2 nodes take 5 mins and 4 nodes take 2.5 mins. Well, since the scalability of such a solution is so *good*, it will not take too long (not too many nodes) until you actually run into the point where the network overhead suddenly goes from completely negligible (on one node / 10 mins) to substantial. 8 nodes gives 1.25 mins, 16 nodes gives 0.625 mins, 32 nodes gives 0.3125 mins, 0.15625 mins, 64 nodes gives 0.078125 mins, 128 nodes gives 0.0390625 mins. That is, now the master should distribute improved DNA to 128 nodes once every 0.0390625 mins, which may well begin to be enough to impact the scalability.
So, even with a humongous job that takes 10 minutes to perform, you will pretty soon hit a wall of diminishing returns thanks to the network overhead, given a partitioning strategy with *near perfect* gains from parallelism!
Of course the same thing will happen with the EvoLisa jobs, and much sooner, since finding a selected mutation doesn’t take anywhere near 10 mins. (This also has very little to do with probability theory, btw. We could invoke some probability theory to show why two nodes will take slightly more than 5 mins to solve the job that one node did in 10 minutes, but we don’t need to dive into such detail to have this discussion.)
So (and this is what I am trying to say), the problem with alternative 1) – if interpreted as I propose, so each node reports back as soon as it finds an improvement – has nothing to do with probability theory, so you probably shouldn’t assert that it does in the blog post ;-) The “problem” is in fact that it scales so *well* that it will hit diminishing returns due to network overhead surprisingly fast…
I see I slipped when typing the list of halving execution times for doubling nodes (I halved the times once without doubling the nodes). So, while it doesn’t affect the argument, in the interest of correctness: 64 nodes gives 0.15625 mins, 128 nodes gives 0.078125 mins, 256 nodes gives 0.0390625 mins. And again, it is really silly of me to include all the decimals (I just copy-pasted from Calculator) when in fact all these numbers will be slightly higher if we do take probability theory into account.
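The corrected sequence is simply the single-node time divided by the node count (ideal scaling, ignoring the network overhead being discussed):

```python
def ideal_time_minutes(nodes, single_node_minutes=10.0):
    # near-perfect scaling: doubling the node count halves the expected time
    return single_node_minutes / nodes

for n in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    print(n, ideal_time_minutes(n))
# 64 nodes: 0.15625, 128 nodes: 0.078125, 256 nodes: 0.0390625
```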
I think your problem is you aren’t running for long enough. :-)
I have one running at 1 good mutation every few seconds; it’s a slowish machine but it’s up to 5,000,000 generations.
Another possibility is tuning the system to move in larger jumps: reduce the beneficial-mutations-per-generation rate but sustain the average drop in error per generation.
The only sticky point with communication is really ping. Data size can be tiny: you can send a seed for the successful random sequence. You could probably gain on a LAN that way.
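The seed-instead-of-DNA trick could be sketched like this. A hypothetical toy: the “DNA” is just a list of numbers and `mutate` nudges one gene, but the replay principle carries over to real polygon DNA:

```python
import random

def mutate(dna, rng):
    # toy mutation: nudge one randomly chosen gene
    i = rng.randrange(len(dna))
    out = list(dna)
    out[i] += rng.uniform(-1, 1)
    return out

def error(dna, target):
    return sum((a - b) ** 2 for a, b in zip(dna, target))

def worker_find_seed(dna, target):
    # try seeds until one produces an improving mutation, then send
    # only that seed (a few bytes) instead of the whole DNA
    seed = 0
    while error(mutate(dna, random.Random(seed)), target) >= error(dna, target):
        seed += 1
    return seed

def root_apply(dna, seed):
    # the root replays the same random sequence and gets identical DNA
    return mutate(dna, random.Random(seed))

dna, target = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
seed = worker_find_seed(dna, target)
assert error(root_apply(dna, seed), target) < error(dna, target)
```

Since the seed fully determines the mutation, the payload per report shrinks to a few bytes, leaving round-trip latency as the remaining cost.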
The other option is to go more complex, slow down so you can speed up again going parallel.
. Allow for effect polys (Gaussian blur, gradient, noise etc.)
. A better fitness function. YCrCb might be a boost. Going more human-perceptual rather than direct error allows for many CPU-eating options. That’s a field unto itself.
. Send everyone a different frame of video to do.
. Or go the whole hog and do it in 3D. Try to evolve some video frames as slices through a mass of 3D solids; that might slow things down a tad.
No numbers you can throw at me will change the fact that this solution is slower than a single machine due to the amount of data passed around.
I have tried this and it just doesn’t work.
The overhead of the communication is simply too big here.
“the overhead of the communication is simply too big here.”
Yes, I know – that’s exactly what I tried to explain that the problem is about, rather than anything having to do with probability theory.
“I have tried this and it just doesn’t work.”
Well, did you actually ever try the one where nodes report back when they find a better mutation?
If you have only tried 1) running each node for a set period of time and 2) having all nodes report back each generation, you have only tried the two options that most dramatically demonstrate the drawback of each side in the trade-off (too chatty vs too quiet), and it is very unsurprising that this would give bad results. 1) simply isn’t going to work at all whereas 2) is way too chatty. Still, even nodes reporting back on finds may well turn out to be too chatty before you can scale out to the desired target number of nodes – all the same, it really has nothing to do with probability theory.
Also, like Neil says, I too get about one good mutation every few seconds after not too many generations, so it would seem like you could scale out to at least a few nodes before you saw dramatically reduced ROI due to network overhead becoming a significant part of the equation. Trying to parallelize the first lot of generations where mutations are eagerly selected is perhaps not as important?
Well, probability theory gives linear scalability with the number of nodes if you take action after each generation, but leaving the nodes running on their own gives no real scalability improvement at all over time given that the success rate in each generation is independent of the previous result.
In order to maintain scalability one needs to seed an equal amount of new nodes from each node every generation or reseed the nodes from the best individual node.
The first option is not really one, and the second should be almost as impossible. Not only would it be a lot of communications overhead, but if the best choice is only detectable by letting each node run about 5,000,000 generations to see which one first hits an improvement towards the goal, it really is impossible.
If you instead reseed the nodes each time one of them has a successful mutation, the improvement depends on what the bell curve of the number of generations it takes to find a successful mutation looks like. But however it looks, there will be no linear improvement. Adding a few nodes will give some improvement (depending on how the bell curve looks) but adding more will give less and less improvement…
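This diminishing return can be demonstrated with a small simulation. A toy model, assuming each node independently has a 1-in-10 chance per generation and the cluster reseeds on the first success:

```python
import random

def generations_until_success(rng, p_good=0.1):
    # generations one node needs before it finds an improving mutation
    g = 1
    while rng.random() >= p_good:
        g += 1
    return g

def mean_cluster_generations(nodes, trials=2000, seed=7):
    # reseed-on-first-success: the cluster advances as soon as the
    # luckiest node finds an improvement
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += min(generations_until_success(rng) for _ in range(nodes))
    return total / trials

m1, m4, m16 = (mean_cluster_generations(n) for n in (1, 4, 16))
print(m1, m4, m16)  # roughly 10, 3 and 1.2 generations per improvement
```

Going from 1 to 4 nodes saves about seven generations per improvement, but going from 4 to 16 saves less than two, and the curve flattens toward one generation per improvement no matter how many nodes are added.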
(My original idea was also “batch” but I obviously didn’t think about it properly)
The third variant is hard to comment on since it is more dependent on technical details. One thing is that the nodes need to run “in step” so that the comparison summary does not become too complex over time; how is that handled? What about polygon clipping?
Doing away with the “root node” and letting every node broadcast to every other will increase bandwidth but give half the latency and which of those was the big problem in overhead? But this will obviously also increase the running “in step” problem.
“My current local version is reasonably fast, it does the same job in 7-8 minutes that the original code did in 3 hours.”
Is there any chance that you will share this updated code with the public?
Using the extra cores on dual core / quad core processors will give extra, low-latency nodes.