Saturday, April 13, 2013

Me vs. Evolution: 0 - 1

Now the results are in from some serious genetic evolutions. Here I used knockout tournaments between the individuals within one generation as the fitness function.

This is fundamentally different from my former approach, where I measured position-solving performance:
  • It is computationally much more expensive.
  • It is not reproducible. If the best individual of a generation were determined a second time, chances are very high that a different individual would be selected. This is caused by the high degree of randomness involved when playing only a small number of games.
  • There is no guarantee that the truly best individual wins a generation tournament. But chances are that at least a good one wins.
As the computational effort is so big, I have only performed two runs so far. In one I played a fixed number of games in each generation. In the other I scaled the number of games up with each generation. As the algorithm converges, the generated individuals get closer to each other, so it might be a good idea to play more games to have a better chance of still spotting the strongest one among them.
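To make the idea of a knockout tournament as fitness function concrete, here is a minimal sketch in Python. It is not the code used for these runs. The names `play_match` and `knockout` are made up for illustration, and each individual is reduced to a single hidden strength number so that matches can be simulated instead of actually played by an engine.

```python
import random

def play_match(a, b, games, rng):
    """Hypothetical stand-in for a real engine match between a and b.

    Here a and b are just hidden strength numbers (an assumption made
    for this sketch); every game is decided by a biased coin flip.
    """
    p_a = 1.0 / (1.0 + 10 ** ((b - a) / 400.0))
    wins_a = sum(1 for _ in range(games) if rng.random() < p_a)
    if wins_a * 2 == games:
        return rng.choice([a, b])  # tied match: decide by lot
    return a if wins_a * 2 > games else b

def knockout(population, games_per_match, rng):
    """Single-elimination tournament; returns the surviving individual."""
    players = list(population)
    rng.shuffle(players)  # random pairing
    while len(players) > 1:
        next_round = []
        if len(players) % 2 == 1:
            next_round.append(players.pop())  # odd one out gets a bye
        for i in range(0, len(players), 2):
            winner = play_match(players[i], players[i + 1], games_per_match, rng)
            next_round.append(winner)
        players = next_round
    return players[0]

if __name__ == "__main__":
    population = [0, 10, 20, 30, 40, 50, 60, 70]
    for seed in range(5):
        print(seed, knockout(population, 4, random.Random(seed)))
```

With only a few games per match and individuals that are close in strength, different seeds will often crown different winners, which is the reproducibility problem described above. Raising `games_per_match` in later generations corresponds to the second run.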

I call them evol-1 and evol-2.

                     Evol-1      Evol-2
Runtime in hrs          185         363
Generations           1.100       1.200
Total Games         184.800     362.345
EPDs solved*          2.646       2.652

*The number of solved positions out of a test set of 4.109 positions. This number is given to relate this GA to the previous test, where the number of solved positions was the fitness criterion. Both solutions are better than the untuned version, which scored only 2.437 points, but worse than the version optimized towards solving this set, which scored 2.811 points.

Entropy development of evol-1 and evol-2

Comparison of the ability to find correct positions in an EPD file
So from this data it looks like both versions performed very similarly. The fact that version 2 played many more games does not really show up. One explanation could be that the error bar for decisions in both versions is huge: a bit smaller for evol-2, but still huge. So twice as many games does not yet make a big difference.
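The error bar argument can be put into rough numbers. The sketch below is only an approximation under stated assumptions: two equally strong opponents, independent games, and the standard logistic Elo curve linearized around a 50% score. The function name `elo_margin` and its parameters are made up for this illustration.

```python
import math

def elo_margin(games, draw_ratio=0.0, z=1.96):
    """Approximate confidence margin (in Elo) of a match result.

    Assumes equal opponents, so the per-game score variance is
    (1 - draw_ratio) / 4, and converts score to Elo with the slope
    of the logistic Elo curve at 50%, which is 1600 / ln(10).
    z = 1.96 corresponds to roughly 95% confidence.
    """
    se_score = math.sqrt((1.0 - draw_ratio) / 4.0 / games)
    elo_per_score = 1600.0 / math.log(10)
    return z * se_score * elo_per_score

if __name__ == "__main__":
    for n in (10, 20, 100, 200, 1000):
        print(n, round(elo_margin(n, draw_ratio=0.3), 1))
```

The margin shrinks only with the square root of the number of games, so playing twice as many games reduces it by a factor of about 0.71. When the margin at a few games per match is far larger than the strength differences between individuals, it stays far larger after doubling.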

And finally the real test: a direct round-robin comparison between the two and the base version.

And the OSCAR goes to: evol-2

Rank Name         Elo    +    - games score oppo. draws
   1 ice.evol-2    89    5    5 12000   58%    35   32%
   2 ice.evol-1    69    5    5 12000   54%    44   33%
   3 ice.04         0    5    5 12000   39%    79   28%
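As a plausibility check, the Elo and score columns can be related with the standard logistic Elo formula. This is a sketch only: it assumes that the "oppo." column is the average rating of the opponents, and the rating tool that produced the table may use a somewhat different model, for example one that treats draws explicitly. The function name `expected_score` is chosen for this illustration.

```python
def expected_score(elo_diff):
    """Expected score of a player rated elo_diff points above the opponent
    (standard logistic Elo formula)."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

if __name__ == "__main__":
    # (name, Elo, average opponent Elo) as read from the table above
    rows = [("ice.evol-2", 89, 35), ("ice.evol-1", 69, 44), ("ice.04", 0, 79)]
    for name, elo, oppo in rows:
        print(f"{name:12s} {100 * expected_score(elo - oppo):.0f}%")
```

Rounded to whole percent this gives 58%, 54% and 39%, in line with the score column.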

Looks like all the effort finally paid off, especially considering that the base version is not really a weak engine either. It is a bit stronger than the official iCE 0.3, which is rated 2475 Elo in the CCRL.

Next I may manually tweak some of the weights from the final set, because some look suspicious. I wonder whether I'm able to score half a point back against the evolution ...