My reference engine for measuring progress is the latest development version of iCE, whose weights have just been massively auto-tuned by game play. It played more than 700k games in that process, which took almost 4 weeks, so it is rather well tuned.
For the new method I collected a large set of games from the internet (CCRL, CEGT and others) and serialized them into FEN positions from won, lost or drawn games, labeled from White's point of view. I then filtered the positions: I removed positions from the opening, positions with a score above 600 cp and non-quiet positions. From the remaining FENs I selected a random sample of 8,000,000 positions.
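A minimal sketch of what that filtering step could look like. The struct fields, the opening-move limit of 10 and the use of the absolute score are my assumptions; only the three filter rules themselves come from the text above.

// Hedged sketch of the position filtering described above.
#include <cstdlib>
#include <string>
#include <vector>

struct RawPosition {
    std::string fen;        // position in FEN notation
    double      result;     // game outcome from White's point of view: 1.0, 0.5 or 0.0
    int         moveNumber; // move number in the source game
    int         scoreCp;    // score in centipawns
    bool        quiet;      // true if no captures or checks are pending
};

std::vector<RawPosition> filterPositions(const std::vector<RawPosition>& raw,
                                         int openingMoveLimit = 10)   // limit is an assumption
{
    std::vector<RawPosition> kept;
    for (const RawPosition& p : raw) {
        if (p.moveNumber <= openingMoveLimit) continue; // drop opening positions
        if (std::abs(p.scoreCp) > 600)        continue; // drop clearly decided positions
        if (!p.quiet)                         continue; // drop non-quiet positions
        kept.push_back(p);
    }
    return kept; // the 8,000,000 training FENs are a random sample of this set
}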
The fitness of an individual is now its evaluation error against this set. The smaller the error, the fitter the individual.
E = 1/N * sum(i=1,N, (result[i] - Sigmoid(qScore(pos[i])))^2)
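A small sketch of this fitness function in C++. qScore is the engine's quiescence evaluation from the formula above, passed in as a callable here so the sketch stays independent of engine internals; the exact sigmoid shape and its scaling constant K are assumptions (in Texel-style tuning K is usually fitted to the data beforehand), and the sample type is hypothetical.

#include <cmath>
#include <functional>
#include <string>
#include <vector>

struct Sample {
    std::string fen;    // position, kept as FEN here for simplicity
    double      result; // 1.0 = White won, 0.5 = draw, 0.0 = White lost
};

// Sigmoid mapping a centipawn score to an expected game result in [0,1].
// The scaling constant K is an assumption, not taken from the post.
double sigmoid(double scoreCp, double K = 1.0)
{
    return 1.0 / (1.0 + std::pow(10.0, -K * scoreCp / 400.0));
}

// E = 1/N * sum_i (result[i] - Sigmoid(qScore(pos[i])))^2
double evaluationError(const std::vector<Sample>& samples,
                       const std::function<double(const std::string&)>& qScore)
{
    double sum = 0.0;
    for (const Sample& s : samples) {
        const double diff = s.result - sigmoid(qScore(s.fen));
        sum += diff * diff;   // squared prediction error for this position
    }
    return sum / samples.size();
}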
I used my usual GA framework to optimize the weight set with this fitness function, and it converged rather nicely. The entropy of the population dropped steadily.
[Chart: Entropy of the system]
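The post does not spell out how the entropy and convergence figures are defined, so the following is only a plausible sketch for a population of 515-bit individuals: the sum of per-bit binary entropies and the average majority share per bit.

#include <algorithm>
#include <cmath>
#include <vector>

struct PopulationStats {
    double entropy;      // sum over all bits of -p*log2(p) - (1-p)*log2(1-p)
    double convergence;  // average fraction of the population agreeing on each bit
};

PopulationStats measure(const std::vector<std::vector<bool>>& population)
{
    const size_t bits = population.front().size();   // 515 bits in this run
    PopulationStats stats{0.0, 0.0};

    for (size_t b = 0; b < bits; ++b) {
        size_t ones = 0;
        for (const auto& individual : population)
            ones += individual[b] ? 1 : 0;

        const double p = static_cast<double>(ones) / population.size();
        if (p > 0.0 && p < 1.0)
            stats.entropy += -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
        stats.convergence += std::max(p, 1.0 - p);    // share of the majority value
    }
    stats.convergence /= bits;                        // e.g. 79.35% = most bits nearly decided
    return stats;
}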
The evaluation error also decreased with each additional generation, so the evolution kept producing fitter individuals.
[Chart: Evaluation error (fitness) - lower values indicate a fitter population]
It took less than 100 generations to drive the error below the error of the reference engine.
I noticed that the optimization heavily favored 0s. In my former GA runs with game play the bit convergence towards 0 or 1 was about even: of my set of 515 bits, half converged towards 1 and the other half towards 0.
In this case the evolution tried to eliminate evaluation features (setting the complete weight to 0), so the final evaluation is dominated almost entirely by material and piece mobility.
Convergence: 79.35% (0.088%) Entropy: 57.93 (-0.175)
[Histogram: distribution of converged bit values from 0 to 1 - a dominant peak of roughly 155 bits at 0, with only a few small bars elsewhere]
This can be caused by weight redundancy, but if two terms overlap it is unlikely that one is set to 0 and the other to full power; it is more likely that each term gets part of the weight. Also, the sheer number of weights being tossed out is suspicious. Maybe, if material and mobility are even, it is a good strategy to minimize the evaluation error by simply betting on a draw as the final game outcome: a score of 0 is mapped by the sigmoid to 0.5, which is exactly the draw result. Eliminating the smaller terms does just that: it pulls the score towards 0.
So I was curious how this version performs in a direct match against my reference engine. The bad news: it is a regression. The new version is about 20 ELO weaker than my reference engine. This is not so bad considering that the terms were derived without a single played game and in only 48 hours. Against a not fully tuned engine the process would probably do reasonably well.
Rank Name                      Elo    +    -  games  score  oppo.  draws
   1 iCE 2.0 v1543 x64/popcnt   11    5    5  14117    53%    -11    31%
   2 ice-g741                  -11    5    5  14117    47%     11    31%
What worries me more is the tendency of the process to eliminate the less important terms that contribute only small parts to the score. They are an essential part of the engine's personality and in total they are worth at least 20 ELO, so they should stay.
I may re-run the experiment later with a different training set. It certainly has potential.