Saturday, March 22, 2014

Some more tuning results

I just realized that my tuning setup using the Texel method had a flaw. A score of +2 in iCE corresponds roughly to a score of +1 in other engines. I must take this difference into account for the tuning to work.

I adjusted the scaling constant C in my fitness function accordingly.

The best value for C to produce a minimal E given the evaluation function in iCE is -0.20. In my former runs I used a C of -0.58.
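With the fitness function written out, finding C is a one-dimensional minimization. The sketch below assumes the common logistic shape 1 / (1 + e^(C·s)) with the score s in pawns (C = -ln(10)/4 ≈ -0.58 reproduces the classic pawn-advantage curve, matching the former value); iCE's exact function is not shown in the post, so this is an illustrative reconstruction:

```python
import math

def sigmoid(score, C):
    """Winning probability for a score in pawns, assuming the logistic
    form 1 / (1 + e^(C*score)); C = -ln(10)/4 ~ -0.58 reproduces the
    classic pawn-advantage curve.  iCE's exact function is not shown."""
    return 1.0 / (1.0 + math.exp(C * score))

def E(data, C):
    """Mean squared error between game results and predicted probabilities."""
    return sum((result - sigmoid(score, C)) ** 2 for score, result in data) / len(data)

def best_C(data, grid):
    """Grid search for the scaling constant with minimal E."""
    return min(grid, key=lambda c: E(data, c))

# deterministic check: the "results" are noise-free probabilities taken
# from a C = -0.20 curve, so the grid search must recover that constant
scores = [i / 10.0 - 3.0 for i in range(61)]          # -3.0 .. +3.0 pawns
data = [(s, sigmoid(s, -0.20)) for s in scores]
grid = [round(-0.02 * k, 2) for k in range(1, 40)]    # -0.02 .. -0.78
print(best_C(data, grid))                             # → -0.2
```

A coarse grid is enough here because E is smooth and has a single minimum in C for a fixed evaluation.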

With this new function I repeated the tuning process following the Texel method with a slight modification. Instead of using a test set of 8 million positions I only used 7 million. With the remaining positions I performed a verification: I wanted to know whether the tuning produces an evaluation that only handles the positions used in the tuning well, or whether the new evaluation also handles positions it has not seen before.
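The hold-out split can be sketched like this (scaled down to 8.000 positions; the function name and seed are illustrative):

```python
import random

def split_positions(positions, n_train, seed=42):
    """Shuffle once with a fixed seed, keep the first n_train positions
    for tuning and the remainder as an unseen verification set.
    Scaled-down sketch; the post uses 7M + 1M positions."""
    pool = list(positions)
    random.Random(seed).shuffle(pool)
    return pool[:n_train], pool[n_train:]

train, verify = split_positions(range(8000), 7000)   # 8k stands in for 8M
```

Tracking E on both sets per generation is what produces the two families of curves in the plot below.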

Score E of the fitness function by generation - Lower E indicates higher fitness

The upper lines show the development of E against the 7M positions used in the actual tuning. The horizontal line is the E that the current development version of iCE produces for this set.

The lower lines show the development of E against the verification set of positions.

It nicely shows that the evaluation is not over-fitted to the positions it sees during tuning but also generalizes well to unrelated positions it has not seen before.

Finally I played a 16.000 game match between the new version and my previous one and it came out very close. The new one showed an improvement of 2 ELO, which is still within the error bar. But this time the tuning only took 2 days, a huge speedup compared to the 4 weeks I spent with a game-playing fitness function.
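Whether 2 ELO sits inside the error bar of a 16.000 game match can be estimated with the usual logistic Elo model; the 31% draw ratio (borrowed from the match table of the earlier experiment below) and the 95% confidence level are assumptions:

```python
import math

def elo_from_score(p):
    # standard logistic Elo model: p = 1 / (1 + 10^(-elo/400))
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_error_bar(n_games, draw_ratio=0.31, z=1.96):
    """Approximate 95% error bar in ELO for an evenly matched pair.
    Treating the draw ratio as a fixed 31% is a simplifying assumption."""
    var = 0.25 - draw_ratio / 4.0          # per-game score variance at p = 0.5
    se = math.sqrt(var / n_games)          # standard error of the match score
    slope = 400.0 / (math.log(10) * 0.25)  # d(elo)/d(p) at p = 0.5
    return z * se * slope

print(round(elo_error_bar(16000), 1))      # → 4.5, so +2 ELO is inside
```

Draws shrink the per-game variance, which is why a high draw ratio tightens the error bar slightly.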

There is one small thing that still annoys me: I have an earlier dev version of iCE that scores 12 ELO better than the one produced by this tuning. That earlier version has the advantage that its search parameters were tuned to fit its evaluation weights, something the Texel method does not take into account. This might explain the difference.

I will address that when I now shift my focus a bit away from eval and onto search. The search framework is still the one from iCE 0.3 (just a bit better tuned), so it is worth spending some time here too.

Sunday, March 16, 2014

Pawn Advantage in iCE

The Texel tuning method tries to minimize the evaluation error for a set of positions. The error is minimal if the score of a position is a good prediction of the outcome of the game the position is from.

There is a formula in the chess programming wiki that helps to translate the score (usually given in fractions of a pawn) into a winning probability. Plotted as a graph it looks similar to the one below.
A score of +1 relates to a winning percentage of 79% and +2 to 93%.
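Working backwards from the quoted percentages, one logistic curve consistent with them can be written down; the constant k below is derived from the 79% figure rather than taken from the wiki:

```python
import math

# solving p(s) = 1 / (1 + e^(-k*s)) for p(1) = 0.79 gives k = ln(0.79/0.21);
# this is an illustrative fit to the two quoted data points, not the wiki's
# exact formula
k = math.log(0.79 / 0.21)

def win_probability(score):
    """Winning probability for a score given in pawns."""
    return 1.0 / (1.0 + math.exp(-k * score))

print(round(win_probability(1) * 100))   # → 79
print(round(win_probability(2) * 100))   # → 93
```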

I verified whether this distribution applies to the evaluation of iCE and found out that it does not. I used several sets of a million positions each and calculated the function that fits the winning percentage distribution best. The calculated optimal function coefficients differed depending on the set of test positions, but when plotted as graphs they were all almost identical.

The blue line is the original graph from above, the three others represent the winning probability by score in iCE for three different sets of positions. 

A +2 score only indicates a 72% winning percentage.

I should take that into account in future tuning exercises.

Monday, March 10, 2014

The Texel way of tuning

In a recent discussion at Talkchess Peter Österlund, the author of the chess engine "Texel", explained his way of auto-tuning his engine. He was able to improve his engine significantly with this method. I thought it was worth giving it a try with my engine too.

My reference engine to measure progress against is the latest development version of iCE, whose weights have just been massively auto-tuned by game play. It played more than 700k games in that process, which took almost 4 weeks. So it is rather well tuned.

For the new method I collected a large set of games from the internet (CCRL, CEGT and others) and serialized them into FEN positions from won, lost or drawn games, from White's point of view. I then filtered the positions: I removed positions from the opening, positions with a score > 600 cp and non-quiet positions. From the remaining FENs I selected a random sample of 8.000.000 positions.
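The filtering step can be sketched as follows. The record layout, the 20-ply opening cutoff and the sampling seed are illustrative assumptions, since the post only names the filter criteria:

```python
import random

# Hypothetical record layout: (fen, ply, score_cp, is_quiet, result).
# Field names and the 20-ply opening cutoff are illustrative; the post
# does not state iCE's exact thresholds or storage format.
def keep(pos, opening_plies=20, max_score_cp=600):
    _, ply, score_cp, is_quiet, _ = pos
    return ply > opening_plies and abs(score_cp) <= max_score_cp and is_quiet

def build_training_set(positions, sample_size, seed=7):
    """Filter, then draw a reproducible random sample."""
    filtered = [p for p in positions if keep(p)]
    return random.Random(seed).sample(filtered, min(sample_size, len(filtered)))
```

Filtering out non-quiet positions matters because the fitness function scores positions with a quiescence search rather than a full search.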

The fitness of a certain weight set is now the evaluation error against this set of positions. The smaller the error, the fitter the weight set.

E = 1/N * sum(i=1,N, (result[i] - Sigmoid(qScore(pos[i])))^2)

I used my usual GA framework to optimize the weights with this fitness function and it converged rather nicely. The entropy dropped steadily.

Entropy of the system

Also the evaluation error decreased with each additional generation, so the evolution produced fitter individuals each generation.

Evaluation error (fitness) - Lower values indicate a fitter population

It took less than 100 generations to drive the error below the error of the reference engine.

I noticed that the optimization heavily favored 0s. In my former GA runs with game play the bit convergence towards 0 or 1 was about even: from my set of 515 bits, half converged towards 1 and the other half towards 0.

In this case the evolution tried to eliminate evaluation features (setting the complete weight to 0), so the final evaluation is dominated by material and piece mobility alone.

     Convergence:  79.35% (0.088%)     Entropy:  57.93 (-0.175)
 155 +█
     |█                                                █
     |█                                                █
     |█▄▄                                         ▄    █
   0 +-----+----+----+----+----+----+----+----+----+----+
      0   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9   1
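The Convergence and Entropy figures printed above can be derived from the per-bit frequencies of the population. The exact definitions iCE uses are not given, so the 90% agreement threshold below is an assumption:

```python
import math

def population_stats(population, threshold=0.9):
    """Summary statistics for a GA population of equal-length bit vectors.
    A bit counts as converged when at least `threshold` of the population
    agrees on its value; entropy is the summed per-bit binary entropy.
    Both definitions are assumptions about what iCE reports."""
    n = len(population)
    freq = [sum(ind[i] for ind in population) / n for i in range(len(population[0]))]
    def h(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    entropy = sum(h(p) for p in freq)
    to_zero = sum(p <= 1.0 - threshold for p in freq)
    to_one = sum(p >= threshold for p in freq)
    return entropy, to_zero, to_one
```

Counting bits converged towards 0 separately from those converged towards 1 is exactly what exposes the bias described above.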

This can be caused by weight redundancy, but if two terms overlap it is unlikely that one is set to 0 and the other to full power; it is more likely that each term gets part of the weight. Also the magnitude of the weights being tossed out is suspicious. Maybe, if material and mobility are even, a good strategy to minimize the evaluation error is to just bet on a draw as the final game outcome. Eliminating the smaller terms does just that: it draws the score towards 0.

So I was curious how this version performs in a direct match against my reference engine. The bad news: it is a regression. The new version is about 20 ELO weaker than my reference engine. This is not so bad considering that the weights were derived without a single played game and in only 48 hours. Against a not fully tuned engine the process would probably do reasonably well.

Rank Name                       Elo    +    - games score oppo. draws
   1 iCE 2.0 v1543 x64/popcnt    11    5    5 14117   53%   -11   31%
   2 ice-g741                   -11    5    5 14117   47%    11   31%

What worries me more is the tendency of the process to eliminate the less important terms that contribute only small parts to the score. They are an essential part of the personality of the engine and in total they are worth at least 20 ELO, so they should stay.

I may re-run the experiment later with a different training set. It certainly has potential.

Sunday, March 2, 2014

Tablebase generation on the fly - Part2

I worked some more on the generation algorithm and was able to speed it up further. The generation time for the KPK table dropped from 171 ms to 31 ms. This is now fast enough to generate all 3-men tables at engine startup.
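The post doesn't show the generation code; below is a minimal sketch of the core idea, retrograde analysis (backward induction over the game graph). The abstract graph representation and the single-pile Nim toy are illustrative stand-ins for the real KPK board encoding:

```python
def retrograde_solve(successors):
    """Generic retrograde analysis on a game graph given as
    {position: [positions reachable in one move]}.  "LOSS" means lost
    for the side to move at that node; positions with no moves count
    as lost (checkmate analogue - stalemate handling is omitted)."""
    result = {p: "LOSS" for p, succ in successors.items() if not succ}
    changed = True
    while changed:
        changed = False
        for p, succ in successors.items():
            if p in result or not succ:
                continue
            # win if any move leads to a position lost for the opponent
            if any(result.get(q) == "LOSS" for q in succ):
                result[p] = "WIN"
                changed = True
            # loss if every move leads to a position won for the opponent
            elif all(result.get(q) == "WIN" for q in succ):
                result[p] = "LOSS"
                changed = True
    # positions never classified can force neither outcome: draws
    return {p: result.get(p, "DRAW") for p in successors}

# toy check: single-pile Nim, take 1 or 2 stones, empty pile loses;
# theory says multiples of 3 are lost for the side to move
nim = {n: [n - k for k in (1, 2) if n - k >= 0] for n in range(12)}
solved = retrograde_solve(nim)
```

Each pass of the while loop corresponds to one iteration of the generator (29 for KPK in the output below), propagating mate distances one ply further back.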

I integrated the code into iCE and removed the built-in tables. On my i7 the complete generation takes 78 ms, which I consider acceptable. On a slower system, and as 32-bit code, it takes an also acceptable 125 ms.

Surprisingly, the difference between the debug and the release version of the code is about a factor of 40. I cannot really explain why it is so big.

TableBase: KPK   Generation Time:  0.031 sec   Iterations: 29

STM           WHITE     BLACK                    WHITE     BLACK
ILLEGAL      49.279    32.944      UNKNOWN           0         0
MATED             0        42      DRAWS        19.184    36.206

MATE IN   1      40         0      MATED IN  1       0        64
MATE IN   2      97         0      MATED IN  2       0       163
MATE IN   3     219         0      MATED IN  3       0       406
MATE IN   4     422         0      MATED IN  4       0       821
MATE IN   5     915         0      MATED IN  5       0     1.698
MATE IN   6   1.636         0      MATED IN  6       0     3.124
MATE IN   7   3.121         0      MATED IN  7       0     5.841
MATE IN   8   5.647         0      MATED IN  8       0     7.603
MATE IN   9   7.541         0      MATED IN  9       0     8.110
MATE IN  10   8.395         0      MATED IN 10       0     7.857
MATE IN  11   8.601         0      MATED IN 11       0     7.215
MATE IN  12   8.178         0      MATED IN 12       0     5.843
MATE IN  13   6.719         0      MATED IN 13       0     4.185
MATE IN  14   3.829         0      MATED IN 14       0     2.501
MATE IN  15   1.065         0      MATED IN 15       0     1.026
MATE IN  16   1.154         0      MATED IN 16       0     1.194
MATE IN  17   1.066         0      MATED IN 17       0       902
MATE IN  18     871         0      MATED IN 18       0       711
MATE IN  19     658         0      MATED IN 19       0       597
MATE IN  20     558         0      MATED IN 20       0       436
MATE IN  21     606         0      MATED IN 21       0       565
MATE IN  22     562         0      MATED IN 22       0       430
MATE IN  23     343         0      MATED IN 23       0       292
MATE IN  24     144         0      MATED IN 24       0       109
MATE IN  25      64         0      MATED IN 25       0        31
MATE IN  26      19         0      MATED IN 26       0        14
MATE IN  27       7         0      MATED IN 27       0         4
MATE IN  28       3         0      MATED IN 28       0         2