I adjusted the scaling constant C in my fitness function
The value of C that produces a minimal error E for the evaluation function in iCE is -0.20. In my earlier runs I had used a C of -0.58.
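For the curious, this is roughly the shape such a fitness function takes. The exact form used in iCE is not shown in this post, so the sigmoid with its 400 divisor, the Position layout and the names below are assumptions based on the common Texel tuning convention, a minimal sketch rather than iCE's actual code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One training position: the quiescence score (from White's point of
// view) and the game result R (1.0 = win, 0.5 = draw, 0.0 = loss).
struct Position {
    double qscore;
    double result;
};

// Map an engine score to an expected game result. With a negative C
// (e.g. -0.20) a large positive score approaches 1.0.
double sigmoid(double score, double C) {
    return 1.0 / (1.0 + std::pow(10.0, C * score / 400.0));
}

// Fitness E: mean squared error between predicted and actual results.
// A lower E means the evaluation predicts game outcomes better.
double fitnessE(const std::vector<Position>& set, double C) {
    double sum = 0.0;
    for (const Position& p : set) {
        double diff = p.result - sigmoid(p.qscore, C);
        sum += diff * diff;
    }
    return sum / static_cast<double>(set.size());
}
```

Tuning C then just means minimizing fitnessE over C while the evaluation weights stay fixed.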
With this new fitness function I repeated the tuning process following the Texel method, with a slight modification: instead of using a test set of 8 million positions I used only 7 million and kept the remaining 1 million for verification. I wanted to know whether the tuning produces an evaluation that only handles the positions used in the tuning well, or whether the new evaluation also handles positions it has not seen before.
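The post does not record how the position file was actually divided, so the shuffle, the fixed seed and the exact split sizes below are my assumptions; a minimal sketch of such a hold-out split could look like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Position { double qscore, result; };  // as in the sketch above

// Hypothetical split of the 8M position file: 7M go into the tuning
// set, the remainder is held back for verification.
void splitSets(std::vector<Position>& all,
               std::vector<Position>& tune,
               std::vector<Position>& verify) {
    std::mt19937 rng(42);                        // fixed seed, repeatable split
    std::shuffle(all.begin(), all.end(), rng);
    const std::size_t nTune = 7'000'000;
    tune.assign(all.begin(), all.begin() + nTune);
    verify.assign(all.begin() + nTune, all.end());
}
```

Logging fitnessE(tune, C) and fitnessE(verify, C) once per generation then yields the two families of curves in the plot below.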
[Plot: score E of the fitness function by generation; lower E indicates higher fitness]
The upper lines show the development of E against the 7M positions used in the actual tuning. The horizontal line is the E that the current development version of iCE produces for this set.
The lower lines show the development of E against the verification set of positions.
This nicely shows that the evaluation is not over-fitted to the positions it was tuned on; it also translates well to unrelated positions it has never seen.
Finally I played a 16,000-game match between the new version and my previous one, and it came out very close. The new one showed an improvement of 2 ELO, which is still within the error bar. But this time the tuning only took 2 days, a huge speedup compared to the 4 weeks I spent with a game-playing fitness function.
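As a rough back-of-the-envelope check that 2 ELO really sits inside the error bar of a 16,000-game match (the 30% draw rate and the near-equal score below are assumptions, not numbers from my testing setup):

```cpp
#include <cmath>
#include <cstdio>

// Approximate 95% error bar of a match result, expressed in Elo.
// Treats each game as an independent trial; draws narrow the bar.
int main() {
    const double games    = 16000.0;
    const double score    = 0.5;   // near-equal match
    const double drawRate = 0.30;  // assumed, not measured
    double winP  = score - drawRate / 2.0;
    double lossP = 1.0 - winP - drawRate;
    // Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    double var = winP * 1.0 + drawRate * 0.25 + lossP * 0.0 - score * score;
    double se  = std::sqrt(var / games);   // standard error of the match score
    // Near 50% a small score deviation converts to Elo at a rate of
    // 400 / (ln(10) * p * (1 - p)), about 695 Elo per unit of score.
    double eloPerScore = 400.0 / (std::log(10.0) * score * (1.0 - score));
    std::printf("95%% error bar: +/- %.1f Elo\n", 1.96 * se * eloPerScore);
    return 0;
}
```

With these assumptions the 95% interval comes out at roughly +/- 4.5 ELO, so a measured +2 is indeed not significant.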
One small thing still annoys me, though. I have an earlier dev version of iCE that still scores 12 ELO better than the one produced by this tuning. This earlier version has the advantage that its search parameters were tuned to fit the evaluation weights. The Texel method does not take this into account, which might explain the difference.
I will address that as I now move my focus a bit away from eval and onto search. The search framework is still the one from iCE 0.3 (just a bit better tuned), so it's worth spending a bit of time here too.