Saturday, September 28, 2013

Regression testing

Currently I rework my evaluation a bit. Nothing major.

I just add a simple pattern that might be missing or remove a pattern that might overlap with already existing ones.

This is usually a 5 minutes job of coding. The tricky part is to find out whether the change improves the engine or makes it weaker. As the changes are tiny the original and the patched engine are almost similar in strength, probably within a 5 ELO range.

The only way to get a feeling about the quality of the patch is to play an huge number of games.

I currently use the following test methodology.

  1. Code the change and test it for functional correctness. For evaluation I have a automated unit test that must be passed.
  2. Compile a release version of the engine.
  3. Setup a match using cutechess-cli as tournament manager. I use 4 parallel threads on my 4 core i7 and a huge balanced opening book where start postions are randomly selected. Each opening is played twice (each engine gets the chance to be white and black).
  4. I run 16.000 games and don't stop early.
  5. Feed the resulting match file into bayeselo to calculate the strength difference and statistical error bars.
  6. Draw the conclusion whether to keep or discard the change.
For most of the patches even 16.000 games are not enough to be sure about an improvement. After 16k games the error bars for both versions are -4 / +4. So in the worst case the stronger engine might be 4 ELO weaker than displayed, the weaker engine 4 ELO stronger than displayed. I need a 9 ELO difference to be sure the suspected stronger is really stronger.

If a patch shows 9 ELO or more improvement the decision is easy. In some case I might even keep a 5 ELO patch if it simplifies the evaluation. This is a bit risky of course.

On the other hand changing something in the evaluation brings the evaluation a bit out of its former balance. So a change that gives -2 ELO might still be very good if the balance of the evaluation is restored again.

With that reasoning I accept also some changes that are not an improvement beyond doubt. They might show their real potential if I re-tune the evaluation in a later phase.

When I'm done with my todo list of small changes I will run the final version vs the first version (iCE 1.0). Hopefully here I see then an improvement beyond doubt. Fingers crossed.