Saturday, September 28, 2013

Regression testing

Currently I rework my evaluation a bit. Nothing major.

I just add a simple pattern that might be missing or remove a pattern that might overlap with already existing ones.

This is usually a 5 minutes job of coding. The tricky part is to find out whether the change improves the engine or makes it weaker. As the changes are tiny the original and the patched engine are almost similar in strength, probably within a 5 ELO range.

The only way to get a feeling about the quality of the patch is to play an huge number of games.

I currently use the following test methodology.

  1. Code the change and test it for functional correctness. For evaluation I have a automated unit test that must be passed.
  2. Compile a release version of the engine.
  3. Setup a match using cutechess-cli as tournament manager. I use 4 parallel threads on my 4 core i7 and a huge balanced opening book where start postions are randomly selected. Each opening is played twice (each engine gets the chance to be white and black).
  4. I run 16.000 games and don't stop early.
  5. Feed the resulting match file into bayeselo to calculate the strength difference and statistical error bars.
  6. Draw the conclusion whether to keep or discard the change.
For most of the patches even 16.000 games are not enough to be sure about an improvement. After 16k games the error bars for both versions are -4 / +4. So in the worst case the stronger engine might be 4 ELO weaker than displayed, the weaker engine 4 ELO stronger than displayed. I need a 9 ELO difference to be sure the suspected stronger is really stronger.

If a patch shows 9 ELO or more improvement the decision is easy. In some case I might even keep a 5 ELO patch if it simplifies the evaluation. This is a bit risky of course.

On the other hand changing something in the evaluation brings the evaluation a bit out of its former balance. So a change that gives -2 ELO might still be very good if the balance of the evaluation is restored again.

With that reasoning I accept also some changes that are not an improvement beyond doubt. They might show their real potential if I re-tune the evaluation in a later phase.

When I'm done with my todo list of small changes I will run the final version vs the first version (iCE 1.0). Hopefully here I see then an improvement beyond doubt. Fingers crossed.

Tuesday, September 17, 2013

Collecting the Principle Variation (PV)

Since the early days of iCE I did not spent to much thought about collecting the principle variation (PV) for a given search. The PV represents the moves sequence from both sides in a given position.
The evaluation of the last position reached when the moves are played is the final resulting score of the just performed search.

My approach was just to collect the PV from the main transposition table. The way it works enables as a side effect to collect the PV by following the exact entries in it.

This is an easy and fast way without overhead. Unfortunately especially on long searches the PV is incomplete because
  • necessary entries were overwritten
  • the PV extends into the quiescence search (where I don't store hash entries)
This did not bother me to much. I considered displaying the PV a rather cosmetic thing anyway.

However for debugging reasons it is helpful to have the complete PV so it is possible to really track down the position whose eval is responsible for the final score. So recently I decided to change my implementation.

Now I collect the PV during search. This involves some storing and copying moves between arrays when a new best move is found.

// update the pv
pv[0] = ml->moves[i];
for (int j = 0; j < STATIC_QSEARCH_MAX_DEPTH; j++)

    // copy the moves from the subtree that this move generated to the pv until we hit an invalid move = end of list    
     if (INVALID_MOVE == (pv[j + 1] = subtree_pv[j])) break; 

I thought this to be really slow but the performance impact was almost not measurable.

Now iCE will display the correct and full PV much more often. There are still some rare cases left (Hash cutoffs in PV Nodes) where the PV is truncated. But this is rather rare.