Saturday, September 28, 2013

Regression testing

Currently I am reworking my evaluation a bit. Nothing major.

I just add a simple pattern that might be missing or remove a pattern that might overlap with existing ones.

This is usually a five-minute coding job. The tricky part is to find out whether the change improves the engine or makes it weaker. As the changes are tiny, the original and the patched engine are almost equal in strength, probably within a 5 ELO range.

The only way to get a feeling for the quality of the patch is to play a huge number of games.

I currently use the following test methodology.

  1. Code the change and test it for functional correctness. For evaluation changes I have an automated unit test that must pass.
  2. Compile a release version of the engine.
  3. Set up a match using cutechess-cli as the tournament manager. I use 4 parallel threads on my 4-core i7 and a huge balanced opening book from which starting positions are randomly selected. Each opening is played twice (each engine gets the chance to play white and black); a sketch of such a run is shown right after this list.
  4. I run 16.000 games and don't stop early.
  5. Feed the resulting match file into bayeselo to calculate the strength difference and statistical error bars.
  6. Draw the conclusion whether to keep or discard the change.
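
For illustration, here is a minimal sketch of how such a match (steps 3 and 4) could be launched from a small Python script. The engine paths, the book file and the time control are placeholders, and the cutechess-cli option names should be checked against the installed version.

import subprocess

GAMES_TOTAL = 16000      # total number of games, never stopped early
CONCURRENCY = 4          # four games in parallel on the 4-core i7

cmd = [
    "cutechess-cli",
    "-engine", "cmd=./ice_patched", "name=patched",
    "-engine", "cmd=./ice_base", "name=base",
    "-each", "proto=uci", "tc=10+0.1",                    # placeholder time control
    "-openings", "file=balanced_book.epd", "format=epd", "order=random",
    "-games", "2", "-repeat",                             # each opening twice, colors swapped
    "-rounds", str(GAMES_TOTAL // 2),
    "-concurrency", str(CONCURRENCY),
    "-pgnout", "match_result.pgn",                        # later fed into bayeselo
]
subprocess.run(cmd, check=True)
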
For most of the patches even 16.000 games are not enough to be sure about an improvement. After 16k games the error bars for both versions are -4 / +4. So in the worst case the stronger engine might be 4 ELO weaker than displayed and the weaker engine 4 ELO stronger than displayed. I therefore need a difference of about 9 ELO to be sure the supposedly stronger engine really is stronger.
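
As a rough sanity check (this is not how bayeselo actually computes its Bayesian estimates), the following sketch uses a simple normal approximation with an assumed draw rate to show why roughly 16k games still leave an uncertainty of a few ELO.

import math

def score_to_elo(score):
    # Convert a score fraction (0..1) into an ELO difference.
    return -400.0 * math.log10(1.0 / score - 1.0)

games     = 16000
draw_rate = 0.35                  # assumed draw rate, not a measured value
score     = 0.5                   # two almost equal versions

# Per-game variance of a result (1, 0.5 or 0) at an even score:
# wins and losses each deviate by 0.5 from the mean, draws not at all.
win_rate = (1.0 - draw_rate) / 2.0
var = 2.0 * win_rate * 0.25
se  = math.sqrt(var / games)      # standard error of the score fraction

# Translate +/- 2 standard errors into an ELO interval around equality.
low  = score_to_elo(score - 2.0 * se)
high = score_to_elo(score + 2.0 * se)
print(f"~95% interval: {low:+.1f} .. {high:+.1f} ELO")
# Prints roughly -4.4 .. +4.4 ELO, the same ballpark as the -4/+4 error
# bars reported by bayeselo, and the reason why a measured gap of about
# 9 ELO is needed before the result is conclusive.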


If a patch shows an improvement of 9 ELO or more, the decision is easy. In some cases I might even keep a 5 ELO patch if it simplifies the evaluation. This is a bit risky of course.

On the other hand, changing something in the evaluation brings it a bit out of its former balance. So a change that gives -2 ELO might still be very good once that balance is restored.

With that reasoning I also accept some changes that are not improvements beyond doubt. They might show their real potential when I re-tune the evaluation in a later phase.

When I'm done with my todo list of small changes I will run the final version against the first version (iCE 1.0). Hopefully then I will see an improvement beyond doubt. Fingers crossed.



5 comments:

  1. I think your regression test methodology is right, but if you start making modifications in the search too, I'd add some fixed-position testing. I found that some modifications to my code increased its strength but also added a bug, and the bt2450.epd suite let me notice this bug.

    1. Hi Marco, thanks for the suggestion. I actually do that already. I compile the whole 1300 positions from the STS suite into the development versions of iCE, which allows me to quickly check whether a change breaks something.

      It looks like this (picking 20 positions and searching each for 1 sec):

      iCE 2.0 v322 x32 [2013.10.1]
      benchmark -cnt 20 -time 1000
      Benchmark run 4-10-2013-14-51 with engine iCE 2.0 v322 x32
      Running benchmark on 20 positions for 1.000 ms
      Estimated running time 00:00:20

      Pos EPD bm ice Y/N ms ply nodes pts score
      0 f4f5 f4f5 Yes 0 13 970 kN 10 cp 224
      1 a5a4 a5a4 Yes 0 14 900 kN 10 cp -12
      2 h6e3 h6e3 Yes 0 16 1.200 kN 10 cp 318
      3 a8d8 a8d8 Yes 0 12 950 kN 10 cp -35
      4 e6f4 e6f4 Yes 0 16 1.130 kN 10 cp 90
      5 a8d5 g6e6 No 983 13 1.010 kN 4 cp -13
      6 f6g5 f6g5 Yes 140 15 1.090 kN 10 cp 83
      7 d6f4 d6f4 Yes 63 12 840 kN 10 cp 133
      8 c5d4 e5d4 No 983 14 920 kN 4 cp 56
      9 d5b6 c6b6 No 998 14 980 kN 0 cp 137
      10 c3e4 c3e4 Yes 0 13 970 kN 10 cp 8
      11 h4h5 g1f1 No 983 11 800 kN 5 cp -10
      12 h3h4 d1g4 No 998 14 930 kN 5 cp 64
      13 c2c4 f5g6 No 983 10 770 kN 4 cp -46
      14 b8b2 b8a8 No 983 13 1.020 kN 2 cp 11
      15 g4f3 b4b3 No 982 14 910 kN 1 cp -52
      16 g1h2 g1h2 Yes 0 16 1.110 kN 10 cp -17
      17 g5e4 g5e4 Yes 0 13 910 kN 10 cp 110
      18 d3e4 d3e4 Yes 125 12 880 kN 10 cp 174
      19 e2e4 e2e4 Yes 62 14 960 kN 10 cp 207

      Summary:
      ------------------------------------------

      Benchmark run on 20 positions for 1.000 ms
      Total Benchmark Time 19 sec
      Solved positions : 12/20 (60%)
      Awarded points : 145/200 (72%)
      Good moves (1+ pts): 19/20 (95%)
      Avg. solution time : 0 sec
      Total Nodes : 19.250 kN
      Avg. speed : 975.476 nps
      Avg. reached ply : 13.45 ply

      Sub Totals per Module
      Modul 0 Solved : 2/2 (100%)
      Modul 1 Solved : 2/2 (100%)
      Modul 2 Solved : 1/2 (50%)
      Modul 3 Solved : 2/2 (100%)
      Modul 4 Solved : 0/2 (0%)
      Modul 5 Solved : 1/2 (50%)
      Modul 6 Solved : 0/2 (0%)
      Modul 7 Solved : 0/2 (0%)
      Modul 8 Solved : 2/2 (100%)
      Modul 9 Solved : 2/2 (100%)
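
      To make that concrete, here is a minimal sketch of such a fixed-time benchmark loop (illustrative Python; the engine interface and the per-position points table are hypothetical, not iCE's actual code).

      def run_benchmark(engine, positions, ms_per_pos=1000):
          # positions: list of (fen, expected_move, points_by_move) tuples, where
          # points_by_move maps a move to its STS-style score (10 = best move).
          solved = 0
          total_points = 0
          for fen, expected_move, points_by_move in positions:
              move = engine.search(fen, ms_per_pos)    # hypothetical fixed 1-second search
              total_points += points_by_move.get(move, 0)
              solved += (move == expected_move)
          n = len(positions)
          print(f"Solved positions : {solved}/{n} ({100 * solved // n}%)")
          print(f"Awarded points   : {total_points}/{10 * n} ({100 * total_points // (10 * n)}%)")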

  2. Can I suggest that you use a more predictable test set without a random choice of positions?
    P.S. I actually don't do what I'm suggesting myself, and I have just found a nasty bug (still to be solved) that was added somewhere while adding a feature.

    1. Hi Marco, it is deterministic (I was unclear in my response, sorry). If I run the test on 20 positions it always runs on the same 20.

      step count = 1300 / test positions --> step count = 1300 / 20 = 65
      positions: 0, 65, 130, 195, 260 ... 1235 are selected and tested

      if I want to test 50 positions, it would be 0, 26, 52 ... 1274

      For a very quick test I usually take 20 positions, for an average test 100 and for something serious the whole 1300.
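
      A minimal sketch of that selection rule (illustrative Python, the names are made up):

      def select_positions(all_positions, count):
          # Deterministic stride selection: with 1300 positions and count = 20
          # the step is 65 and indices 0, 65, 130, ..., 1235 are picked;
          # with count = 50 the step is 26 and 0, 26, 52, ..., 1274 are picked.
          step = len(all_positions) // count
          return [all_positions[i * step] for i in range(count)]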

      Thomas...
