I just add a simple pattern that might be missing or remove a pattern that might overlap with already existing ones.
This is usually a five-minute coding job. The tricky part is finding out whether the change improves the engine or makes it weaker. As the changes are tiny, the original and the patched engine are almost equal in strength, probably within a 5 ELO range.
The only way to get a feeling for the quality of a patch is to play a huge number of games.
I currently use the following test methodology.
- Code the change and test it for functional correctness. For evaluation changes I have an automated unit test that must pass.
- Compile a release version of the engine.
- Set up a match using cutechess-cli as the tournament manager. I use 4 parallel threads on my 4-core i7 and a huge balanced opening book from which starting positions are randomly selected. Each opening is played twice (each engine gets the chance to play it as White and as Black).
- I run 16,000 games and don't stop early.
- Feed the resulting match file into bayeselo to calculate the strength difference and the statistical error bars.
- Draw the conclusion whether to keep or discard the change.
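The match setup described in the steps above might look roughly like the following command. This is a sketch, not my exact invocation: the engine binaries, book file, and time control are placeholders, and the flag names follow cutechess-cli's documented interface.

```shell
# Hypothetical invocation -- adjust engine paths, book and time control.
cutechess-cli \
  -engine cmd=./ice_patched name=patched \
  -engine cmd=./ice_base    name=base \
  -each proto=uci tc=40/60 \
  -concurrency 4 \
  -openings file=book.pgn order=random \
  -repeat -rounds 8000 -games 2 \
  -pgnout match.pgn

# match.pgn is then fed into bayeselo, e.g. with the commands:
#   readpgn match.pgn
#   elo
#   mm
#   ratings
```

The -repeat flag together with -games 2 per round gives each engine both colors of every selected opening, which cancels out opening bias.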
If a patch shows an improvement of 9 ELO or more, the decision is easy. In some cases I might even keep a 5 ELO patch if it simplifies the evaluation. This is a bit risky, of course.
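Why so many games are needed can be estimated with a back-of-the-envelope calculation. A small sketch (the 51% score and 40% draw ratio are illustrative assumptions, not measured values from my matches):

```python
import math

def elo_error_bar(games, score=0.51, draw_ratio=0.40, z=1.96):
    """Approximate the 95% error bar (in Elo) of a match result.

    score:      average points per game of the test engine (0..1)
    draw_ratio: fraction of draws (draws shrink the per-game variance)
    """
    win = score - draw_ratio / 2               # fraction of wins
    # variance of the per-game score x in {0, 0.5, 1}
    var = win + draw_ratio / 4 - score ** 2
    se_score = math.sqrt(var / games)          # standard error of the mean score
    # convert to Elo via the slope of the logistic Elo curve at `score`
    delo_dscore = 400 / (math.log(10) * score * (1 - score))
    return z * delo_dscore * se_score

print(round(elo_error_bar(16000), 1))   # roughly +/- 4 Elo
print(round(elo_error_bar(1000), 1))    # roughly +/- 17 Elo
```

With these assumptions, 16,000 games narrow the error bar to roughly plus/minus 4 Elo, which is just about enough to separate engines that are "within a 5 ELO range"; a 1,000-game match could not tell them apart at all.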
On the other hand, changing something in the evaluation pushes it a bit out of its former balance. So a change that gives -2 ELO might still be very good if the balance of the evaluation is later restored.
With that reasoning I also accept some changes that are not an improvement beyond doubt. They might show their real potential when I re-tune the evaluation in a later phase.
When I'm done with my to-do list of small changes, I will run the final version against the first version (iCE 1.0). Hopefully I will then see an improvement beyond doubt. Fingers crossed.
I think your regression test methodology is right, but if you start making modifications in the search too, I'd add some fixed-position testing. I found that some modifications to my code increased its strength but also introduced a bug, and the bt2450.epd suite let me notice it.
Hi Marco, thanks for the suggestion. I'll actually do that. In the development versions of iCE I compile in the whole set of 1300 positions from the STS suite, which allows me to check quickly whether something breaks with a change.
It looks like this (picking 20 positions and searching each for 1 second):
iCE 2.0 v322 x32 [2013.10.1]
benchmark -cnt 20 -time 1000
Benchmark run 4-10-2013-14-51 with engine iCE 2.0 v322 x32
Running benchmark on 20 positions for 1.000 ms
Estimated running time 00:00:20
Pos EPD bm ice Y/N ms ply nodes pts score
0 f4f5 f4f5 Yes 0 13 970 kN 10 cp 224
1 a5a4 a5a4 Yes 0 14 900 kN 10 cp -12
2 h6e3 h6e3 Yes 0 16 1.200 kN 10 cp 318
3 a8d8 a8d8 Yes 0 12 950 kN 10 cp -35
4 e6f4 e6f4 Yes 0 16 1.130 kN 10 cp 90
5 a8d5 g6e6 No 983 13 1.010 kN 4 cp -13
6 f6g5 f6g5 Yes 140 15 1.090 kN 10 cp 83
7 d6f4 d6f4 Yes 63 12 840 kN 10 cp 133
8 c5d4 e5d4 No 983 14 920 kN 4 cp 56
9 d5b6 c6b6 No 998 14 980 kN 0 cp 137
10 c3e4 c3e4 Yes 0 13 970 kN 10 cp 8
11 h4h5 g1f1 No 983 11 800 kN 5 cp -10
12 h3h4 d1g4 No 998 14 930 kN 5 cp 64
13 c2c4 f5g6 No 983 10 770 kN 4 cp -46
14 b8b2 b8a8 No 983 13 1.020 kN 2 cp 11
15 g4f3 b4b3 No 982 14 910 kN 1 cp -52
16 g1h2 g1h2 Yes 0 16 1.110 kN 10 cp -17
17 g5e4 g5e4 Yes 0 13 910 kN 10 cp 110
18 d3e4 d3e4 Yes 125 12 880 kN 10 cp 174
19 e2e4 e2e4 Yes 62 14 960 kN 10 cp 207
Summary:
------------------------------------------
Benchmark run on 20 positions for 1.000 ms
Total Benchmark Time 19 sec
Solved positions : 12/20 (60%)
Awarded points : 145/200 (72%)
Good moves (1+ pts): 19/20 (95%)
Avg. solution time : 0 sec
Total Nodes : 19.250 kN
Avg. speed : 975.476 nps
Avg. reached ply : 13.45 ply
Sub Totals per Module
Modul 0 Solved : 2/2 (100%)
Modul 1 Solved : 2/2 (100%)
Modul 2 Solved : 1/2 (50%)
Modul 3 Solved : 2/2 (100%)
Modul 4 Solved : 0/2 (0%)
Modul 5 Solved : 1/2 (50%)
Modul 6 Solved : 0/2 (0%)
Modul 7 Solved : 0/2 (0%)
Modul 8 Solved : 2/2 (100%)
Modul 9 Solved : 2/2 (100%)
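The summary figures can be reproduced from the per-position table above. A small sketch, with the pts column copied from the run (my assumption: "solved" means the engine's move earned the full 10 points, and any position scoring at least 1 point counts as a "good move" under STS-style partial credit):

```python
# pts per position, copied from the table above (positions 0..19)
pts = [10, 10, 10, 10, 10, 4, 10, 10, 4, 0,
       10, 5, 5, 4, 2, 1, 10, 10, 10, 10]

solved = sum(p == 10 for p in pts)   # best move found
good = sum(p >= 1 for p in pts)      # any credited move

print(f"Solved positions : {solved}/20 ({solved * 100 // 20}%)")
print(f"Awarded points   : {sum(pts)}/200 ({sum(pts) * 100 // 200}%)")
print(f"Good moves       : {good}/20 ({good * 100 // 20}%)")
```

This reproduces the 12/20 (60%), 145/200 (72%) and 19/20 (95%) lines of the summary.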
Can I suggest you use a more predictable test set without a random choice of positions?
P.S. I actually don't do what I'm suggesting, and I have just found a nasty bug (still to be solved) that was added somewhere while adding features.
Hi Marco, it is deterministic (I was unclear in my response, sorry). If I run the test on 20 positions it always runs on the same 20.
step count = 1300 / test positions --> step count = 1300 / 20 = 65
positions: 0, 65, 130, 195, 260 ... 1235 are selected and tested
if I want to test 50 positions, it would be 0, 26, 52 ... 1274
For a very quick test I usually take 20 positions, for an average test 100 and for something serious the whole 1300.
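The selection rule described above can be sketched in a few lines (the function name is mine, not from iCE):

```python
def select_positions(total=1300, count=20):
    """Pick `count` evenly spaced indices out of `total` STS positions."""
    step = total // count
    return list(range(0, total, step))[:count]

# 20 positions: 0, 65, 130, ... 1235
print(select_positions(count=20))
# 50 positions: 0, 26, 52, ... 1274
print(select_positions(count=50))
```

Because the stride is derived deterministically from the requested count, repeated runs with the same count always test the same positions.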
Thomas...
ok, I didn't understand :-)