Saturday, April 6, 2013

How fit is the fitness function

So far I tuned my engine by solving EPD positions. The final versions all solved more positions than the original one. But solving more positions is not the ultimate goal. Finally the engine shall play better chess.

To test that I ran a small tournament between four different versions of iCE.   

ice.04 is the base version that uses manual picked weights (reasonable ones but not tuned)
ice.4109 used a training set of 4109 filtered positions  
ice.175000 used a training set of 175000 filtered positions
ice.match plays games among the engines within a generation


Here are the results of a 6000 games round robin tournament.

Rank Name         Elo    +    - games score oppo. draws
   1 ice.match     52   11   11  3000   60%   -17   29%
   2 ice.04        14   11   11  3000   53%    -5   25%
   3 ice.4109      -9   11   11  3000   48%     3   27%
   4 ice.175000   -57   11   11  3000   39%    19   24%


Hmm !

Using an EPD solver as fitness function seems not the best choice. That is not a problem of the genetic algorithm itself. It solved what it was told to solve.

The final version of ice.4109 solved 2811 positions out of 4109. The base version of ice.04 only solved 2437 positions. The algorithm definitely found a better solution, unfortunately not for the problem I have.

On the other hand the more complex fitness function that plays games really seems to get the job done, even with a quick and dirty run that was only meant as reference in this case.

So either my approach with EPDs is not good, maybe a 2 ply search per position is not enough (in real games depending on time control iCE searches between 8 and 18 plies in the mid game) or the correlation between single position performance and game play is not strong enough.

It also shows that using more positions doesn't help. This is really surprising.

I still think the EPD approach has potential if it is maybe changed a bit. Things that come into mind here are
  • Using a higher search depth to solve each position in the hope to pickup more positional concepts in the then larger tree. But this slows down the algorithm so maybe the population size must be lowered then.   
  • Modify the fitness function. Currently I award 1 point for a position solved and 0 points otherwise. I could try to award more points for a quiet solution because they are harder to find.
  • Splitting the training data into different sets (instead of 1 file with 100k positions make it 50 files with 2k positions) and  give each generation a different file. This might prevent the algorithm to converge towards a very specific weight set to solve one file and leads to a more general solution instead, which is stronger in game play.
So this area is far from being fully exploited. For the moment I will put more focus on the tournament fitness to generate a stronger base version with a serious run. I will then add new terms to my evaluation which requires retuning anyway.

So more to come ...

Tuesday, April 2, 2013

Inside PBIL for chess eval tuning using EPD data

As a previous post triggered some interest how I used PBIL to auto tune my evaluation I now publish some more details about my approach.

First you have to decide about a fitness function that guides the evolution towards a good solution. The two major approaches I can think of are
  1. Let the engine search for a best moves in given positions. Engines finding more correct moves are fitter than others.
  2. Play games among the engines and their scoring determines the winner in a population
Both have their pros and cons. The 1st one is a bit easier to implement, faster in execution and it is a bit easier to implement multi-threaded.

The second is closer to the final goal, because at the end you want a strong playing engine and not one that solves a lot of positions, but there seems some correlation between both topics. So an engine that finds the correct move more often is also stronger when playing games. I have yet to find out how strong the correlation is.

When you start thinking about auto tuning with PBIL I suggest you at least start with the 1.st approach (as I did) because of its speed advantages. You can run more evolutions this will help you to get your PBIL framework stable and you can try different settings for the GA to find a suitable set for you (e.g. for Learning Rates, Elitism, Mutation ...).

But before you even start with chess tuning you should solve simpler problems. I used a small matrix problem for that.  

A B C
D E F
G H I

where possible values for A - I are 0 - 15. The best solution has the sum of rows and columns all as 24. A working framework should find a perfect solution in no time.

+-----------------------------------------------------------------------+
| Generation: 334 / 500       Population: 100         RunTime: 00:00:01 |
| Worker Threads : 1          Best Fitness: 144  (Age: 131)       0 sec |
+-----------------------------------------------------------------------+
| LR: 0.020/2  NLR: 0.000/1   MP/Drift: 0.000 / 0.050                   |
| LR: 0.020/2  NLR: 0.000/1   MP/Drift: 0.000 / 0.050                   |
| Elitism: No                                           Entrop:    0.01 |
+-----------------------------------------------------------------------+
| Fitness (-/o/+): 144/144/144  SD: 0              Convergence: 99.990% |
+-----------------------------------------------------------------------+

DNA:
Row1:  10   8   6
Row2:  11   4   9
Row3:   3  12   9

     PV Distribution of 36 bits   Fitness (mean/max): 144/144
     Convergence:  99.99% (0.000%)     Entropy:   0.01 (-0.000)
     ^
     |                                                 ▄
  20 +                                                 █
     |▄                                                █
     |█                                                █
     |█                                                █
     |█                                                █
   0 +-----+----+----+----+----+----+----+----+----+----+
      0   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9   1


Then to the real stuff. When trying to tune an engine with chess positions the selection of "good" positions is very important. Just picking random positions from strong engine games at long time controls is probably not enough. I actually haven't even tried that because I feared that this is a waste of time. The fitness signal might be very weak. A position where the correct move requires a 20 ply search to be found is not very useful. Either no individual in a generation will find it within its search horizon so it does not help to determine the individual fitness or some engines will announce it as best for the wrong reasons.

For similar reasons also positions with tactical solutions (captures, promotions) are less useful because even engines that have the queen value badly wrong will probably go and capture the queen if possible. So almost all engines will solve that, which also does not help to differentiate them.

Therefore I started with positions from games of strong engines but filtered them. I gave them to Stockfish, Houdini and Kommodo and let those engines perform a very shallow search (depth 1or 2). Only if those strong engines were able to find the best move in I considered the position solvable.
  • Stockfish is the best (fastest) engine for that
  • Houdini 1.5 is a bit slower probably because of some overhead associated with starting a search
  • Kommodo is unstable (once in a while it does not come back with a best move after a "go depth 2" and requires an explicit "stop" command to stop searching)
With those positions the algorithm converged very nicely. I here display the performance of  an evolution with 2500 generations where I used 2000 filtered EPD positions as training data. I attach a link to the used training data in EPD format below in case someone finds it useful.

Probability Convergence of 507 bits towards 0 or 1


EPDs solved per generation in average and from the best individual (no elitism is used)

The Training Data:

2000 positions, verified with Houdini 1.5a

http://www.fam-petzke.de/downloads/TD-SUB-2000.epd

The EPD file also contains the score that Houdini awarded for that position, I saved it for maybe further use, but have not used it so far.

Saturday, March 30, 2013

Positional evaluation in chess

Chess engine are very good a tactics because it fits their way of approaching a position. To play moves that improve the position without gaining an immediate material advantage or even sacrifice material for a better position is much harder to program. It requires a careful coding of positional chess concepts and a fair amount of chess knowledge in its programmer.

My automated tuner seems to help here a bit. The tuned version is much more willing to give up some material to improve its position then the old "materialistic" version was.

In Jeremy Silman's "The Amateurs Mind" he discussed the following position

1r4k1/5p1p/1p3np1/pR6/P1Pr3P/2R2B2/1P4P1/7K w - -


Fischer - Taimanov, Palma de Mallorca 1970, White to Play
White has the queen side pawn majority and an active bishop targeting the queen side. The black knight wants to relocate to c5. Also Black threatens to take the white pawn at h4 with check.

I gave this position to my original version and it wasn't really understanding the position. At lower depths it tried to save the pawn with g3 and at higher depths it suggested 1. h5 Ne4 2. Bxe4 Rxe4 giving up its strong bishop and its advantage.

The tuned version very quickly found a better move. 1. c5! (1. .. Rxh4+ 2. Kg1 Rb4 3. Rb3 Rxb3 4. Rxb3 Nd7 5. c6). The pawn then queens after a rook sacrifice (Rxb6) which was a tactics my engine still missed so deep down the tree.

But it seems to make a little progress in the area of positional understanding.

Sunday, March 24, 2013

Time Trouble (again)
















During the tuning process of the engine I realized that some games that were played during the tuning are lost on time. The games are played very fast, e.g. 10 moves within 1 second, so an occasional loss on time is not uncommon. But after I measured it, the rate was between 15% and 20% time losses, which is much to high.

I thought my time management module was working correctly. It should adapt to fast time controls and prevent time losses, but it did not. I decided to have a look at it and spotted the problem almost instantly.
The second line was reading hardLimit = softLimit causing the trouble

The hard limit is the time that the search has to stop no matter what to prevent a time loss. The soft limit is a recommendation for the search but can be extended. There is a final check that the soft limit can never be bigger than the hard limit, but unfortunately I messed up the order. Instead of lowering the soft limit I extended the hard limit.

Stupid me. 
Fixed, tested and the time loss percentage dropped to 0%.

PS.: Another GA run is almost done, so I will soon have enough samples to check how much the whole stuff is worth in terms of ELO points. 

Saturday, March 16, 2013

Population Based Incremental Learning (PBIL)

As mentioned earlier I'm using a genetic algorithm to tune my chess evaluation module. I'm not using a classic GA where a good mother and a father is selected and their genes combined to produced the members of the next generation. I'm using a more modern approach called PBIL (Population Based Incremental Learning) and I even used it with some modifications. It is more similar now to "simulated annealing".

During my tests I made the following observations
  • The algorithm is able to converge the more important terms rather quickly
  • Terms with lesser influence converge very slowly, some none not at all given about 1000 generations
  • The choice of the fitness function is very important. Runs with some fast and simple functions failed badly.
  • The choice of the GA parameters is rather a second degree influence factor. All runs optimized the gene pool towards the fitness function very good. Some were just faster. But I haven't spent to much time trying to optimize these as a serious run with a complex fitness function takes about a week to complete. 
Here is the evolutionary progress of a run regarding the material weights. Already in generation 9 the values are very good, stable and close to their final values (except maybe for the queen weight).
Convergence data of the material values
Convergence of the Probability Vector after 9, 200, 400 and 1100 generations


























































As a measurement of increasing strength I checked the ability of the engines to find the best move in a set of positions. After less than 200 generations the engines reached the strength of my manual "tuned" engine. After 1000 generations the engines solved much more. Whether they also play better chess is yet to prove.




  

Saturday, March 9, 2013

Chess evaluation evolution

My little chess engine plays correct chess for some time now and also reached a decent strength level. So it is not yet among the worlds top 100 chess engines but the goal is in sight at least (place 117).

So far I haven't spent any time with tuning the evaluation. That does not mean its evaluation is simple. It is not, it possesses already a lot of chess knowledge but it is not able to use that knowledge to its full extend. Besides some basic material knowledge it also understands the concepts of mobility, center control, king safety, basic pawn structure properties, passed pawns or development.

So ice knows that having a doubled pawn is considered a liability and having control of the center is usually an advantage. So it will try to avoid unnecessarily doubling its pawns and will play moves that improve its influence in the center. But a game of chess consists usually of trade offs, unless your opponent makes a mistake you don't get anything for free. So it might be a good deal in a position to double the pawns if this also leads to more center control and a half open file for the rook, or it might still be bad deal.

Those things to decide are very difficult for an engine. A human will use its intuition for that which comes from experience, but an engine does not have this. It has to use a mathematical approach to decide whether some positional imbalances are good or bad for it.

And here tuning the evaluation features comes into play. With good tuned features an engine will be better in estimating whether the positional concessions in a line of play are worth it or whether this line should be avoided. Unfortunately picking the correct values is very tough and most of the values are connected somehow.

A simple example. Most chess players learn very early that a pawn is worth 1 point, knight and bishop 3, the rook 5 and the queen 9. Those are values that can be used as material values for the pieces. So now you add a bonus for the rook on an open file, this will increase the overall value of the rook, so to keep everything in balance its material value must be lowered. As my evaluation consists of more than a 100 terms that all are somehow connected the task of finding an optimal balanced setting lets the traveling sales man problem look like an easy job.

As I was always fascinated by genetic algorithms to solve optimization problems I decided to give it a try to optimize my evaluation. The general public opinion in the chess community is that it does not work. Of course this can also mean that it does work but the people that were successful with it don't talk about it. Having an effective and working automated evaluation tuner will give them a big competitive advantage against the manual weight pickers, so why give it away. I wanted to try it anyway because at least I learn something with it, even if it fails. And it is much more interesting than boring manual tuning.

I decided not to use some existing GA C libraries but to program the algorithm myself. Over the past month I dedicated a bit of my time to it and slowly an advanced tuning framework evolved. And my new i7 will provide the horse power to run it.

So far I did a few runs with different GA parameter settings and the results look promising. I might not be able to find the perfect weights but I'm surely able to find better weights than I would be able to find manually. So it is at least for me a limited form of success, and maybe more than that.

To find out I have to further test the evolved solutions with different parameters and fitness functions against the former version of ice and also against other engines. This will take some more time. Luckily not mine, everything runs automated and unattended, just my electricity bill will probably get a hit.

Wednesday, January 9, 2013

Multithreading in Windows

The main purpose of my little chess engine project is to grow my computer science skills by doing stuff I haven't done before. Chess programming is great for that because the such a project is small enough to be handled by a single person but complex enough to keep you busy for a long time.

It touches so many areas. You focus on complex high level algorithms and data structures but also on the assembler code that a certain compiler for a certain target platform creates to check whether your programs runs efficiently.    

Between Christmas and New Year I had a bit of time and decided to look into one area that always scared me so far.


Multi-Threading

It is not a completely new concept to iCE because it already dispatches a 2nd thread when started. But the only purpose of this thread is to listen for potential commands sent to the engine to keep the engine responsive even when doing some heavy calculations. The thread is very simple and most of the time just blocked because there is no input. I had to learn some nasty low level API calls to get a thread dispached like

 hThread = CreateThread( NULL, 0, (LPTHREAD_START_ROUTINE) TInputThread::execute, (LPVOID) this, 0, &dwThreadID );

but I got it working.

The real fun starts when multiple threads work on the same data to speed up the overall execution.I don't want to go there for my chess engine yet. It probably requires some significant design changes but for my tuning efforts I though it might be fun to build a multi-threaded tuning framework where different versions of my engine are tested in parallel.

A version of my engine is characterized by a set of evaluation parameter values. Due to my recent efforts those values can be changed anytime without recompiling the code. The tuning process might now build up a work queue of different parameter sets and test the performance of the engine with one of those sets in certain well defined tasks like solving mate puzzles or playing tons of games. This is a real CPU stress test and to fully utilize my new i7 running the whole stuff on a single core doesn't seem the smartest thing to do.  

So I had a reason to have a deeper look into multi-threading. I know how to start and control independent threads but I had no clue how difficult it is to synchronize threads. The usual way of doing  things just don't work. A simple

 if (myQueue.size() > 0) myQueue.pop();    // remove the first element from a queue

doesn't work any longer, because even if queue.size returns a value > 0 there is no guarantee that there is still an item in the queue if the execution reaches queue.pop. Another thread could have just removed the last item between the two calls.

In order to avoid such race conditions the sensitive parts must be protected with something called a critical section for threads or a mutex for process synchronization. There are tons of publications available how to do that and I had a good reading between the years.

One very clever technique I learned is called RAII (Resource Acquisition is Initialization). It basically means to wrap a C++ class around a critical section resource. The critical section is acquired with the construction of the class and freed when the class is destroyed. As in C++ an object is automatically destroyed when it leaves the scope the danger of not freeing a critical section (because of an exception within the code or a forgotten branch) is greatly reduced.

I now feel now a bit more comfortable about writing multi-threaded code. Maybe in a distant future iCE will use more that a 2nd little input handling thread.

We'll see.