Saturday, February 4, 2012

Speeding up the popcount stuff

A recent profile of my little engine revealed that one of the biggest single hotspots in the entire engine is the "popcount" routine. It returns the number of set "1" bits in a 64 bit integer. The position evaluation in particular uses it a lot. Questions like
  • How many pawns are on a file (doubled or tripled pawn detection)
  • How many squares can the queen attack next move
  • How mobile are the bishops
  • How many pawns are protecting the king in its king shelter
  • How many squares the knight can move to that are not attacked by enemy pawns
are all ultimately answered by counting the bits in a 64 bit integer. As iCE evaluates about 1 million positions per second, this routine is called many, many million times. So even a tiny speed improvement can make a big difference.
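To make this concrete: with bitboards each of these questions boils down to masking a 64 bit board and counting the remaining bits. A minimal sketch; the mask constant, the square-to-bit mapping and the function names are invented for the illustration and are not iCE's actual identifiers.

typedef unsigned long long bitboard;          /* one bit per square */

int popCount(bitboard b);                     /* any of the versions discussed below */

/* example mask: the 8 squares of the e-file, assuming a1 = bit 0 and
   files a..h mapped to bits 0..7 within each rank */
static const bitboard FILE_E = 0x1010101010101010ULL;

/* doubled/tripled pawn detection: how many own pawns sit on the e-file? */
int ownPawnsOnFileE(bitboard ownPawns)
{
    return popCount(ownPawns & FILE_E);
}

/* knight mobility: squares the knight can reach that are not covered
   by enemy pawns */
int safeKnightMoves(bitboard knightMoves, bitboard enemyPawnAttacks)
{
    return popCount(knightMoves & ~enemyPawnAttacks);
}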

With its Nehalem architecture (the i7 cores) Intel introduced a POPCNT processor instruction, but these new CPUs are not very widespread yet and I don't own one either, so for the immediate future popcount still has to be calculated in software. So far I have used a stable algorithm whose runtime does not depend on the number of set bits; it counts the bits with shift and multiply operations. I call it C-shift here (as it is my old shifting implementation in C):

int popCount_C_shift(int64 B)
{
        B = B - ((B >> 1) & 0x5555555555555555ULL);  // counts per 2-bit group
        B = (B & 0x3333333333333333ULL) +
            ((B >> 2) & 0x3333333333333333ULL);      // counts per 4-bit group
        B = (B + (B >> 4)) & 0x0f0f0f0f0f0f0f0fULL;  // counts per byte
        return (B * 0x0101010101010101ULL) >> 56;    // sum all bytes into the top byte
}
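As a side note on the hardware instruction mentioned above: once a Nehalem class CPU is available it can be used without any hand written assembly through a compiler intrinsic. This is only a sketch and not part of iCE; __popcnt64 is MSVC's intrinsic for x64 builds (declared in <intrin.h>) and it raises an invalid opcode fault on CPUs that lack POPCNT, so it must be guarded by a CPU check.

#include <intrin.h>

int popCount_hw(unsigned __int64 b)
{
    /* compiles to the single POPCNT instruction on x64 targets;
       only call this on CPUs that actually support it */
    return (int)__popcnt64(b);
}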

A new approach uses a loop that deletes one bit at a time and counts how often it loops. This is very fast when only a few bits are set (C-sparse):

int popCount_C_sparse(int64 b)
{
   int count = 0;
   while (b != 0)
   {
       count++;
       b &= b - 1; // reset LS1B
   }
   return count;
}
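To see why b &= b - 1 removes exactly one bit: subtracting 1 flips the lowest set bit to 0 and all bits below it to 1, so the AND clears that single bit and leaves everything above it untouched. For example b = 01011000b gives b - 1 = 01010111b and b & (b - 1) = 01010000b.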


And as a 3rd alternative I implemented the looping algorithm above directly in assembly, processing the 64 bit value in two 32 bit halves, to maybe reduce some overhead introduced by the compiler.

__asm
    {
            xor ecx,ecx                       ; ecx = bit counter
            mov edx, dword ptr [b]            ; low 32 bits of b
            and edx, edx                      ; test for zero
            jz L1
       
        L0: inc ecx   
            lea eax, [edx-1]
            and edx, eax                      ; clear lowest set bit
            jnz L0
       
        L1: mov edx, dword ptr [b + 04h]      ; high 32 bits of b
            and edx, edx                      ; test for zero
            jz L3

        L2: inc ecx   
            lea eax, [edx-1]
            and edx, eax                      ; clear lowest set bit
            jnz L2

        L3: mov eax, ecx                      ; return count in eax
    } 

I compiled the code with the Microsoft C compiler, the Intel C compiler and the GNU C compiler, let each of the algorithms count the 1 bits in a predefined large set of random numbers with an increasing population count, and measured the execution time.
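The test harness looked roughly like the sketch below; the sample count, the way numbers with a fixed population count are generated and the timing code are illustrative rather than my exact setup, and it simply links against the popcount versions shown above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned long long int64;     /* assumption: a 64 bit type as in the engine */

int popCount_C_shift(int64 B);        /* candidates under test, defined as above;   */
int popCount_C_sparse(int64 b);       /* swap in the version to be measured below   */

/* build a random 64 bit number with exactly n bits set */
static int64 randomWithBits(int n)
{
    int64 b = 0;
    while (n > 0)
    {
        int64 bit = 1ULL << (rand() % 64);
        if (!(b & bit)) { b |= bit; n--; }
    }
    return b;
}

int main(void)
{
    enum { SAMPLES = 1000000 };
    static int64 data[SAMPLES];
    int bits, i;

    for (bits = 1; bits <= 16; bits++)
    {
        volatile int sink = 0;        /* keeps the calls from being optimized away */
        clock_t t0, t1;

        for (i = 0; i < SAMPLES; i++) data[i] = randomWithBits(bits);

        t0 = clock();
        for (i = 0; i < SAMPLES; i++) sink += popCount_C_sparse(data[i]);
        t1 = clock();

        printf("%2d bits set: %.3f s\n", bits, (double)(t1 - t0) / CLOCKS_PER_SEC);
    }
    return 0;
}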

Here are the results:

[Chart: execution time of the three popcount versions by number of set bits, MSVC build]

The old C shift based implementation is stable; it always takes the same time regardless of the number of set bits. Both looping versions outperform it when fewer bits are set. The break-even point is at about 7 bits. The code produced by the MSVC compiler was slightly faster than my handcrafted assembly.

A similar picture emerges with the Intel compiler:

[Chart: execution time of the three popcount versions by number of set bits, Intel C compiler build]

The shift based C code is stable and runs slightly faster than the same algorithm compiled with MSVC. The looping versions outperform it but are a bit slower than the MSVC code. No profile guided optimization (PGO) was used. Maybe that would level the execution speed compared with MSVC, but I did not test it.

Finally I compiled the source with the GNU C compiler, using gcc's builtin popcount function instead of my assembly code, as gcc does not understand inline assembly in Intel syntax.

[Chart: execution time of the three popcount versions by number of set bits, gcc build]

It shows that gcc's builtin popcount is probably implemented as a shift based algorithm, and surprisingly it is the worst performer of all three. The looping algorithm takes a performance hit between 2 and 3 set bits, but it still outperforms the other two up to 6 set bits.
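For completeness, hooking the builtin into the gcc build was just a one line wrapper; the function name here is only for this test, not an engine identifier.

#if defined(__GNUC__)
int popCount_gcc_builtin(unsigned long long b)
{
    /* gcc expands this to its own bit counting code unless the target
       is allowed to use the POPCNT instruction (e.g. -mpopcnt) */
    return __builtin_popcountll(b);
}
#endif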

So I took the architectural decision to drop my handcrafted assembly code and introduce the C version of the looping algorithm into my engine. I replaced all popcount calls where the number of set bits is expected to be smaller than 7 (like the number of pawns on the 7th rank) with the new popcount code and kept the old one for the remaining calls (e.g. the number of squares a queen can attack).
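At the call sites the split looks roughly like this; the bitboard and mask names are invented for the illustration and are not the actual identifiers in iCE.

/* few bits expected: there can never be more than 8 own pawns on the
   7th rank, usually it is 0, 1 or 2 -> use the sparse loop */
int pawnsOn7th = popCount_C_sparse(whitePawns & RANK_7_MASK);

/* potentially many bits: a queen can attack up to 27 squares
   -> keep the stable shift version */
int queenMoves = popCount_C_shift(queenAttacks & ~ownPieces);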

I got a nice overall speedup of more than 10% in execution speed, which made the exercise well worth it.