- How many pawns are on a file (double or triple pawn detection)
- How many squares can the queen attack next move
- How mobile are the bishops
- How many pawns are protecting the king in its king shelter
- How many squares has the knight to move that are not attacked by pawns of the enemy
Intel introduces with its Nehalem architecture in i7 cores a popcount processor instruction but the new CPUs are not so wide spread yet and I don't own one either so for the immediate future popcount still has to be calculated in software. So far I used a stable algorithm that uses shift and multiply operations to do that. This algorithm is here called C-shift (as it is my old shifting implementation in C)
int popCount_C_shift(int64 B)
{
B = B - ((B >> 1) & 0x5555555555555555ULL);
B = (B & 0x3333333333333333ULL) +
((B >> 2) & 0x3333333333333333ULL);
B = (B + (B >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
return (B * 0x0101010101010101ull) >> 56;
}
{
B = B - ((B >> 1) & 0x5555555555555555ULL);
B = (B & 0x3333333333333333ULL) +
((B >> 2) & 0x3333333333333333ULL);
B = (B + (B >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
return (B * 0x0101010101010101ull) >> 56;
}
int popCount_C_sparse(int64 b)
{
int count = 0;
while (b != 0)
{
count++;
b &= b - 1; // reset LS1B
}
return count;
}
And as 3rd alternative I decided to implement the above C algorithm directly in Assembler using 32 bit instructions to maybe reduce some overhead introduced by the compiler.
__asm
{
xor ecx,ecx
mov edx, dword ptr [b]
and edx, edx
jz L1
L0: inc ecx
lea eax, [edx-1]
and edx, eax
jnz L0
L1: mov edx, dword ptr [b + 04h]
and edx, edx
jz L3
L2: inc ecx
lea eax, [edx-1]
and edx, eax
jnz L2
L3: mov eax, ecx
}
{
xor ecx,ecx
mov edx, dword ptr [b]
and edx, edx
jz L1
L0: inc ecx
lea eax, [edx-1]
and edx, eax
jnz L0
L1: mov edx, dword ptr [b + 04h]
and edx, edx
jz L3
L2: inc ecx
lea eax, [edx-1]
and edx, eax
jnz L2
L3: mov eax, ecx
}
I compiled the code using the Microsoft C compiler, the Intel C Compiler and the gnu C compiler and let each of the algorithms count the 1 bits in a predefined large set of random numbers with an increasing population count and measured the execution time.
Here are the results
The old C shift based implementation is stable, it runs always the same time independent of the number of set bits. Both looping algorithms outperform it with lesser bits. The break even is with about 7 bits. The MSVC compiler code was slightly faster as my handcrafted assembler.
A similar picture is achieved using the Intel compiler
The shift C code is stable and runs slightly faster that the same algorithm compiled with MSVC. The looping algorithms outperform it but are a bit slower than the MSVC code. No profile guided optimization has been used (PGO). Maye that would level the execution speed compared with MSVC but was not tested.
Finally I compiled the source with the gnu C compiler using the gnu C builtin popcount instruction instead of my assembler code as the gnu C compiler does not understand inline assembler in Intel Syntax.
It shows that the native gcc popcount instruction is probably implemented as a shifting algorithm and surprisingly it is the worst performing of all 3. The looping algorithm gets a performance hit between 2 and 3 bits set but it still outperforms the other 2 until 6 bits are set.
So I took the architectural decision to drop my handcrafted assembler code and introduce the C version of the looping algorithm into my engine. I replace all popcount calls where the number of set bits is expected to be smaller than 7 (like number of pawns on the 7th rank) with the new popcount code and keep the old one for the remaining calls (e.g. number of squares a queen can attack).
I got a nice speedup of more than 10% in execution speed, which was well worth the exercise.