Thoughts about bit-representation of particle configurations #1
GaffaSnobb started this conversation in Ideas
Sub-state index representation
In second quantisation we describe a quantum mechanical system in terms of occupation numbers, stating how many particles occupy each state rather than specifying the state of each particle. This is a more compact and often more manageable way to describe many-body systems, and it aligns well with how a computer represents data. Let's say that we have two particles in a system, occupying two specific sub-states which we'll label 3 and 5. In second quantisation we represent the wavefunction of such a system by using creation and annihilation operators:

$$| \Psi \rangle = c_3^\dagger c_5^\dagger | \text{core} \rangle$$
If we want to represent $| \Psi \rangle$ on a computer we might store only which sub-states are occupied, like

`[3, 5]`
This is currently how `kshell-cpp` stores particle configurations. I think that it is quite expensive to insert and delete elements from a `std::vector`, though this approach has the benefit of being a very intuitive way of implementing creation and annihilation operators. Additionally, it is important to know the order of the particles. While the initial choice of particle placement is arbitrary, changing the order of the two operators will introduce a sign shift. This is because of the anti-commutation relations of the creation and annihilation operators

$$\{ c_i, c_j \} = 0, \qquad \{ c_i, c_j^\dagger \} = \delta_{ij}, \qquad \{ c_i^\dagger, c_j^\dagger \} = 0$$

where the last of the three is the relevant one in this case. This means that every time we want to remove or add a particle sub-state in the `psi` list we have to consider the order in which the particles are put into the state (the order of the creation operators). The convention I have chosen is to put the particles into the state in order of increasing sub-state index, building all the basis states this way, and choosing positive sign. If we choose to add a particle in sub-state 2 the operation is trivial, since the operators are already ordered by increasing sub-state index:

$$c_2^\dagger | \Psi \rangle = c_2^\dagger c_3^\dagger c_5^\dagger | \text{core} \rangle$$

However, if we add a particle to sub-state 4, one anti-commutation is needed to restore the ordering, which introduces a minus sign:

$$c_4^\dagger | \Psi \rangle = c_4^\dagger c_3^\dagger c_5^\dagger | \text{core} \rangle = - c_3^\dagger c_4^\dagger c_5^\dagger | \text{core} \rangle$$
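This vector-based bookkeeping might be sketched as follows. The function name and signature are my own illustration, not necessarily how `kshell-cpp` implements it:

```cpp
#include <algorithm>
#include <vector>

// Apply a creation operator for `substate` to a configuration stored as a
// sorted vector of occupied sub-state indices (increasing index convention).
// Returns the sign (+1 or -1) picked up from anti-commuting the operator
// into place, or 0 if the sub-state is already occupied (Pauli principle).
int create(std::vector<unsigned>& psi, unsigned substate)
{
    // Find the position that keeps the vector sorted by increasing index.
    auto pos = std::lower_bound(psi.begin(), psi.end(), substate);
    if (pos != psi.end() && *pos == substate) return 0;  // already occupied

    // Each creation operator we anti-commute past contributes a factor -1.
    int n_swaps = static_cast<int>(pos - psi.begin());
    psi.insert(pos, substate);  // O(n) element shuffling: the expensive part
    return (n_swaps % 2 == 0) ? 1 : -1;
}
```

For `psi = [3, 5]`, `create(psi, 2)` returns +1 while `create(psi, 4)` returns -1, matching the operator algebra above.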
Bit-representation
Since the occupation of each sub-state only requires a single bit (occupied or not), we might look to other more memory-efficient and more performant ways of storing the wavefunctions. A "bit-like" representation of `[3, 5]` might look like

`[0, 0, 0, 1, 0, 1]`
Here, a 0 at position $i$ means that sub-state $i$ is un-occupied while a 1 means that it is occupied. Of course, storing each sub-state as an individual integer in some list / vector data structure is unnecessary since we only need to differentiate between occupied and un-occupied. We might therefore use bit-manipulation on an appropriately sized primitive data type, for example an `int`, or we might use C++'s `std::bitset`. If using a primitive data type, we are limited to the bit size of said data type. Consider a large model space of the $sd$, $pf$, and $sdg$ major orbitals. There are a total of 62 distinct sub-states (all the $j$ projections for all of the orbitals combined) for both protons and neutrons, totalling 124 sub-states, meaning that a 128 bit integer is appropriate. For the following example, let us reduce to 16 possible sub-states represented by an `unsigned short` in C++.

It is easy to make a simple framework for applying creation operators to the empty (vacuum) state. We now see that our previous basis state `[3, 5]` can be represented as the number 40, since setting bits 3 and 5 gives $2^3 + 2^5 = 40$. Adding or removing particles (applying creation and annihilation operators) is as easy as flipping the correct bit. However, we need to be mindful of the order of operations because we might create a sign in the process.
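A minimal sketch of such a framework (my own illustration, ignoring the sign bookkeeping for the moment):

```cpp
#include <cstdint>

// A 16-sub-state configuration: bit i set means sub-state i is occupied.
using State = std::uint16_t;  // unsigned short on common platforms

constexpr State vacuum = 0;   // the empty (vacuum) state

// Apply a creation operator for `substate` by setting the corresponding bit.
// Sign bookkeeping is deliberately omitted in this sketch.
inline State create(State state, unsigned substate)
{
    return state | static_cast<State>(1u << substate);
}
```

With this, `create(create(vacuum, 3), 5)` yields `0b0000000000101000`, i.e. the number 40.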
According to the good ole' Internet, `std::bitset` is more user-friendly than, and equally fast as, bit-flipping primitive data types if used correctly. I will therefore start with `std::bitset` and possibly benchmark it against a primitive data type later.

Speed difference
Perhaps not surprisingly, using a bit-representation with `std::bitset` and flipping bits is a LOT faster than inserting into and removing from a `std::vector`. To apply creation and annihilation operators, we simply need to flip a bit instead of resizing a vector. My current implementation is as follows. Note that I am using
`state[bit_to_reset] = 0` instead of `state.reset(bit_to_reset)` because the latter checks whether the index is out-of-bounds and thus has a slower execution time. The same can be said for using `state[i]` instead of `state.test(i)`. As you might also see from the code example above, there's a choice of using either `if (state[i]) count++` or `count += state[i]`, which has the same end result since the values are either 1 or 0. Using the if-statement reduces the execution time slightly, but creating a branch might prove a disadvantage when I soon start testing GPU acceleration of the code.

The above function does the job of applying an annihilation operator, which is the same as setting the correct bit to zero, i.e. resetting the bit, but it also has to count how many bits are set before the bit which will be reset. This is because the annihilation operator has to be moved so that it is applied next to the creation operator of the same bit, and changing positions of operators creates signs which must be accounted for. Since the sign is always calculated by taking -1 to the power of a non-negative integer, we might get some speed-up by using the fact that -1 to an even power is 1, and to an odd power is -1:
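Putting the pieces described above together, the annihilation step might look like this. This is a sketch of the approach, not the verbatim `kshell-cpp` code; the parity of the count is checked via its lowest bit:

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t n_substates = 16;

// Apply an annihilation operator for sub-state `bit_to_reset`.
// Returns the sign (+1 or -1) from anti-commuting the operator into place,
// or 0 if the sub-state is empty (the state is annihilated).
int annihilate(std::bitset<n_substates>& state, std::size_t bit_to_reset)
{
    if (!state[bit_to_reset]) return 0;  // acting on an empty sub-state

    // Count set bits below the target bit; each costs one anti-commutation.
    int count = 0;
    for (std::size_t i = 0; i < bit_to_reset; i++)
        if (state[i]) count++;  // state[i] skips the bounds check of state.test(i)

    state[bit_to_reset] = 0;    // faster than state.reset(bit_to_reset)

    // (-1)^count without std::pow: even count -> +1, odd count -> -1.
    return (count & 1) ? -1 : 1;
}
```

For the state 40 (sub-states 3 and 5 occupied), annihilating sub-state 5 gives sign -1 because one set bit (sub-state 3) lies below it.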
As you see, there are three ways of calculating this exponent: using `std::pow`, checking the least significant bit to see if the exponent is odd or even, or using the modulo operator to check if the exponent is divisible by two. With `-O0`, blasting the three methods with some millions of calculations gave the times (19.54375, 19.70725, 19.9485) s respectively, which was a bit surprising to me since I expected the very generalised `std::pow` to perform slightly worse than a tailored solution, but it turns out that it performs almost identically to the other implementations. The same test with `-Ofast` gave (1.542, 1.4765, 1.475) s, which is almost identical too, but with just maybe the tiniest little advantage to the two tailored solutions. I will test this more on GPU.

Using a primitive data type instead of `std::bitset`
std::bitsetSo,
std::bitsethas some conveniences, likebitset.set,bitset.reset,bitset.test, etc. But since these methods are doing boundary checks, they turn out to be a bit slower than just indexing the bitset. And if I'm not using the conveniences of a bitset, I might check out a primitive data type instead.An
unsigned long longhas 64 bits (I dont know if theunsignedpart is important but thats what I'm using). Thesd-pf-sdgmodel space has 62 m-substates, so it seems sufficient to use (one for protons, one for neutrons). Instead of iterating to the bit I want to set (or reset), I can create a bitmask and set the bit I want to:If I want to set a bit and count how many bits before that bit is set, which amounts to ordering the creation operators before letting them act, I can do
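A sketch of what this could look like, alongside the bitset version it is compared against (function names are my own; `__builtin_popcountll` is a GCC/Clang builtin):

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Primitive version: build a bitmask, count the set bits below the target
// bit with a single popcount of the masked-off lower bits, then set the bit.
int set_and_count(std::uint64_t& state, unsigned bit)
{
    const std::uint64_t mask = 1ULL << bit;
    // mask - 1 has exactly the bits below `bit` set, so this popcount is the
    // number of creation operators the new operator must anti-commute past.
    const int count = __builtin_popcountll(state & (mask - 1));
    state |= mask;  // set the bit
    return count;
}

// std::bitset version: the same operation, but the bits below the target
// are counted with an explicit loop.
int set_and_count(std::bitset<64>& state, std::size_t bit)
{
    int count = 0;
    for (std::size_t i = 0; i < bit; i++)
        if (state[i]) count++;
    state[bit] = 1;
    return count;
}
```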
while with the `std::bitset` I would do the same by indexing the bitset and counting the lower bits in an explicit loop. The assembly for these two solutions is quite different. Now, I'm not gonna pretend to be an expert on assembly code – because I'm not – but we can see that the two solutions produce more or less the same assembly for the first six lines, except that the bitset solution reserves 64 bits on the stack while the primitive solution reserves 32. But more relevant is that the bitset solution creates a loop in the assembly code for counting the bits, while the primitive solution uses `sal` for shift, `and` for masking, `or` for setting a bit, and `__builtin_popcountll`, which likely translates directly to a highly optimized CPU instruction for counting set bits. Overall, there are fewer assembly instructions in the primitive solution, and some benchmarking shows that it is approximately 30% faster than the bitset solution.