Constant and global device arrays and scopes #6

GaffaSnobb · 2024-05-10T12:03:07Z

GaffaSnobb
May 10, 2024
Maintainer

In my shell model solver, there are a lot of arrays whose values are calculated in the initial stage of the program and then stay constant for the rest of the run time. These values sound well suited for __constant__ device arrays. The first problem is that the size of a __constant__ array must be known at compile time, which is problematic since the lengths of the arrays depend on parameters of the calculation, like the model space of the interaction. For many of the arrays in question, this issue is mitigated by the fact that there is a relatively small upper limit to their sizes. A solution is then to allocate the arrays to accommodate the largest possible size. The sizes are as of now (parameters.hpp):

CONST_MEM_ARR_LEN_INDICES = 5688
CONST_MEM_ARR_LEN_N_ORBITALS = 12*2

The number of orbitals corresponds to the sdpf-sdg model space, which has 12 proton orbitals and 12 neutron orbitals.

Now to the programming part. Initially I tried to make the __constant__ device arrays available across translation units by declaring them in a common header file like

extern __constant__ dtype_t arr[LEN];

With CUDA I would then use the -rdc=true flag to enable relocatable device code, but I have not managed to make this work with HIP. I have therefore decided to let the __constant__ arrays live only in the file scope of hamiltonian_device.cpp. Normally I dislike defining stuff in the global or file scope, but defining the __constant__ arrays in a function scope and then passing the values as arguments to the kernels proved to be difficult if not impossible.

Now! One notable array which is constant during the entire program run time is the array containing the $M$ basis states. It has the added complexity of being fricken huge. Well, not always, as it depends on the number of valence particles, but for most interesting calculations its size will be in the millions or billions.

The size is problematic in two ways. On Nvidia GPUs the __constant__ memory is very fast but quite small, on the order of tens of kB, typically 64 kB. This is most certainly too little for the basis states, and likely too little for the other __constant__ arrays: 5688*sizeof(double) = 45504 B, which on its own fills more than two thirds of those 64 kB. However, __constant__ memory seems to work a bit differently in the AMD universe. On my 7900 XTX I have the following stats:

prop.totalConstMem: 2147.48 MB
prop.totalGlobalMem: 25753 MB

Here the reported __constant__ memory is far larger, though it does not cover the entire VRAM of the GPU. I suspect that __constant__ memory acts differently on the AMD platform. Anyway, all the __constant__ arrays except for the basis states will easily fit into that memory, case closed.
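The size arithmetic above can be written down as a small compile-time check. This is a sketch under assumptions: the capacities are hard-coded (64 KiB for the Nvidia case, and 2^31 B, which is what the reported 2147.48 MB corresponds to), the helper names are hypothetical, and the number of arrays of each length is a made-up parameter since it depends on the solver. In a real program the capacity could instead be read at run time from prop.totalConstMem.

```cpp
#include <cstddef>

// Array length from parameters.hpp.
constexpr std::size_t CONST_MEM_ARR_LEN_INDICES = 5688;

// Assumed capacities in bytes (not queried from any device here).
constexpr std::size_t NVIDIA_CONST_MEM_BYTES = 64 * 1024;
constexpr std::size_t AMD_REPORTED_CONST_MEM_BYTES = std::size_t{1} << 31;

// Bytes needed by n_arrays arrays holding n_elements doubles each.
constexpr std::size_t array_bytes(std::size_t n_arrays, std::size_t n_elements)
{
    return n_arrays * n_elements * sizeof(double);
}

// True if the requested number of bytes fits into the given capacity.
constexpr bool fits(std::size_t needed_bytes, std::size_t capacity_bytes)
{
    return needed_bytes <= capacity_bytes;
}

// One array of 5688 doubles takes 45504 B and fits in 64 KiB ...
static_assert(array_bytes(1, CONST_MEM_ARR_LEN_INDICES) == 45504,
              "5688 doubles should take 45504 B");
static_assert(fits(array_bytes(1, CONST_MEM_ARR_LEN_INDICES), NVIDIA_CONST_MEM_BYTES),
              "one index array fits in 64 KiB");

// ... but two of them already exceed 64 KiB, while the size reported
// on the AMD card leaves plenty of room.
static_assert(!fits(array_bytes(2, CONST_MEM_ARR_LEN_INDICES), NVIDIA_CONST_MEM_BYTES),
              "two index arrays do not fit in 64 KiB");
static_assert(fits(array_bytes(2, CONST_MEM_ARR_LEN_INDICES), AMD_REPORTED_CONST_MEM_BYTES),
              "two index arrays fit in 2^31 B");
```

Because everything is constexpr, a change to the lengths in parameters.hpp that no longer fits the assumed budget would be caught at compile time rather than at run time.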

It is not possible to determine the size of the basis states array without some pre-calculation which depends on the parameters of the problem. I have therefore decided to put the basis states into the global device memory. However, I am not able to declare the device pointer in file scope as I did with the __constant__ arrays. Consider the following minimal (non)working example:

#include <hip/hip_runtime.h>

__device__ int *device_array;

__global__ void simple_kernel(int size, int* arr)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < size)
    {
        // device_array[idx] += 5; // Doesn't work! Memory access fault by GPU node-1 (Agent handle: 0x75c4c0) on address (nil). Reason: Page not present or supervisor privilege.
        arr[idx] += 5;          // Works fine!
    }
}

int main()
{
    const int array_size = 256;
    int *host_array = new int[array_size]();  // Zero-initialised so the copied values are well defined.
    
    hipMalloc(&device_array, array_size * sizeof(int));
    hipMemcpy(device_array, host_array, array_size * sizeof(int), hipMemcpyHostToDevice);
    hipLaunchKernelGGL(simple_kernel, dim3(1), dim3(array_size), 0, 0, array_size, device_array);
    hipMemcpy(host_array, device_array, array_size * sizeof(int), hipMemcpyDeviceToHost);
    
    hipFree(device_array);
    delete[] host_array;

    return 0;
}

This approach would work if device_array were a __constant__ array. The solution for now is to pass all the non-__constant__ arrays as kernel arguments.
