
gpu-performance-test

This repository contains code to benchmark GPU memory latency and the branch-divergence penalty.

Branch penalty

The pseudocode for the benchmark is:

// Generate a random array with a discrete distribution
bool random[32] = [1 with probability p else 0 ]
copy random to GPU
__global__ kernel (bool *random){
    int idx = (threadIdx.x + blockIdx.x * blockDim.x) % 32;
    if (random[idx]) {
        do something
    } else {
        do the same thing with different parameters
    }
}

Note: warps are assumed to be of size 32.

The CPU code generates an array R with 32 elements, each drawn from a Bernoulli distribution: 1 with probability p and 0 with probability 1-p. This array is then copied to the GPU.

In the GPU kernel, the variable idx is computed as THREAD_ID % 32. If R[idx] is 1, the kernel performs a certain action; if it is 0, the kernel executes a function with the same number of PTX instructions but different parameters. This causes branch divergence every time a 0 is encountered, as can be seen in the graph below.
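
For concreteness, here is a minimal CUDA sketch of this benchmark. The kernel name, the per-branch work (chains of fmaf calls), and the launch parameters are illustrative assumptions, not the repository's exact code:

// branch_penalty.cu -- illustrative sketch, not the repository's exact code
#include <random>
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

__global__ void branch_kernel(const bool *random, float *out, int iters) {
    int idx = (threadIdx.x + blockIdx.x * blockDim.x) % WARP_SIZE;
    float x = 1.0f;
    if (random[idx]) {
        // "do something": a chain of fused multiply-adds
        for (int i = 0; i < iters; ++i) x = fmaf(x, 1.0001f, 0.5f);
    } else {
        // same instruction count, different parameters
        for (int i = 0; i < iters; ++i) x = fmaf(x, 0.9999f, 0.25f);
    }
    // Write the result so the compiler cannot remove the branch bodies
    out[threadIdx.x + blockIdx.x * blockDim.x] = x;
}

int main() {
    const float p = 0.5f;  // probability that an element is 1
    const int blocks = 1024, threads = 256, iters = 10000;

    // Bernoulli(p) array generated on the CPU
    bool h_random[WARP_SIZE];
    std::mt19937 rng(42);
    std::bernoulli_distribution coin(p);
    for (int i = 0; i < WARP_SIZE; ++i) h_random[i] = coin(rng);

    bool *d_random;
    float *d_out;
    cudaMalloc(&d_random, sizeof(h_random));
    cudaMalloc(&d_out, blocks * threads * sizeof(float));
    cudaMemcpy(d_random, h_random, sizeof(h_random), cudaMemcpyHostToDevice);

    branch_kernel<<<blocks, threads>>>(d_random, d_out, iters);
    cudaDeviceSynchronize();

    cudaFree(d_random);
    cudaFree(d_out);
    return 0;
}

Note that because every warp reads the same 32-entry array, a single 0 among the ones is enough to make every warp in the grid diverge.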

The results show that any divergence within a warp, even a single thread taking the other branch, halves the performance of the program.

Memory latency

The pseudocode for the benchmark is:

auto vector = [i for i in N]
vector.shuffle();
copy the vector to the GPU
__global__ kernel(uint64_t* vector) {
    int tid = (threadIdx.x + blockIdx.x * blockDim.x) % N;
    auto index = tid;
    for (i=0 to reads) {
        index = vector[index];
    }
}

The code begins by creating an array where array[i] = i and then shuffles it with the xoroshiro-cpp random number generator. The shuffle is done on the CPU to avoid random number generation on the GPU, which would distort the measurements. The resulting array is then copied to the GPU.

The kernel uses the thread ID as the starting index into the array and then follows the chain index = vector[index] in a loop that executes reads times. Because the array is a random permutation, this produces uniformly distributed random memory accesses, and each load depends on the result of the previous one, so the benchmark measures latency rather than bandwidth. The loop runs reads times to amortize the kernel-launch overhead, which would otherwise skew the results.
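
For concreteness, here is a minimal CUDA sketch of this benchmark. The names and launch parameters are illustrative assumptions, and std::shuffle with std::mt19937_64 stands in for the xoroshiro-cpp generator to keep the example self-contained:

// memory_latency.cu -- illustrative sketch, not the repository's exact code
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>
#include <cuda_runtime.h>

__global__ void chase_kernel(const uint64_t *vec, uint64_t *sink,
                             uint64_t n, int reads) {
    uint64_t tid = (threadIdx.x + blockIdx.x * (uint64_t)blockDim.x) % n;
    uint64_t index = tid;
    // Each load depends on the previous one, so this measures latency,
    // not bandwidth
    for (int i = 0; i < reads; ++i) {
        index = vec[index];
    }
    // Write the final index out so the loop is not optimized away
    sink[tid] = index;
}

int main() {
    const uint64_t N = 1 << 24;  // vary this to sweep the cache levels
    const int reads = 100000;

    // vector[i] = i, shuffled on the CPU into a random permutation
    std::vector<uint64_t> h_vec(N);
    std::iota(h_vec.begin(), h_vec.end(), 0);
    std::mt19937_64 rng(42);  // the repository uses xoroshiro-cpp instead
    std::shuffle(h_vec.begin(), h_vec.end(), rng);

    uint64_t *d_vec, *d_sink;
    cudaMalloc(&d_vec, N * sizeof(uint64_t));
    cudaMalloc(&d_sink, N * sizeof(uint64_t));
    cudaMemcpy(d_vec, h_vec.data(), N * sizeof(uint64_t),
               cudaMemcpyHostToDevice);

    // A single thread measures pure latency; launching many blocks and
    // threads instead adds memory contention
    chase_kernel<<<1, 1>>>(d_vec, d_sink, N, reads);
    cudaDeviceSynchronize();

    cudaFree(d_vec);
    cudaFree(d_sink);
    return 0;
}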

The results show that when the array fits in cache, the latency is around 30 ns. When the array no longer fits in cache, the latency rises to ~170 ns, and in some cases up to ~300 ns. Introducing memory contention (parallel accesses from multiple threads) pushes the latency to around ~360 ns, as shown in the graph above. For comparison, shared memory has a read latency of ~17 ns with or without contention.

Note: tests were executed on an NVIDIA RTX 3090 Ti with CUDA V11.8.89.

GitHub Actions setup from: https://github.com/Ahdhn/CUDATemplate
