Implementation of SHA-256 on the GPU (and sample implementation on the CPU).
To compile the CPU version of the code:
g++ sha.cpp -o shacpu -std=c++20 -O3
The GPU version of the code was designed for a Nvidia 2080Ti, and likely will not run on any older hardware since it attempts to use the maximum number of threads possible. To compile it, run:
nvcc sha.cu -o shacuda -O3
# for GPU version
./shacuda <path-to-file>
# for CPU version
./shacpu <path-to-file>
First, you initialize an array of 8 32-bit integers that will represent the hash to the first 32 bits of the fractional parts of the square roots of the first 8 primes. We also initialize a constant array k
that is the first 32 bits of the fractional parts of the cube roots of the first 64 primes.
SHA256 processes the input in 512-bit chunks, and for each chunk we generate a message schedule, an array of 64 32-bit integers. The first 16 integers in this array are just copied from the input buffer. The next 48 are calculated via the following process (where rotr
is a right bit rotation, aka a cyclic right bit shift):
for (int i = 16; i < 64; i++) {
uint32_t s0 = (rotr(w[i-15], 7) ^ rotr(w[i-15], 18) ^ (w[i-15] >> 3));
uint32_t s1 = (rotr(w[i-2], 17) ^ rotr(w[i-2], 19) ^ (w[i-2] >> 10));
w[i] = w[i-16] + s0 + w[i-7] + s1;
}
This message schedule is then used in the compression phase of SHA256 where the current hash values are copied into another array a
and operated on 64 times (once for each element in the message schedule and k
). This is done as follows:
for (int i = 0; i < 64; i++) {
uint32_t s1 = (rotr(a[4], 6) ^ rotr(a[4], 11) ^ rotr(a[4], 25));
uint32_t ch = (a[4] & a[5]) ^ ((~a[4]) & a[6]);
uint32_t temp1 = a[7] + s1 + ch + k[i] + w[i];
uint32_t s0 = (rotr(a[0], 2) ^ rotr(a[0], 13) ^ rotr(a[0], 22));
uint32_t maj = (a[0] & a[1]) ^ (a[0] & a[2]) ^ (a[1] & a[2]);
uint32_t temp2 = s0 + maj;
for (int i = 7; i > 0; i--) a[i] = a[i-1];
a[4] += temp1;
a[0] = temp1 + temp2;
}
Sidenote: because a
is initialized to the current hash, this process is inherently sequential.
Afterwards, we add each element of a
to the hash. Finally, the last buffer is padded with an extra 1 bit after the input and the length of the array is stored in the last 64 bits of the 512-bit chunk.
shacuda
uses a sha_ctx
class to store the hash.
When run, the code starts by calling fstat
on the file provided to determine its size. It then allocates two large buffers on the GPU: a 0.25 GiB one called buf
that the file is continously read into, and another 1 GiB one w
that is meant to store all of the message schedules for the file.
The primary part of the code is a do-while loop in the main function that continously reads the file to buf
. Within this while loop there are essentially two conditions handled: whether the was fully or partially read into.
If the buffer is full, we call a kernel process
with a grid of 8x8 thread blocks with 64 threads (in total 4096 threads). Each thread is responsible for generating the message schedule for 1024 of the 512-bit “chunks” that SHA operates on. To do this, they calculate their “thread index” (essentially determining which thread they are), and generate from this a unique offset in the large w
array to write their output into and offset in the buf
array to read the input from. This means that the operation is parallelized in a rather clean manner: each thread is responsible for its own section of the input and own section of the output. Each thread then generates their respective message schedules via the method outlined by the SHA algorithm, albeit with some byte swapping because SHA operates on big endian numbers whilst x86 is little endian. Once this is done, a function running on the CPU does the “compression” phase of SHA by iterating over all of the generated message schedules in w
sequentially and updating the hash in the sha_ctx
class. Convieniently, we do not need to do all of our arithmetic mod 32 when doing math for these steps since 32-bit overflow is defined behavior in C/C++.
If the buffer is not full, we pad it as SHA specifies by clearing out the input part of the w
array, writing a 1 bit after the input and the length at the end of the 512-bit chunk.
Finally, after this while loop we do a final chunk with the SHA padding if we only ever had filled buffers, and then we print out the hash!
Since kernels are pretty expensive to call, there’s a tradeoff between VRAM usage and how many kernels we call: we could have 5 gigabytes of VRAM allocated for the input and output buffers and then call less kernels to process the data (since they’re each capable of working on more at a time), but a higher memory footprint.
Neither optimization is nearly as optimized as I’d like it to be: being smarter about how kernels are called and doing some more extensive profiling would likely help speed things up quite a bit. Furthermore, I’d like to clean up the code some and also make it accesible as a library rather than just an executable to call.