serial: Serial version

step1: Added simple acc directives
         #pragma acc routine seq
         #pragma acc parallel loop
       The image array is transferred to the GPU unnecessarily

step2: Added an unstructured data region (enter, exit)
         #pragma acc enter data create(image[0:3*(image_width*image_height)])
         #pragma acc exit data delete(image[0:3*(image_width*image_height)])
       Added present clause to acc parallel loop
         #pragma acc parallel loop present(image[0:3*(image_width*image_height)])
       -> This eliminates the data transfer to the device
       Added explicit host update
         #pragma acc update self(image[0:3*(image_width*image_height)])
       (see sketch 1 at the end of this file)

step3: Added block/tile loop to separate the image into chunks of rows
       -> Compute only
       Added a separate pragma for "data present" instead of attaching it to the loop clause
         #pragma acc data present(image[0:3*(image_width*image_height)])
         #pragma acc parallel loop

step4: Added blocking of data transfers by moving the "update" pragma inside the block loop
       and changing the bounds of the image update
         #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)])

step5: Now that computation and data transfers are blocked/tiled, allow for asynchronous work
       by adding async clauses to the parallel loop and update self pragmas
         async(block % 2 + 1)
       -> use modulus to assign blocks to 2 separate CUDA streams depending on block id
       -> the "+1" is to ensure we're not using the default stream
       We must also synchronize before the data is accessed (written to file)
         #pragma acc wait
       (see sketch 2 at the end of this file, covering steps 3-5)

step6: Compute a portion of the blocks on each of the 4 GPUs on Summitdev
       Query for the # of GPUs
         int num_gpus = acc_get_num_devices(acc_device_nvidia);
       Add an OpenMP parallel region to divide the blocks among GPUs
         #pragma omp parallel
       Assign 1 GPU per OpenMP thread (this requires num_gpus == num_omp_threads)
         int omp_thread_id = omp_get_thread_num();
         acc_set_device_num(omp_thread_id % num_gpus, acc_device_nvidia);
       Move the "acc enter data create" inside the "omp parallel" region so that the image array
       is allocated on each GPU
       -> Each GPU does not actually need a copy of the entire array, but for now we leave it.
       Add an OpenMP parallel loop directive to the block loop to divide the blocks among GPUs
         #pragma omp for
       NOTE: This problem is NOT load balanced. Some GPUs have more compute-intensive
       portions of the image to work on.
       (see sketch 3 at the end of this file)

step7: Balance the workload among the GPUs

step8: Decompose the grid using MPI
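
sketch 1 (step2): unstructured data region + present clause + explicit host update.
This is a minimal illustrative sketch, not the tutorial's actual source: render_pixel(),
the image dimensions, and the grayscale fill are placeholder assumptions; only the
pragmas mirror the notes above.

  #include <stdlib.h>

  /* Device-callable per-pixel kernel (placeholder for the real one). */
  #pragma acc routine seq
  static unsigned char render_pixel(int x, int y, int width, int height)
  {
      return (unsigned char)(((x + y) * 255) / (width + height));
  }

  int main(void)
  {
      const int image_width  = 4096;
      const int image_height = 4096;
      unsigned char *image = malloc(3 * (size_t)image_width * image_height);

      /* Allocate the image on the device only -- no transfer either way. */
      #pragma acc enter data create(image[0:3*(image_width*image_height)])

      /* "present" tells the compiler the device copy already exists,
         eliminating the implicit transfer that step1 suffered from. */
      #pragma acc parallel loop present(image[0:3*(image_width*image_height)])
      for (int y = 0; y < image_height; y++) {
          for (int x = 0; x < image_width; x++) {
              unsigned char v = render_pixel(x, y, image_width, image_height);
              image[3 * (y * image_width + x) + 0] = v;
              image[3 * (y * image_width + x) + 1] = v;
              image[3 * (y * image_width + x) + 2] = v;
          }
      }

      /* Explicit device-to-host copy before the host touches the data. */
      #pragma acc update self(image[0:3*(image_width*image_height)])

      /* ... write image to file on the host ... */

      #pragma acc exit data delete(image[0:3*(image_width*image_height)])
      free(image);
      return 0;
  }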
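
sketch 2 (steps 3-5): blocked compute, blocked transfers, and async work on two streams.
Written as a standalone function over a hypothetical image buffer already created on the
device (as in step2); num_blocks is assumed to divide image_height evenly, and only the
pragmas and update bounds mirror the notes above.

  #pragma acc routine seq
  unsigned char render_pixel(int x, int y, int width, int height);

  /* image: 3 bytes per pixel, row-major. */
  void render_blocked_async(unsigned char *image, int image_width,
                            int image_height, int num_blocks)
  {
      int block_height = image_height / num_blocks;

      #pragma acc data present(image[0:3*(image_width*image_height)])
      {
          for (int block = 0; block < num_blocks; block++) {
              int y_start = block * block_height;

              /* Compute one strip of rows on stream (block % 2 + 1). */
              #pragma acc parallel loop async(block % 2 + 1)
              for (int y = y_start; y < y_start + block_height; y++) {
                  for (int x = 0; x < image_width; x++) {
                      unsigned char v = render_pixel(x, y, image_width, image_height);
                      image[3 * (y * image_width + x) + 0] = v;
                      image[3 * (y * image_width + x) + 1] = v;
                      image[3 * (y * image_width + x) + 2] = v;
                  }
              }

              /* Copy only this strip back, on the same stream, so the copy of
                 one block overlaps the compute of the block queued on the
                 other stream. */
              #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)]) async(block % 2 + 1)
          }

          /* Both streams must finish before the host writes the file. */
          #pragma acc wait
      }
  }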
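
sketch 3 (step6): one GPU per OpenMP thread, blocks divided among GPUs.
Again a sketch, not the tutorial's source: it assumes OMP_NUM_THREADS is set equal to the
number of GPUs (4 on Summitdev), reuses the placeholder render_pixel() from sketch 1, and
only the API calls and pragmas mirror the notes above.

  #include <omp.h>
  #include <openacc.h>

  #pragma acc routine seq
  unsigned char render_pixel(int x, int y, int width, int height);

  void render_multi_gpu(unsigned char *image, int image_width,
                        int image_height, int num_blocks)
  {
      int num_gpus = acc_get_num_devices(acc_device_nvidia);
      int block_height = image_height / num_blocks;

      #pragma omp parallel
      {
          /* Bind this OpenMP thread to one GPU. */
          int omp_thread_id = omp_get_thread_num();
          acc_set_device_num(omp_thread_id % num_gpus, acc_device_nvidia);

          /* Each thread allocates the whole image on its own GPU --
             wasteful, but kept simple for now (see the note in step6). */
          #pragma acc enter data create(image[0:3*(image_width*image_height)])

          /* Static division of blocks among threads, hence among GPUs.
             NOT load balanced if some strips are more expensive than others. */
          #pragma omp for schedule(static)
          for (int block = 0; block < num_blocks; block++) {
              int y_start = block * block_height;

              #pragma acc parallel loop present(image[0:3*(image_width*image_height)]) async(block % 2 + 1)
              for (int y = y_start; y < y_start + block_height; y++) {
                  for (int x = 0; x < image_width; x++) {
                      unsigned char v = render_pixel(x, y, image_width, image_height);
                      image[3 * (y * image_width + x) + 0] = v;
                      image[3 * (y * image_width + x) + 1] = v;
                      image[3 * (y * image_width + x) + 2] = v;
                  }
              }

              /* Each thread copies back only the strips it computed. */
              #pragma acc update self(image[block*(3*block_height*image_width):block_height*(3*image_width)]) async(block % 2 + 1)
          }

          #pragma acc wait
          #pragma acc exit data delete(image[0:3*(image_width*image_height)])
      }
  }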