/Global-Memory-Tracing

Compiler Plugin for Global Memory traces in CUDA applications

Primary LanguageJupyter NotebookMIT LicenseMIT

Introduction

Instrumentation to trace Global Memory operations in CUDA applications for sake of communication analysis. Consists of Clang front-end plugin for AST Manipulation and a LLVM pass for trace instrumentation. Applications instrumented with a tracing, generate a trace file in the /tmp folder.

The application creates one trace file for each CUDA stream.

Trace Format

Each file has the following format:

	<size of one record, in bytes> //stored in a single byte
	<kernel name> // name of kernel, followed by LF
	<records> // all trace records, as one binary blob
	000000000000000000000000 // EOT
	<kernel name>
	<records>
	000000000000000000000000
	<EOF>	

Record Format

Each record has the same structure, and looks like this:

	|32 bit |4 bit   |28 bit |64 bit |64 bit        	  |
	|191 160|159  156|155 128|127  64|63   32|31  16|15  0|
	|SMID   |Type    |Size   |Address|CTA.x  |CTA.y |CTA.z|

Requirements

  • Clang/LLVM v. 5.0 or higher
  • CMake

Build Project

$ cd memtrace-pass/build
$ cmake ..
$ make

Instrumentation

Instrumenting an application at compile time happens in two steps. The first step is the instrumenting all source file containing:

  • Kernel Calls

  • Kernel and __device__ function definitions

  • Kernel __device__ functoin declarations

Each file needs to be parsed separately. For each file, a new file with the prefix ’augmented-’ is created. This is then used for the further build process. Compile units bear some problems with kernels defined and declared in .cu files, which are then included in other files. The easiest workaround is to copy the kernel into the file including it, and then only use the including file for instrumentation and build.

The following snipped is an example how to augment a kernel file.

clang++ -Xclang -load -Xclang "clang-plugin/libMemtrace-AA.so"
     -Xclang -plugin -Xclang cuda-aug -Xclang -plugin-arg-cuda-aug 
     -Xclang -f -Xclang -plugin-arg-cuda-aug -Xclang ./augmented-kernel.cu 
     -I$CUDASYS/samples/common/inc -I. -I../utils
     --cuda-path=$CUDASYS -std=c++11 -E application/kernel.cu 

Next, the host and device utils are compiled, for later linking with the application.

// Host
clang++ -c -L$CUDASYS/lib64 --cuda-path=$CUDASYS 
 -I$CUDASYS/samples/common/inc -O1 --cuda-gpu-arch=sm_30
  -I$CUDASYS/include  -o hutils.o --std=c++11 -I. -I../utils
 ../utils/TraceUtils.cpp
// Device
clang++ -c -L$CUDASYS/lib64 --cuda-path=$CUDASYS 
 -I$CUDASYS/samples/common/inc -O1 --cuda-gpu-arch=sm_30
 -I$CUDASYS/include -o dutils.o --std=c++11 -I. -I../utils
 ../utils/DeviceUtils.cu

Next, all the augmented kernel files are compiled at once.

clang++ -c -Xclang -load -Xclang $LLVMPLUGIN  --cuda-path=$CUDASYS
 -I$CUDASYS/samples/common/inc --cuda-gpu-arch=sm_30 -L$CUDASYS/lib64  -O1
 -lcudart_static -m64  --std=c++11 -I. -I../utils -I<app-includes>
  ./augmented-kernel.cu ./augmented-kernel2.cu

Finally, all compiled files are linked together.

clang++ --cuda-path=$CUDASYS -I$CUDASYS/samples/common/inc --cuda-gpu-arch=sm_30 -L$CUDASYS/lib64  -O1
 -lcudart -ldl -lrt -L. -m64 --std=c++11 -I. -I../utils -I<app-includes>
 -o application dutils.o hutils.o augmented-kernel.o augmented-kernel2.o

During the execution one or more files are generated in the /tmp folder. The files are named MemTrace-pipe-<n>, one for each stream.

Analyse Data

Any software related to what that happens after the analysis, is stored in ’memtrace-pass/post-processing’. The trace files can be parsed with the ’extract-subset.py’ python3 script. From the original full trace, the communication subset is extracted. Three data structures are generated by ’extract-subset.py’. All structures are stored using pickle. The data structures are of type ’AutoDict()’, which is defined in ’TraceInc.py’

  1. Load/Store volumes, both total and during communication, for Kernels, CTAs and SMs. Separated by kernel and superstep.

    "KCV" : { // Kernel Communication Volume
        <kernel>: {
            <superstep>: {
                "Load" : <int>
                "Store": <int>
            }
        }
    }
    "KDV" : { // Total Kernel Data Volume
        <kernel>: {
            <superstep>: {
                "Load" : <int>
                "Store": <int>
            }
        }
    }
    "CCV" : { // CTA Communication Volume
        <kernel>: {
            <superstep>: {
                <CTA> : {
                    "Load" : <int>
                    "Store": <int>
                }
            }
        }
    }
    
    "CDV" : { // Total CTA Data Volume
        <kernel>: {
            <superstep>: {
                <CTA> : {
                    "Load" : <int>
                    "Store": <int>
                }
            }
        }
    }
    "SCV" : { // SM Communication Volume
        <kernel>: {
            <superstep>: {
                <SM> : {
                    "Load" : <int>
                    "Store": <int>
                }
            }
        }
    }
    
    "SDV" : { // Total SM Data Volume
        <kernel>: {
            <superstep>: {
                <SM> : {
                    "Load" : <int>
                    "Store": <int>
                }
            }
        }
    }
    
        
    
  2. Map of transfers, ordered by kernel, CTA and superstep.

    <source-kernels> : {
        <source-cta> : {
            <source-superstep> : {
                <recv-kernel>: {
                    <recv-cta>: {
                        <recv-superstep>: {
                            'Size' : <int>
                            'cnt'  : <int>
                        }
                    }
                }
            }
        }
    }
        
    
  3. All addresses involved in communication, with all operations performed on this address, in superstep order.

    {
        <address> : [
            { // record
                "kernel": <kernel>
                "it"    : <superstep>
                "cta"   : <CTA-ID, xyz order>
                "addr"  : <address>
                "smid"  : <SM id>
                "size"  : <operation size in bytes>
                "type"  : <type of MOp>
            }
        ]
    }