################################# PENCIL Image Processing Benchmark ################################# # Prerequisites ################ - An OpenCL driver, header files and libraries. - The following packages: * build-essential * cmake * TBB (Threading Building Blocks) - A PENCIL compiler (PPCG for example). - The PRL runtime library (included in PPCG). - The PENCIL header files (included in PPCG). - OpenCV (2.4.9.1 or above): * Tested with 2.4.9.1 and 2.4.10.1 * You can install the OpenCV packages (not tested) or download and compile the sources (as described below). * Download OpenCV git clone https://github.com/Itseez/opencv.git * Checkout 2.4.9.1 or newer: cd opencv git checkout 2.4.10.1 cd .. * If you are running the benchmark on ARM Mali GPU you need to apply the patch 0001-Make-image-filtering-erode-dilate-work-with-ARM-Mali.patch provided with the benchmark to OpenCV. This patch changes the workgroup sizes to smaller ones so that OpenCV kernels can run on Mali (which only accepts small workgroups). cd opencv git apply ../pencil-benchmarks-imageproc/0001-Make-image-filtering-erode-dilate-work-with-ARM-Mali.patch cd .. * Configure build with CMake: mkdir opencv-build cd opencv-build cmake ../opencv * Details: - If your CPU supports AVX/SSEx instruction sets, you can add the following to the "cmake ../opencv" command: -DENABLE_AVX=ON -DENABLE_SSE41=ON -DENABLE_SSE42=ON -DENABLE_SSSE3=ON -DENABLE_SSE3=ON - If your CPU is ARM, set the appropriate NEON/VFP switches. - Optional: add TBB usage with -DWITH_TBB=ON (requires libtbb-dev package). - Make sure the necessary modules are to be built - CMake should print a (long) status with a "To be built:" list, it should contain core, ocl and highgui (plus their dependencies). * Build OpenCV make all -j12 * Details: You can replace the number 12 with the number of threads of your processor(s) * Optionally, install OpenCV as a system library: sudo make install ############ # Building ############ Two methods are possible. You can either use bash scripts (in scripts/) or use CMake. # Building Using Bash Scripts (recommended): ############################################## We recommend this method as you can use the same scripts to perform auto-tuning which is not possible with the Cmake method. - Set the variables in scripts/scripts_config.conf to define the paths of the libraries, headers and tools used in the benchmark. PENCIL_COMPILER_BINARY: the full path of the PENCIL compiler binary PENCIL_INCLUDE_DIR: path of the directory containing "pencil.h" PRL_LIB_DIR: path of the directory containing the PRL library (libprl.so) PRL_INCLUDE_DIR: path of the directory containing "prl.h" OPENCL_LIB_DIR: path of the directory containing the OpenCL library OPENCL_INCLUDE_DIR: path of the OpenCL header files OPENCV_INCLUDE_DIR: path of the OpenCV header files OPENCV_LIB_DIR: path of the OpenCV library files LIST_OF_KERNELS: defines a blank separated list of the kernels that should be compiled or autotuned. Please look into the file scripts/scripts_config.conf for an example and for more details. You can use the default values of the variables except the path variables which should be set correctly. - You can use the following script to build the benchmark: ./scripts/compile_and_run_kernels.sh A log is generated in build/benchmark_building_log.txt # Building Using CMake : ######################### To build the benchmark, you need: - Compile benchmark: Crate a directory for out-of-source build, run cmake and make: - CMake can be configured from a GUI window using cmake-gui. (recommended: list all parameters that can be change) - OpenCL might not be found by the CMake script. In this case, set OPENCL_LIBRARY manually. It is a required parameter for ARM Mali: list BOTH libOpenCL.so and libmali.so. - Based on your OpenCL device, it is recommended to set up PENCIL_DEFAULT_FLAGS_* variables: The default values are set to be the common denominator to allow it to work everywhere, but the performance is not optimal. PENCIL_DEFAULT_FLAGS_BLOCKSIZE should be set to the maximum workgroup sizes. For instance, AMD Radeon cards should use "16,16" (256 threads total) PENCIL_DEFAULT_FLAGS_LOCAL_MEMORY_SIZE is the amount of used __local memory. Alternatively set to 0 to disable __local memory usage (useful for ARM Mali). These parameters are reported by the tool clinfo: First is "Max work items[x]:" (individually) and "Max work group size:" (the multiply of values); second is "Max local memory:" - Optionally, as performance tuning try turning PENCIL_DEFAULT_FLAGS_MAXFUSE, PENCIL_DEFAULT_FLAGS_NO_SEPARATE_COMP, PENCIL_DEFAULT_FLAGS_DISABLE_PRIVATE flags ON or OFF. - If something is not installed at the standard paths (/usr, /usr/local), then run cmake-gui and edit the variables (should be straight-forward) - Set CMAKE_BUILD_TYPE to Release for optimized build, set to Debug to build without optimizations and with debug info. mkdir pencil-benchmarks-imageproc-build cd pencil-benchmarks-imageproc-build cmake -DCMAKE_BUILD_TYPE=Release -DOPENCL_LIBRARY=/path/to/libOpenCL.so -DPENCIL_DEFAULT_FLAGS_BLOCKSIZE="A,B" -DPENCIL_DEFAULT_FLAGS_LOCAL_MEMORY_SIZE=C path/to/pencil-benchmarks-imageproc-repo make all -j12 The CMake cache is persistent between builds, you only need to supply a parameter when you want to override it. - (Optional) Tune per-kernel parameters: We usually need to support separate tile-grid-block sizes for different kernels, and allows access to advanced parameters of the polyhedral compiler. You can supply these parameters using a corresponding PENCIL_FLAGS_* parameter, separate for every benchmark item. If this is not supplied, the default flags are used. - Warning: Due to CMake restrictions, some characters (quotes, backslash) needs escaping. Also, the semicolon character cannot be used in these strings. instead of: --sizes={kernel[i]->tile[8,8];kernel[i]->grid[8,8];kernel[i]->block[8,8]} use: --sizes=\"{kernel[i]->tile[8,8]}\" --sizes=\"{kernel[i]->grid[8,8]}\" --sizes=\"{kernel[i]->block[8,8]}\" - Build benchmark: cd repository_root_path make all -j12 - Examples of tested hardware parameters (to use with CMake) AMD Radeon R9 290 -DPENCIL_DEFAULT_FLAGS_BLOCKSIZE="16,16" -DPENCIL_DEFAULT_FLAGS_LOCAL_MEMORY_SIZE=32768 -DPENCIL_DEFAULT_FLAGS_MAXFUSE=ON -DPENCIL_DEFAULT_FLAGS_NO_SEPARATE_COMP=ON -DPENCIL_DEFAULT_FLAGS_DISABLE_PRIVATE=OFF nVidia Tesla M2050 -DPENCIL_DEFAULT_FLAGS_BLOCKSIZE="32,32" -DPENCIL_DEFAULT_FLAGS_LOCAL_MEMORY_SIZE=49152 -DPENCIL_DEFAULT_FLAGS_MAXFUSE=ON -DPENCIL_DEFAULT_FLAGS_NO_SEPARATE_COMP=ON -DPENCIL_DEFAULT_FLAGS_DISABLE_PRIVATE=OFF Intel HD Graphics 4000 -DPENCIL_DEFAULT_FLAGS_BLOCKSIZE="32,16" -DPENCIL_DEFAULT_FLAGS_LOCAL_MEMORY_SIZE=65536 -DPENCIL_DEFAULT_FLAGS_MAXFUSE=ON -DPENCIL_DEFAULT_FLAGS_NO_SEPARATE_COMP=ON -DPENCIL_DEFAULT_FLAGS_DISABLE_PRIVATE=OFF ARM Mali T628 -DPENCIL_DEFAULT_FLAGS_BLOCKSIZE="8,8" -DPENCIL_DEFAULT_FLAGS_LOCAL_MEMORY_SIZE=0 -DPENCIL_DEFAULT_FLAGS_MAXFUSE=ON -DPENCIL_DEFAULT_FLAGS_NO_SEPARATE_COMP=ON -DPENCIL_DEFAULT_FLAGS_DISABLE_PRIVATE=OFF Please, share the parameters to your device here # Auto-tuning and Time Measurements ###################################### - You can use the following script to perform auto-tuning for the PPCG PENCIL compiler. ./scripts/ppcg_tuning.sh The script generates many PPCG options internaly, and launches PPCG with each one of these options, then will compile each file generated by PPCG and will execute it. The execution times measured for each kernel are reported in build/output_time.$KERNEL_NAME.csv (a list of all the options explored by the auto-tuner is provided along with the execution time for each option). The best execution times for each kernel are reported in build/output_time.csv The best PPCG options obtained through tuning are reported in the file build/best_optimizations_log.sh A log is generated in build/benchmark_building_log.txt - You can use the following script to build the benchmark using the default PPCG options: ./scripts/compile_and_run_kernels.sh The script will generate the file build/output_time.csv that constains the timings for each kernel. In the generated timings file, the total_execution_time_reference column indicates the total execution time for the OpenCV calls (this includes the data copy time and the kernel execution time but does not include kernel compilation time, the OpenCV kernel is precompiled by being run once before time measurements are taken). The total_execution_time_optimized column indicates the total execution time for the PPCG generated code (this includes the data copy times, the kernel execution time, the time spent by any host code generated by the PENCIL compiler but does not include kernel compilation time, kernels are precompiled also by being run once before taking time measurements). kernel_only_execution_time_optimized column indicates kernel execution time measured using OpenCL profiling (does not include data copies and kernel compilation). The script uses default PPCG options except for the workgroup sizes where workgroups of size 16x16 are set instead of the default 32x32 workgroups which may not work on some architectures. Please note that in some cases (e.g. for ARM Mali) you might need to set an even lower value. - You can also use the previous script to build the kernels with specific PPCG options. In this case you need to pass a file containing the PPCG options as an argument. ./scripts/compile_and_run_kernels.sh <ppcg_options_file> The previous command compiles and runs the kernels using the options defined in <ppcg_options_file> (for example you can use the file scripts/ppcg_preset_options/best_optimizations_log.Nvidia_GTX470.sh). Such a file is obtained in general using autotuning (as described above the auto-tuning script generates a file containing the best optimization options that were found during auto-tuning and which you can reuse if you don't want to redo autotuning). <ppcg_options_file> provides an option for each kernel listed in $LIST_OF_KERNELS. The option that corresponds to the i'th kernel in the $LIST_OF_KERNELS is stored in best_optimization_options[$i]. The folder scripts/ppcg_preset_options/ contains many files with preset options. # Running Individual Kernels Manually ####################################### cd build/ ./ppcg_test_KERNEL_NAME <path_to_image> Details: - Example images are in images/ - If you use CMake, it will also produce a test_* files. - These binaries call the original PENCIL code not the code generated by the PENCIL compiler. - Each executable runs a different operator with all three (C++, OpenCL, PENCIL) implementations. - Results are cross-checked, and if there is no difference (within a small allowed precision error), total times are reported at the end. # Troubleshooting ################## - In case of a compilation/execution error, please check the log file in build/benchmark_building_log.txt - If you get an execution error while runing a script, you should run a single kernel to see what is exactly the error message, to run the resize kernel for example: cd build/ ./ppcg_test_resize ../images/M104_ngc4594_sombrero_galaxy_hi-res.jpg this will run the resize kernel (OpenCV CPU, OpenCV OpenCL and the code generated using the PENCIL compiler) on the M104_ngc4594_sombrero_galaxy_hi-res.jpg image. - You may get a CL_INVALID_WORK_GROUP_SIZE error, this means that the workgroup sizes set by the PENCIL compiler are too big for your architecture. You can reduce the workgroup sizes for autotuning by removing big workgroup sizes from the variable POSSIBLE_BLOCK_SIZES If you are using the compile_and_run_kernels.sh script, you can set the options of the PENCIL compiler by creating a file similar to the file scripts/ppcg_preset_options/ppcg_default_options.sh