This repository contains efforts to optimize the Parquet file reading capabilities of libcudf. We explore various optimizations, including memory management, multi-streaming, and multi-threading techniques, to improve read performance.
Prerequisites:
- CMake
- CUDA environment
- libcudf and libarrow libraries
To install libcudf, see https://github.com/rapidsai/cudf
To install libarrow, see https://arrow.apache.org/install/
I have also summarized my own installation steps, along with the issues I encountered and their solutions, in Libcudf_compile.md. I hope it helps you get a successful build.
- Clone the repository: git clone https://github.com/gm3g11/libcudf_parquet_reading.git
- Navigate to the project directory: cd /xxx_dir/libcudf_parquet_reading
- Modify CMakeLists.txt: set the project name and the source file to compile. For example, the provided CMakeLists.txt uses "basic_example" as the project name (line 6) and compiles "chunked_parquet_read_self_implmentation_multi_streams_multi_threads.cpp" (line 14).
- Compile and run with build.sh: bash build.sh
  The script runs three commands:
  A. Configure: cmake -S . -B build/ -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CXX_STANDARD=17 -Dcudf_ROOT=/home/gymeng/Desktop/cudf/cudf_24.06/cudf/cpp/build
  (Set -Dcudf_ROOT to your own libcudf build directory. The value 70 for -DCMAKE_CUDA_ARCHITECTURES targets compute capability 7.0; change it to match your GPU.)
  B. Build: cmake --build build/ --parallel 16
  C. Run: ./build/basic_example
  (The executable name follows the project name in CMakeLists.txt. Adjust these commands to your own setup.)
For more details, see the report Libcudf_project_update.pdf.
- lib_arrow.cpp: The benchmark method derived from Apache Arrow.
  Compile command: g++ -O3 -fopenmp -o arrow_parquet -std=c++17 lib_arrow.cpp -lparquet -larrow -lgomp
  Runtime command: ./arrow_parquet your_parquet_file
- parquet_read_baseline.cpp: The benchmark method for libcudf. (You can modify this file to try different settings, including pool_memory and ramfs.)
- chunked_parquet_read_self_implmentation.cpp: chunked Parquet read
- chunked_parquet_read_self_implmentation_multi_streams_multi_threads.cpp: chunked Parquet read + streams + threads
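To illustrate the general idea behind a chunked, multi-threaded read, here is a minimal standard-library sketch. It is not the code in this repository and it does not call libcudf: the names make_chunks, process_chunk, and read_chunked_parallel are hypothetical, the row groups are represented only by their row counts, and the GPU decoding step is replaced by a placeholder that sums rows. It assumes the file is split into fixed-size groups of row groups and that worker threads take chunks in round-robin order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Half-open range [begin, end) of row-group indices handled as one chunk.
using Chunk = std::pair<std::size_t, std::size_t>;

// Split `num_row_groups` into chunks of at most `chunk_size` row groups each.
std::vector<Chunk> make_chunks(std::size_t num_row_groups, std::size_t chunk_size) {
    std::vector<Chunk> chunks;
    if (chunk_size == 0) return chunks;  // no valid chunking for size 0
    for (std::size_t begin = 0; begin < num_row_groups; begin += chunk_size) {
        chunks.emplace_back(begin, std::min(begin + chunk_size, num_row_groups));
    }
    return chunks;
}

// Hypothetical stand-in for the real work. In an actual reader this is where
// one chunk would be decoded; here it only sums the row counts of the chunk.
std::int64_t process_chunk(const std::vector<std::int64_t>& rows_per_group, Chunk c) {
    std::int64_t rows = 0;
    for (std::size_t i = c.first; i < c.second; ++i) rows += rows_per_group[i];
    return rows;
}

// Process all chunks with `num_threads` workers.
// Worker t handles chunks t, t + num_threads, t + 2 * num_threads, ...
std::int64_t read_chunked_parallel(const std::vector<std::int64_t>& rows_per_group,
                                   std::size_t chunk_size, std::size_t num_threads) {
    const std::vector<Chunk> chunks = make_chunks(rows_per_group.size(), chunk_size);
    num_threads = std::max<std::size_t>(1, num_threads);

    // One result slot per worker, so no locking is needed.
    std::vector<std::int64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < chunks.size(); i += num_threads) {
                partial[t] += process_chunk(rows_per_group, chunks[i]);
            }
        });
    }
    for (auto& w : workers) w.join();

    std::int64_t total = 0;
    for (std::int64_t p : partial) total += p;
    return total;
}
```

For example, calling read_chunked_parallel with five row groups, a chunk size of 2, and 3 threads produces three chunks, one per worker. In a real GPU reader each worker would typically also own its own stream so that chunks can overlap; that part is omitted here.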
For questions or support, please open an issue in the GitHub repository or e-mail me directly at gmeng@nd.edu.