The efficient use of multicore architectures for sparse matrix-vector multiplication (SpMV) is currently an open challenge. One algorithm which makes use of SpMV is the maximum likelihood expectation maximization (MLEM) algorithm. When using MLEM for positron emission tomography (PET) image reconstruction, one requires a particularly large matrix. We present a new storage scheme for this type of matrix which cuts the memory requirements by half, compared to the widely-used compressed sparse row format. For parallelization we combine the two partitioning techniques recursive bisection and striping. Our results show good load balancing and cache behavior. We also give speedup measurements on various modern multicore systems.
- Initial Verison and Algorithmic, MPI Version: Tilman Kuestner
- LAIK Version, HPC Optimization: Josef Weidendorfer, Tilman Kuestner
- LAIK Version, Measurements, HPC Optimizations: Dai Yang, Tilman Kuestner
- The new Hybrid and OpenMP version: Rami Al-Rihawi, Tilman Kuestner, Dai Yang
- CUDA version: Apporva Gupta, Mengdi Wang, Dai Yang
Name | MPI | OpenMP | LAIK | Others | Description |
---|---|---|---|---|---|
openmpcsr4mlem | no | yes | no | Pure, native OpenMP Implementation. | |
openmpcsr4mlem-pin | no | yes | no | NUMA - Optimized OpenMP Version using thread Pinning | |
openmpcsr4mlem-knl | no | yes | no | numactl, memkind | Special Optimized version for Intel© Xeon Phi© Knight's Landing (KNL) Processors |
mpicsr4mlem | yes | no | no | Pure MPI Implementation | |
mpicsr4mlem2 | yes | yes | no | Hybrid MPI-OpenMP Implementation with Thread Pinning | |
mpicsr4mlem3 | yes | yes | no | Hybrid MPI-OpenMP Implementation with Thread Pinning, HBM Optimization and Cache Blocking | |
cudacsr4mlem | no | yes | no | CUDA + NCCL | High Performance CUDA implementation with OpenMP acceleration |
laikcsr4mlem | yes | no | yes | MLEM ported to LAIK to enable application-integrated Fault Tolerance. Tested with commit 3d1cbf2933f283885c826a8bc589225dd76d5ef4 of LAIK. |
For experiments and publications with MLEM, two different data sets are provided. These dataset are simimulated PET scanners using the DRF method with artifical scanner configurations that are close to real-life settings, however, they do not represent real-world scanners, so they should be used for evaluation purpose only.
- Boost Program Options, for the iterators and program options.
- C++11 compatible, preferable GNU Compiler
- OpenMP Support
- For LAIK version, in addition: LAIK Library
- For CUDA Version, in addition: CUDA 10, NVIDIA Collective Communication Library, cuSparse, cuBlas, spTrans
- MLEM Data Sets, optainable at TUM University Library, t.b.d.
make $TARGET
mpirun -np <num_tasks> ./mpicsr4mlem <matrix> <input> <output> <iterations> <checkpointing>
mpirun -np 4 ./mpicsr4mlem test.csr4 sino65536.sino mlem-60.out 60 0
mpirun -np <num_tasks> ./laikcsr4mlem <matrix> <input> <output> <iterations>
LAIK_LOG=2:0 mpirun -np 4 ./laikcsr4mlem test.csr4 sino65536.sino mlem-60.out 60
To test the repartitioning part, try:
SHRINK_ITER=5 SHRINK_FROM=10 SHRINK_TO=9 REPART_TYPE=2 LAIK_LOG=2:0 mpirun -np 10 ./laikcsr4mlem matrix.csr4 input.LMsino /dev/null 10
- -D_HPC_ enables the HPC optimization of this code. It uses memcpy instead of mmap to optimize NUMA operations.
- -DMESSUNG disables default outputs and enables output for measurement for clusters.
Any publication using the MLEM code must be informed to the authors of this code, e.g. Tilman Küstner.
- Original MLEM paper: L. A. Shepp and Y. Vardi, "Maximum Likelihood Reconstruction for Emission Tomography," in IEEE Transactions on Medical Imaging, vol. 1, no. 2, pp. 113-122, Oct. 1982. doi: 10.1109/TMI.1982.4307558 Link
- "Parallel MLEM on Multicore Architectures", Tilman Küstner, Josef Weidendorfer, Jasmine Schirmer, Tobias Klug, Carsten Trinitis, Sibylle I. Ziegler. In: Computational Science - ICCS 2009, LNCS, vol. 5544, pp. 491-500, Springer, 2009. Link
- "Enabling Application-Integrated Proactive Fault Tolerance", Yang, D., Weidendorfer, J., Trinitis, C., Küstner, T., & Ziegler, S. (2018). In Parallel Computing is Everywhere 32 (pp. 475-484). Link
- "Implementation and Evaluation of MLEM algorithm on Intel Xeon Phi Knights Landing (KNL) Processors", Rami Al-Rihawi, Master's Thesis (2018). Link
- "Implementation and Evaluation of MLEM-Algorithm on GPU using CUDA", Apoorva Gupta, Master's Thesis (2018). Link
- "Co-Scheduling in a Task-Based Programming Model", T. Becker, D. Yang, T. Kuestner, M. Schulz. In Proceedings of the 3rd Workshop on Co-Scheduling of HPC Applications (COSH 2018). DOI: 10.14459/2018md1428536.
- "Porting MLEM Algorithm for Heterogeneous Systems", Mengdi Wang, Bachelor's Thesis (2019). Link
- " Exploring high bandwidth memory for PET Image Reconstruction", Dai Yang, Tilman Küstner, Rami Al-Rihawi, Martin Schulz (2019). In Parallel Computing (ParCo 19'). Accepted for Publication. Link t.b.d.
This code is distributed as a open source project under GPL License Version 3. Please refer to LICENSE document.
The CSR Matrix transposition routine (spTrans) in cudacsr4mlem
is provided as an open-source software by Virginia Polytechnic Institute & State University (Virginia Tech) under the GNU Lesser General Public License v2.1.
This project is partially financed by project ENVELOPE, which is supported by Federal Ministry of Education and Research (BMBF) of the Federal Republic of Germany.
Compute Resources for development and testing is partially sponsored by the Leibniz Supercomputing Centre (LRZ).
spTrans is provided by Wang et al., originally published as Parallel Transposition of Sparse Data Structures. Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng. In Proceedings of the 30th International Conference on Supercomputing (ICS), Istanbul, Turkey, June 2016.