Parallelize ExternalSource to maximize loading-pipeline throughput
- This example expects clean input data
- See the individual source files for more detailed comments
conda create --name dali
conda activate dali
conda install cupy -y
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/weekly nvidia-dali-weekly-cuda100
pip install aiapy
https://drive.google.com/file/d/1IMiCcm49WEw_cyJF4GCZLVW9_Gq77V-2/view?usp=sharing
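
The example feeds a DALI pipeline from ExternalSource and parallelizes the Python loading callback across --num_read_processes worker processes instead of running it in the single interpreter thread. Below is a minimal sketch of that pattern; the ExternalInputCallable class, the array shape, and the sample count are placeholders (the real scripts presumably read AIA FITS files, hence aiapy above), but the parallel=True / py_num_workers wiring is the DALI mechanism being exercised. The tables that follow look like Nsight Systems NVTX range summaries; the "run" range presumably wraps each version's main loading loop.

import numpy as np
from nvidia.dali import pipeline_def, fn, types

class ExternalInputCallable:
    # Placeholder per-sample loader; the real scripts read one file per sample.
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __call__(self, sample_info):
        # DALI invokes this once per sample, inside one of the worker processes.
        if sample_info.idx_in_epoch >= self.num_samples:
            raise StopIteration
        return np.full((512, 512), sample_info.idx_in_epoch, dtype=np.float32)

@pipeline_def(batch_size=16, num_threads=4, device_id=0,
              py_num_workers=16,         # corresponds to --num_read_processes
              py_start_method="spawn")
def loading_pipeline(num_samples):
    data = fn.external_source(source=ExternalInputCallable(num_samples),
                              parallel=True,   # run the callback in worker processes
                              batch=False,     # the callback returns one sample per call
                              dtype=types.FLOAT)
    return data.gpu()

if __name__ == "__main__":
    pipe = loading_pipeline(num_samples=256)
    pipe.build()
    batch, = pipe.run()
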
Version 1 (--num_read_processes=16 --batch_size=16)
Time(%) Time (ns) Instances Avg (ns) Min (ns) Max (ns) Range
---------- ------------ ---------- -------------- ------------ ------------ ------
96.9 70224833901 1 70224833901.0 70224833901 70224833901 run
Version 2 (--num_read_processes=16 --batch_size=16 --num_gpus=4)
Time(%) Time (ns) Instances Avg (ns) Min (ns) Max (ns) Range
---------- ------------ ---------- -------------- ------------ ------------ ------
100.0 23609338458 1 23609338458.0 23609338458 23609338458 run
Version 3 (--num_read_processes=16 --batch_size=16)
Time(%) Time (ns) Instances Avg (ns) Min (ns) Max (ns) Range
---------- ------------ ---------- -------------- ------------ ------------ ------
95.0 34849683649 1 34849683649.0 34849683649 34849683649 run
Version 4 (--num_read_processes=16 --batch_size=16 --num_gpus=4)
Time(%) Time (ns) Instances Avg (ns) Min (ns) Max (ns) Range
---------- ------------ ---------- -------------- ------------ ------------ ------
100.0 15935145551 1 15935145551.0 15935145551 15935145551 run
- v2 is ~3.0x faster than v1
- v3 is ~2.0x faster than v1
- v4 is ~4.4x faster than v1
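
The --num_gpus=4 versions (v2 and v4) presumably get their speedup by running one pipeline per GPU, each reading a disjoint shard of the dataset. A minimal sketch of that layout, assuming the file list is simply strided across GPUs; the ShardedSource callback and the placeholder load are illustrative, not the repo's exact code:

import numpy as np
from nvidia.dali import pipeline_def, fn, types

class ShardedSource:
    # Per-sample callback that only sees this GPU's slice of the file list.
    def __init__(self, files, shard_id, num_shards):
        self.files = files[shard_id::num_shards]

    def __call__(self, sample_info):
        if sample_info.idx_in_epoch >= len(self.files):
            raise StopIteration
        # Placeholder load; a real callback would read self.files[sample_info.idx_in_epoch].
        return np.zeros((512, 512), dtype=np.float32)

def build_pipelines(files, num_gpus=4, batch_size=16, num_read_processes=16):
    pipes = []
    for gpu_id in range(num_gpus):
        @pipeline_def(batch_size=batch_size, num_threads=4, device_id=gpu_id,
                      py_num_workers=num_read_processes, py_start_method="spawn")
        def gpu_pipeline():
            data = fn.external_source(source=ShardedSource(files, gpu_id, num_gpus),
                                      parallel=True, batch=False, dtype=types.FLOAT)
            return data.gpu()   # each pipeline copies its shard to its own GPU
        pipes.append(gpu_pipeline())
    for pipe in pipes:
        pipe.build()
    return pipes

In this sketch every pipeline also keeps its own pool of reader processes, so the readers compete for the same storage and CPU, which is one plausible reason the measured speedups fall short of a clean 4x.
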