Numpy reader test (GDS)
asdfry opened this issue · 4 comments
Describe the question.
Specifications of the test server
- GPU: A100 80GB * 8
- System Memory: 2TB
Hello, I am comparing the time it takes to complete one epoch by adjusting the arguments of the numpy reader.
While analyzing the results, I encountered some parts that I do not understand and would like to ask about them.
- When `DALI_GDS_CHUNK_SIZE` is set to 16M, the results look as if a cache were being used.
chunk_size | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
---|---|---|---|---|---|---|
1M | 647 | 653 | 664 | 666 | 658 | 657.6 |
16M | 63 | 39 | 38 | 38 | 37 | 43 |
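For context, a sweep like the one above can be scripted with a small timing helper. This is only a sketch: `run_epoch` is a hypothetical zero-argument callable that consumes one full epoch from the loader, and `DALI_GDS_CHUNK_SIZE` must be set before the pipeline is built for GDS to pick it up.

```python
import os
import time


def time_epochs(run_epoch, chunk_size, n_epochs=5):
    """Set DALI_GDS_CHUNK_SIZE, then time n_epochs calls of run_epoch().

    run_epoch is any zero-argument callable that consumes one full epoch.
    The environment variable must be exported before the DALI pipeline is
    built, so run_epoch should create/build its pipeline after this point.
    """
    os.environ["DALI_GDS_CHUNK_SIZE"] = chunk_size
    times = []
    for _ in range(n_epochs):
        start = time.time()
        run_epoch()
        times.append(time.time() - start)
    return times
```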
- The first epoch is faster when the dataset is larger. (HD: 859G, FHD: 1.9T)
dataset_resolution | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
---|---|---|---|---|---|---|
HD | 360 | 35 | 33 | 33 | 29 | 98 |
FHD | 63 | 39 | 38 | 38 | 37 | 43 |
Here is the source code I used, which was based on an example.
```python
import logging
import os
import time

from nvidia.dali import fn, pipeline_def
from nvidia.dali.plugin.pytorch import DALIClassificationIterator, LastBatchPolicy

logger = logging.getLogger(__name__)


@pipeline_def
def create_dali_pipeline(
    data_dir,
    dali_cpu,
    o_direct,
    prefetch_queue,
    seed,
    shard_id,
    num_shards,
):
    images = fn.readers.numpy(
        device="cpu" if dali_cpu else "gpu",
        file_root=data_dir,
        file_filter="*.image.npy",
        pad_last_batch=True,
        shuffle_after_epoch=True,
        dont_use_mmap=True,
        use_o_direct=o_direct,
        prefetch_queue_depth=prefetch_queue,
        seed=seed,
        shard_id=shard_id,
        num_shards=num_shards,
        name="Reader",
    )
    labels = fn.readers.numpy(
        device="cpu" if dali_cpu else "gpu",
        file_root=data_dir,
        file_filter="*.label.npy",
        pad_last_batch=True,
        shuffle_after_epoch=True,
        dont_use_mmap=True,
        use_o_direct=o_direct,
        prefetch_queue_depth=prefetch_queue,
        seed=seed,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    return images, labels


def main(args):
    # Must be set before the pipeline is built for GDS to pick it up.
    os.environ["DALI_GDS_CHUNK_SIZE"] = args.chunk_size
    train_pipe = create_dali_pipeline(
        batch_size=args.batch_size,
        device_id=args.local_rank,
        num_threads=4,
        data_dir=args.data_dir,
        dali_cpu=args.dali_cpu,
        o_direct=args.o_direct,
        prefetch_queue=args.prefetch_queue,
        seed=12 + args.local_rank,
        shard_id=args.local_rank,
        num_shards=args.world_size,
    )
    if args.local_rank == 0:
        logger.info("Building training pipeline")
    train_pipe.build()
    train_loader = DALIClassificationIterator(
        train_pipe,
        reader_name="Reader",
        last_batch_policy=LastBatchPolicy.PARTIAL,
        auto_reset=True,
    )
    log_interval = len(train_loader) // 20
    for epoch in range(5):
        start = time.time()
        for step, batch in enumerate(train_loader):
            if (args.local_rank == 0) and (log_interval > 0) and (step % log_interval == log_interval - 1):
                logger.info(f"[epoch {epoch + 1}] step: {step + 1} / {len(train_loader)}")
        if args.local_rank == 0:
            logger.info(f"[epoch {epoch + 1}] time: {time.time() - start}")
```
Could you explain the reasons behind these observations?
Check for duplicates
- I have searched the open bugs/issues and have found no duplicates for this bug report
Hi @asdfry,
Thank you for reaching out.
I have a few questions regarding your measurements:
- Do you clear the disk caches before each measurement? Does doing so have any impact on your numbers?
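One common way to clear the page cache between runs on Linux is syncing and then writing `3` to `/proc/sys/vm/drop_caches` (requires root). A minimal sketch, with the target path parameterized purely for illustration:

```python
import subprocess


def drop_page_cache(proc_path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages to disk, then ask the kernel to drop the page
    cache, dentries, and inodes. Writing to /proc/sys/vm/drop_caches
    requires root privileges and only exists on Linux."""
    subprocess.run(["sync"], check=True)
    with open(proc_path, "w") as f:
        f.write("3\n")
```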
- Based on the measurements documented at #3972 (comment), the performance difference is usually observed with much smaller chunk sizes; the one you use should already give similar performance. Can you measure other sizes as well?
- With GDS, the speed should not be sensitive to the dataset size (unless the set is very small, a couple of MB or less). Can you test the performance without GDS enabled?
Hi @JanuszL,
Thank you very much for your prompt response.
As you suggested, I cleared the disk cache before each measurement and tested with both CPU and GPU reader types. (I set the chunk size to 16M for the tests)
With the cache cleared, the results are somewhat more consistent.
However, I am still puzzled by the outcomes of cases idx 1, idx 3, and idx 5...
idx | resolution | reader | protocol | o_direct | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
---|---|---|---|---|---|---|---|---|---|---|
0 | sd | cpu | pcie | FALSE | 236 | 18 | 15 | 13 | 14 | 59.2 |
1 | sd | gpu | rdma | FALSE | 33 | 33 | 33 | 33 | 30 | 32.4 |
2 | hd | cpu | pcie | FALSE | 510 | 40 | 28 | 29 | 26 | 126.6 |
3 | hd | gpu | rdma | FALSE | 34 | 35 | 35 | 35 | 31 | 34 |
4 | fhd | cpu | pcie | FALSE | 830 | 193 | 149 | 121 | 93 | 277.2 |
5 | fhd | gpu | rdma | FALSE | 42 | 36 | 39 | 37 | 38 | 38.4 |
6 | qhd | cpu | pcie | FALSE | 1797 | 1289 | 1162 | 1044 | 1101 | 1278.6 |
7 | qhd | gpu | rdma | FALSE | 1326 | 1100 | 1063 | 1070 | 1055 | 1122.8 |
8 | uhd | cpu | pcie | FALSE | 3343 | 3229 | 3071 | 3076 | 3140 | 3171.8 |
9 | uhd | gpu | rdma | FALSE | 3044 | 2608 | 2558 | 2515 | 2481 | 2641.2 |
Hi @asdfry,
Can you calculate the throughput for each dataset? As I understand it, right now you record the time spent.
I see that uhd is ~2.25x more data than qhd, and the times are also proportionally larger (>2x). For sd, hd, and fhd, the loading times may be hidden by Python overhead: when you time DALI this way, you record the time spent waiting for data, and since DALI prefetches, the ideal case is close to 0 because loading overlaps with the rest of the execution. You can try to capture an Nsight profile to see how long each operator takes.
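As a rough illustration of that conversion, assuming the epoch columns are wall-clock seconds and using the dataset sizes quoted earlier in the thread (FHD ≈ 1.9 TB, one GDS epoch ≈ 38 s):

```python
def throughput_gb_s(dataset_bytes, epoch_seconds):
    """Aggregate read throughput (GB/s) implied by reading the whole
    dataset once in the given epoch time."""
    return dataset_bytes / epoch_seconds / 1e9


# FHD set (~1.9 TB) read in ~38 s with GDS:
print(throughput_gb_s(1.9e12, 38))  # -> 50.0 (GB/s, aggregate across all shards)
```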
Thank you very much for your kind and prompt response to my issue.