NVIDIA/DALI

Numpy reader test (GDS)

asdfry opened this issue · 4 comments

Describe the question.

Specifications of the test server

  1. GPU: A100 80GB * 8
  2. System Memory: 2TB

Hello, I am comparing the time it takes to complete one epoch while varying the arguments of the numpy reader.
While analyzing the results, I ran into a few things I do not understand and would like to ask about them.

  1. When DALI_GDS_CHUNK_SIZE is set to 16M, the results look as if a cache were being used (epoch times in seconds):

| chunk_size | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
| --- | --- | --- | --- | --- | --- | --- |
| 1M | 647 | 653 | 664 | 666 | 658 | 657.6 |
| 16M | 63 | 39 | 38 | 38 | 37 | 43 |

  2. The first epoch is faster for the larger dataset (HD: 859G, FHD: 1.9T):

| dataset_resolution | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
| --- | --- | --- | --- | --- | --- | --- |
| HD | 360 | 35 | 33 | 33 | 29 | 98 |
| FHD | 63 | 39 | 38 | 38 | 37 | 43 |

Here is the source code I used, which was based on an example.

```python
# Imports added for completeness; `args` and `logger` come from the
# surrounding script (argparse and logging setup), which is omitted here.
import os
import time

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
from nvidia.dali.plugin.pytorch import DALIClassificationIterator, LastBatchPolicy


@pipeline_def
def create_dali_pipeline(
    data_dir,
    dali_cpu,
    o_direct,
    prefetch_queue,
    seed,
    shard_id,
    num_shards,
):
    images = fn.readers.numpy(
        device="cpu" if dali_cpu else "gpu",
        file_root=data_dir,
        file_filter="*.image.npy",
        pad_last_batch=True,
        shuffle_after_epoch=True,
        dont_use_mmap=True,
        use_o_direct=o_direct,
        prefetch_queue_depth=prefetch_queue,
        seed=seed,
        shard_id=shard_id,
        num_shards=num_shards,
        name="Reader",
    )
    labels = fn.readers.numpy(
        device="cpu" if dali_cpu else "gpu",
        file_root=data_dir,
        file_filter="*.label.npy",
        pad_last_batch=True,
        shuffle_after_epoch=True,
        dont_use_mmap=True,
        use_o_direct=o_direct,
        prefetch_queue_depth=prefetch_queue,
        seed=seed,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    return images, labels


def main(args):
    # Set the GDS chunk size before the pipeline is built.
    os.environ["DALI_GDS_CHUNK_SIZE"] = args.chunk_size

    train_pipe = create_dali_pipeline(
        batch_size=args.batch_size,
        device_id=args.local_rank,
        num_threads=4,
        data_dir=args.data_dir,
        dali_cpu=args.dali_cpu,
        o_direct=args.o_direct,
        prefetch_queue=args.prefetch_queue,
        seed=12 + args.local_rank,
        shard_id=args.local_rank,
        num_shards=args.world_size,
    )

    if args.local_rank == 0:
        logger.info("Building training pipeline")
    train_pipe.build()

    train_loader = DALIClassificationIterator(
        train_pipe,
        reader_name="Reader",
        last_batch_policy=LastBatchPolicy.PARTIAL,
        auto_reset=True,
    )

    log_interval = len(train_loader) // 20
    for epoch in range(5):
        start = time.time()
        # Iterate over the loader without training, so the loop measures only
        # the time spent waiting for data.
        for step, batch in enumerate(train_loader):
            if (args.local_rank == 0) and (log_interval > 0) and (step % log_interval == log_interval - 1):
                logger.info(f"[epoch {epoch+1}] step: {step+1} / {len(train_loader)}")
        if args.local_rank == 0:
            logger.info(f"[epoch {epoch+1}] time: {time.time() - start}")
```

Could you explain the reasons behind these observations?

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

Hi @asdfry,

Thank you for reaching out.
I have a few questions regarding your measurements:

  1. Do you clear the disk caches before each measurement? Do you see any impact of doing so on your numbers? (A sketch of how this can be done follows this list.)
  2. Based on the measurements documented at #3972 (comment), the performance difference is usually observed with much smaller chunk sizes; the size you use should provide similar performance. Can you measure other sizes as well?
  3. With GDS, the speed should not be sensitive to the dataset size (unless it is very small, e.g. a couple of MBs or less). Can you test the performance without GDS enabled?
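
For reference, a minimal sketch of how the page cache can be dropped between measurements on Linux (this is an assumption about the procedure, not something described in this thread; it requires root privileges):

```python
import subprocess

def drop_page_cache():
    # Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
    # Equivalent to: sync && echo 3 > /proc/sys/vm/drop_caches
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```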

Hi @JanuszL,
Thank you very much for your prompt response.
As you suggested, I cleared the disk cache before each measurement and tested with both the CPU and GPU reader types (the chunk size was set to 16M for all tests).
After clearing the cache, the results are somewhat more consistent.
However, I am still puzzled by the results for idx 1, idx 3, and idx 5...

| idx | resolution | reader | protocol | o_direct | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sd | cpu | pcie | FALSE | 236 | 18 | 15 | 13 | 14 | 59.2 |
| 1 | sd | gpu | rdma | FALSE | 33 | 33 | 33 | 33 | 30 | 32.4 |
| 2 | hd | cpu | pcie | FALSE | 510 | 40 | 28 | 29 | 26 | 126.6 |
| 3 | hd | gpu | rdma | FALSE | 34 | 35 | 35 | 35 | 31 | 34 |
| 4 | fhd | cpu | pcie | FALSE | 830 | 193 | 149 | 121 | 93 | 277.2 |
| 5 | fhd | gpu | rdma | FALSE | 42 | 36 | 39 | 37 | 38 | 38.4 |
| 6 | qhd | cpu | pcie | FALSE | 1797 | 1289 | 1162 | 1044 | 1101 | 1278.6 |
| 7 | qhd | gpu | rdma | FALSE | 1326 | 1100 | 1063 | 1070 | 1055 | 1122.8 |
| 8 | uhd | cpu | pcie | FALSE | 3343 | 3229 | 3071 | 3076 | 3140 | 3171.8 |
| 9 | uhd | gpu | rdma | FALSE | 3044 | 2608 | 2558 | 2515 | 2481 | 2641.2 |

Hi @asdfry,

Can you calculate the throughput for each dataset? As I understand it, you currently record only the elapsed time.
I see that uhd is ~2.25x more data than qhd, and the times are also proportionally bigger (>2x). For sd, hd, and fhd, the loading times may be hidden by the Python overhead. When you time DALI that way, you record the time spent waiting for data; since DALI prefetches, the ideal case should be close to 0, because loading overlaps with the rest of the execution. You can try to capture an Nsight Systems profile to see how long each operator takes.
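
As a rough illustration of the kind of throughput calculation being asked for (a sketch, not part of the original discussion; it assumes the whole dataset is read exactly once per epoch):

```python
def throughput_gb_per_s(dataset_bytes: float, epoch_seconds: float) -> float:
    # Aggregate read throughput across all shards, assuming the full dataset
    # is read once per epoch.
    return dataset_bytes / epoch_seconds / 1e9

# Example using the FHD numbers from this thread (~1.9 TB, ~38 s steady-state epoch):
# throughput_gb_per_s(1.9e12, 38) -> ~50 GB/s aggregated over the 8 GPUs.
```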

Thank you very much for your kind and prompt response to my issue.