NVIDIA/DALI

Numpy reader test (GDS)

asdfry opened this issue · 4 comments

Describe the question.

Specifications of the test server

  1. GPU: A100 80GB * 8
  2. System Memory: 2TB

Hello, I am comparing the time it takes to complete one epoch while varying the arguments of the numpy reader.
While analyzing the results, I ran into a few things I do not understand and would like to ask about them.

  1. When DALI_GDS_CHUNK_SIZE is set to 16M, the results look as if a cache were being used (epoch times in seconds):

| chunk_size | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
| --- | --- | --- | --- | --- | --- | --- |
| 1M | 647 | 653 | 664 | 666 | 658 | 657.6 |
| 16M | 63 | 39 | 38 | 38 | 37 | 43 |

  2. The first epoch is faster for the larger dataset (HD: 859G, FHD: 1.9T):

| dataset_resolution | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
| --- | --- | --- | --- | --- | --- | --- |
| HD | 360 | 35 | 33 | 33 | 29 | 98 |
| FHD | 63 | 39 | 38 | 38 | 37 | 43 |

Here is the source code I used, which was based on an example.

```python
# Imports added for completeness; `args` and `logger` come from the
# surrounding script (argparse and logging setup), which is omitted here.
import os
import time

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
from nvidia.dali.plugin.pytorch import DALIClassificationIterator, LastBatchPolicy


@pipeline_def
def create_dali_pipeline(
    data_dir,
    dali_cpu,
    o_direct,
    prefetch_queue,
    seed,
    shard_id,
    num_shards,
):
    images = fn.readers.numpy(
        device="cpu" if dali_cpu else "gpu",
        file_root=data_dir,
        file_filter="*.image.npy",
        pad_last_batch=True,
        shuffle_after_epoch=True,
        dont_use_mmap=True,
        use_o_direct=o_direct,
        prefetch_queue_depth=prefetch_queue,
        seed=seed,
        shard_id=shard_id,
        num_shards=num_shards,
        name="Reader",
    )
    labels = fn.readers.numpy(
        device="cpu" if dali_cpu else "gpu",
        file_root=data_dir,
        file_filter="*.label.npy",
        pad_last_batch=True,
        shuffle_after_epoch=True,
        dont_use_mmap=True,
        use_o_direct=o_direct,
        prefetch_queue_depth=prefetch_queue,
        seed=seed,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    return images, labels


def main(args):
    # Set the GDS chunk size before the pipeline is built.
    os.environ["DALI_GDS_CHUNK_SIZE"] = args.chunk_size

    train_pipe = create_dali_pipeline(
        batch_size=args.batch_size,
        device_id=args.local_rank,
        num_threads=4,
        data_dir=args.data_dir,
        dali_cpu=args.dali_cpu,
        o_direct=args.o_direct,
        prefetch_queue=args.prefetch_queue,
        seed=12 + args.local_rank,
        shard_id=args.local_rank,
        num_shards=args.world_size,
    )

    if args.local_rank == 0:
        logger.info("Building training pipeline")
    train_pipe.build()

    train_loader = DALIClassificationIterator(
        train_pipe,
        reader_name="Reader",
        last_batch_policy=LastBatchPolicy.PARTIAL,
        auto_reset=True,
    )

    log_interval = len(train_loader) // 20
    for epoch in range(5):
        start = time.time()
        # Iterate over the loader without training, so the loop measures only
        # the time spent waiting for data.
        for step, batch in enumerate(train_loader):
            if (args.local_rank == 0) and (log_interval > 0) and (step % log_interval == log_interval - 1):
                logger.info(f"[epoch {epoch+1}] step: {step+1} / {len(train_loader)}")
        if args.local_rank == 0:
            logger.info(f"[epoch {epoch+1}] time: {time.time() - start}")
```

Could you explain the reasons behind these observations?

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

Hi @asdfry,

Thank you for reaching out.
I have a few questions regarding your measurements:

  1. Do you clear the disk caches before each measurement? Do you see any impact of doing so on your numbers? (A sketch of how this can be done follows this list.)
  2. Based on the measurements documented at #3972 (comment), the performance difference is usually observed with much smaller chunk sizes; the size you use should provide similar performance. Can you measure other sizes as well?
  3. With GDS, the speed should not be sensitive to the dataset size (unless it is very small, e.g. a couple of MBs or less). Can you test the performance without GDS enabled?
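
For reference, a minimal sketch of how the page cache can be dropped between measurements on Linux (this is an assumption about the procedure, not something described in this thread; it requires root privileges):

```python
import subprocess

def drop_page_cache():
    # Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
    # Equivalent to: sync && echo 3 > /proc/sys/vm/drop_caches
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```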

Hi @JanuszL,
Thank you very much for your prompt response.
As you suggested, I cleared the disk cache before each measurement and tested with both the CPU and GPU reader types (the chunk size was set to 16M for all tests).
After clearing the cache, the results are somewhat more consistent.
However, I am still puzzled by the results for idx 1, idx 3, and idx 5...

| idx | resolution | reader | protocol | o_direct | epoch_1 | epoch_2 | epoch_3 | epoch_4 | epoch_5 | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sd | cpu | pcie | FALSE | 236 | 18 | 15 | 13 | 14 | 59.2 |
| 1 | sd | gpu | rdma | FALSE | 33 | 33 | 33 | 33 | 30 | 32.4 |
| 2 | hd | cpu | pcie | FALSE | 510 | 40 | 28 | 29 | 26 | 126.6 |
| 3 | hd | gpu | rdma | FALSE | 34 | 35 | 35 | 35 | 31 | 34 |
| 4 | fhd | cpu | pcie | FALSE | 830 | 193 | 149 | 121 | 93 | 277.2 |
| 5 | fhd | gpu | rdma | FALSE | 42 | 36 | 39 | 37 | 38 | 38.4 |
| 6 | qhd | cpu | pcie | FALSE | 1797 | 1289 | 1162 | 1044 | 1101 | 1278.6 |
| 7 | qhd | gpu | rdma | FALSE | 1326 | 1100 | 1063 | 1070 | 1055 | 1122.8 |
| 8 | uhd | cpu | pcie | FALSE | 3343 | 3229 | 3071 | 3076 | 3140 | 3171.8 |
| 9 | uhd | gpu | rdma | FALSE | 3044 | 2608 | 2558 | 2515 | 2481 | 2641.2 |

Hi @asdfry,

Can you calculate the throughput for each dataset? As I understand it, you currently record only the elapsed time.
I see that uhd is ~2.25x more data than qhd, and the times are also proportionally bigger (>2x). For sd, hd, and fhd, the loading times may be hidden by the Python overhead. When you time DALI that way, you record the time spent waiting for data; since DALI prefetches, the ideal case should be close to 0, because loading overlaps with the rest of the execution. You can try to capture an Nsight Systems profile to see how long each operator takes.
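
As a rough illustration of the kind of throughput calculation being asked for (a sketch, not part of the original discussion; it assumes the whole dataset is read exactly once per epoch):

```python
def throughput_gb_per_s(dataset_bytes: float, epoch_seconds: float) -> float:
    # Aggregate read throughput across all shards, assuming the full dataset
    # is read once per epoch.
    return dataset_bytes / epoch_seconds / 1e9

# Example using the FHD numbers from this thread (~1.9 TB, ~38 s steady-state epoch):
# throughput_gb_per_s(1.9e12, 38) -> ~50 GB/s aggregated over the 8 GPUs.
```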

Thank you very much for your kind and prompt response to my issue.