sands-lab/omnireduce

Forward Pass Really Slow.

ertza opened this issue · 7 comments

ertza commented

Hi, I have tried two different models, both with and without OFFLOAD_BITMAP, on an A100 GPU with a single worker and a single PS in colocated mode. I see 100% CPU utilization and close to 0% GPU utilization while the training forward pass gets stuck on x = self.relu(self.conv1(x)) for minutes. I reconfirmed that x is a CUDA tensor. Any idea why this might be? Are there any changes that might affect PyTorch's forward pass or its underlying libraries?

How large are these two models? Did you check the GPU memory usage during the forward pass?
OmniReduce allocates the communication buffer on the GPU, which may influence performance.
Other than that, I don't think OmniReduce will affect forward-pass performance, as it is only called during the backward pass.

ertza commented

One was ResNet50, but the other was a toy model, LeNet-5. In any case it gets stuck on the first x = self.relu(self.conv1(x)), and with vanilla PyTorch even 10 epochs of training complete in a few seconds. I'll try to debug further to figure out why this is happening here and will let you know.

ertza commented

Hi, I was able to solve it: some packages installed along with omnireduce conflicted with newer versions already on my system.

I am now stuck in the backward pass. Looking at the daiet debug log I see RX queue stats mapping error; can that be the reason? I already increased the huge pages to 4 GB as suggested previously. As for my setup, I am trying to run in colocated mode with 2 workers. Any idea how to go about debugging from here?

Compiled at Jul 31 2022, 02:17:09.
Configuration file daiet-xx.cfg
** DAIET parameters **
Num updates: 64
Max num pending messages: 64
Worker port: 4000
PS port: 3030
Worker IP: 14.207.254.149
PS0: XXX 14.207.254.149
PS1: YYY 14.207.254.165
Num workers: 2
Number of threads: 4
Workers: 2, PS: 2
CPU freq: 1.8 GHz
Driver: net_mlx5
RX buffer min size: 32
RX queues max number: 1024
TX queues max number: 1024
Per-port RX offload capabilities: 0x0000000000096a1f
Per-port TX offload capabilities: 0x00000000000c96af
Per-queue RX offload capabilities: 0x000000000009681f
Per-queue TX offload capabilities: 0x0000000000000000
RX descriptors limits: [0,65535] aligned: 1
TX descriptors limits: [0,65535] aligned: 1
Initializing port 0...
RX IPv4 checksum offload enabled
RX UDP checksum offload enabled
TX IPv4 checksum offload enabled
TX UDP checksum offload enabled
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
RX queue stats mapping error Unknown error -95
TX queue stats mapping error Unknown error -95
Initialization ended. Port 0 address: XXX
Checking link status
Link Up. Speed 200000 Mbps - Full-duplex
Worker core: 11 worker id: 1
Worker core: 11 worker id: 1
PS core: 12 PS id: 2
PS core: 12 PS id: 2
Worker core: 10 worker id: 0
Worker core: 10 worker id: 0
PS core: 13 PS id: 3
PS core: 13 PS id: 3
First burst sent (worker 0): 64/64
First burst sent (worker 1): 64/64

Nothing happens after this point.

Did you try running the benchmark? Does it work?

ertza commented

Hi, no, I was trying to run it in colocated mode without running separate aggregators, as that's what I'm most interested in. While trying to understand the colocated mode, I see that only a master thread gets created by the StartMaster() call in daiet/src/DaietContext.cpp:26. Looking at the master at daiet/src/daiet.cpp:303, I see that it starts either the worker or the PS, based on the lcore id:

    if (dpdk_data.core_to_thread_id[rte_lcore_id()] < num_worker_threads)
        worker(dctx_ptr);
    else
        ps(&num_worker_threads);

So how do both the worker and the PS get started, given that I only saw master being called once in there? Where do the multiple worker and PS threads get created?

Hi,
the worker and PS threads are bound to DPDK lcores; this is controlled by the dpdk section in your config file.

[dpdk]
# Number of cores
cores = 0-3
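
As background, the cores listed there become DPDK EAL lcores: one of them is the master lcore, and the remaining ones are the lcores that worker and PS threads get launched onto. Below is a minimal sketch of that, using only generic DPDK calls and assuming the configured cores are handed to the EAL as a -l 0-3 style core list (daiet's own config handling may differ in detail):

    // Minimal sketch, assuming the process receives EAL args equivalent to "-l 0-3".
    // Generic DPDK API only; this is not taken from the daiet sources.
    #include <cstdio>
    #include <rte_eal.h>
    #include <rte_lcore.h>

    int main(int argc, char **argv) {
        if (rte_eal_init(argc, argv) < 0)     // parses the core list, creates one pinned thread per lcore
            return -1;

        printf("master lcore: %u, total lcores: %u\n",
               rte_get_master_lcore(),        // rte_get_main_lcore() in DPDK >= 20.11
               rte_lcore_count());

        unsigned lcore_id;
        RTE_LCORE_FOREACH_SLAVE(lcore_id)     // RTE_LCORE_FOREACH_WORKER in DPDK >= 20.11
            printf("lcore %u is available for a worker or PS thread\n", lcore_id);

        rte_eal_cleanup();
        return 0;
    }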

They are launched here:

    RTE_LCORE_FOREACH_SLAVE(lcore_id) {
    #ifndef COLOCATED
        rte_eal_remote_launch(worker, dctx_ptr, lcore_id);
    #else
        if (dpdk_data.core_to_thread_id[lcore_id] < num_worker_threads)
            rte_eal_remote_launch(worker, dctx_ptr, lcore_id);
        else
            rte_eal_remote_launch(ps, &num_worker_threads, lcore_id);
    #endif
    }
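
To make the dispatch explicit, here is a small self-contained sketch of that launch pattern using generic DPDK calls only; run_worker, run_ps and the fixed worker/PS split are placeholder stand-ins for daiet's worker(), ps() and core_to_thread_id mapping, not the actual implementation:

    // Hedged sketch of the launch pattern above (generic DPDK, not the daiet code).
    #include <cstdio>
    #include <rte_eal.h>
    #include <rte_launch.h>
    #include <rte_lcore.h>

    static unsigned num_worker_threads = 2;      // illustrative split: half workers, half PS

    static int run_worker(void *arg) {           // lcore functions take void* and return int
        (void)arg;
        printf("worker thread running on lcore %u\n", rte_lcore_id());
        return 0;
    }

    static int run_ps(void *arg) {
        (void)arg;
        printf("PS thread running on lcore %u\n", rte_lcore_id());
        return 0;
    }

    int main(int argc, char **argv) {
        if (rte_eal_init(argc, argv) < 0)        // e.g. started with "-l 0-3"
            return -1;

        unsigned lcore_id, launched = 0;

        // Exactly one function is launched on each remaining lcore; this loop is
        // where the multiple worker/PS threads come from.
        RTE_LCORE_FOREACH_SLAVE(lcore_id) {      // RTE_LCORE_FOREACH_WORKER in DPDK >= 20.11
            if (launched++ < num_worker_threads)
                rte_eal_remote_launch(run_worker, NULL, lcore_id);
            else
                rte_eal_remote_launch(run_ps, &num_worker_threads, lcore_id);
        }

        rte_eal_mp_wait_lcore();                 // block until all launched lcores return
        rte_eal_cleanup();
        return 0;
    }

rte_eal_remote_launch() starts one function on one specific lcore, so this loop is what creates the multiple worker and PS threads; the if/else you quoted from master() does the same dispatch for the lcore that executes it. In your log you can see the result of the colocated split: two worker lcores (cores 10 and 11) and two PS lcores (cores 12 and 13).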

Closing the issue due to lack of response. Feel free to reopen.