wayveai/mile

Questions about training

Closed this issue · 0 comments

HITKJ commented

I attempted training on an 8xA800 machine with the following configuration:

N_WORKERS: 8
GPUS: 8
BATCHSIZE: 8
STEPS: 50000

# OPTIMIZER:
#   ACCUMULATE_GRAD_BATCHES: 1

RECEPTIVE_FIELD: 6
FUTURE_HORIZON: 6

VAL_CHECK_INTERVAL: 3000

[screenshot]
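
For reference, here is a minimal, self-contained sketch of how I assume config values like N_WORKERS / GPUS / BATCHSIZE / STEPS end up in a DataLoader and a PyTorch Lightning Trainer; the dataset and model below are dummies for illustration only, not mile's actual code.

```python
# Sketch under assumptions: dummy data/model stand in for mile's real ones.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

N_WORKERS, GPUS, BATCHSIZE, STEPS = 8, 8, 8, 50_000
VAL_CHECK_INTERVAL = 3000


class DummyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


if __name__ == "__main__":
    # Dummy in-memory data; the real dataset reads and decodes files on disk,
    # which is where the per-batch cost comes from.
    dataset = TensorDataset(torch.randn(250_000, 16), torch.randn(250_000, 1))
    train_loader = DataLoader(
        dataset,
        batch_size=BATCHSIZE,
        num_workers=N_WORKERS,   # worker processes preparing batches in parallel
        shuffle=True,
        pin_memory=True,
    )
    trainer = pl.Trainer(
        gpus=GPUS,               # PL 1.x API; newer releases use devices=/accelerator=
        max_steps=STEPS,
        val_check_interval=VAL_CHECK_INTERVAL,
    )
    trainer.fit(DummyModel(), train_loader)
```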

When the iterations are proceeding normally, the CPU utilization appears as follows:
[screenshot: CPU utilization while iterations are running]

When no iterations are happening and training is stalled, CPU utilization looks like this:
[screenshot: CPU utilization while training is stalled]

My training time comes to nearly 200 hours, yet you mentioned that training on 8xV100 should take only about two days, which leaves me quite puzzled.

My conclusion is that the current bottleneck is in dataset loading and processing: with N_WORKERS set to 8, the progress bar gets stuck for a long time every 8 iterations and GPU utilization drops to 0. Could there be any improper settings on my end? I look forward to your response.
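
To check whether data loading is really the bottleneck, a rough timing test along these lines might help; `train_loader` below stands for the actual training DataLoader, which is an assumption on my part rather than a name from mile's code.

```python
# Rough diagnostic: pull batches from the DataLoader with no GPU work at all
# and print how long each one takes to arrive.
import time
from torch.utils.data import DataLoader


def time_batches(loader: DataLoader, n_batches: int = 24) -> None:
    it = iter(loader)
    start = time.time()
    for i in range(n_batches):
        _ = next(it)
        print(f"batch {i}: {time.time() - start:.1f}s elapsed")


# time_batches(train_loader)  # train_loader: the training DataLoader built from this config
```

If the first few batches arrive almost immediately and the loop then stalls in bursts of roughly N_WORKERS iterations, the workers cannot prepare samples as fast as the GPUs consume them; the usual knobs are more workers per GPU, `persistent_workers=True`, a larger `prefetch_factor`, `pin_memory=True`, or faster storage / offline preprocessing.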
