eloialonso/iris

Actor Critic training time

Recoan0 opened this issue · 1 comment

Hi,

I am running IRIS on a cluster with 4 A100 GPUs. On a single GPU, training the actor-critic for one epoch (200 steps) takes around 25 minutes, which seems very long. I tried parallelising the code across 4 GPUs with torch DDP, but that actually slowed AC training down to 56 minutes per epoch. Profiling the code to find out where the time goes, I concluded that the world_model_env used inside the actor-critic's imagine function accounts for almost all of it, in both single-GPU and DDP training:

[Screenshot: profiler output, ac_train_slow]
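For context, here is a minimal sketch of the kind of profiling I mean, using torch.profiler; the model and tensor shapes below are placeholders rather than the actual IRIS actor-critic rollout:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model standing in for the world-model rollout inside imagine();
# the real call in IRIS is different, this only shows the profiling pattern.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()
x = torch.randn(64, 512, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("imagine_rollout"):  # label the region suspected to be slow
        for _ in range(20):
            x = model(x)
    torch.cuda.synchronize()  # make sure all CUDA work is captured

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```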

Is it normal for the AC to train for 25 minutes per epoch? Is there a way to speed this up, either on single GPU or parallel training?

I am using a custom CARLA Gym env with observations of size 64 x 64 x 3.

Thank you.

Hi,

On our side, one epoch of AC training with an A100 GPU takes around 11 minutes.

Depending on your PyTorch version (>= 1.12), you may need to add the line torch.backends.cuda.matmul.allow_tf32 = True in the __init__ function of the Trainer class:

def __init__(self, cfg: DictConfig) -> None:

See the PyTorch notes on TF32 on Ampere devices for an explanation.
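As a rough sketch of where the line would go (the rest of Trainer.__init__ is omitted here; only the TF32 settings matter, and the cuDNN line is optional since it is already the default):

```python
import torch
from omegaconf import DictConfig


class Trainer:
    def __init__(self, cfg: DictConfig) -> None:
        # Allow TF32 for float32 matmuls on Ampere GPUs (e.g. A100). PyTorch >= 1.12
        # disables this by default, which makes transformer matmuls noticeably slower.
        torch.backends.cuda.matmul.allow_tf32 = True
        # cuDNN convolutions already allow TF32 by default; being explicit doesn't hurt.
        torch.backends.cudnn.allow_tf32 = True
        # ... rest of the original __init__ (config parsing, model and optimizer setup) ...
```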

If that does not fix your issue, could you try running IRIS with the default configuration on Breakout and report the training time for the AC?