How to achieve a good time per epoch?

Question

Zadagu opened this issue 4 years ago · 0 comments

A question to the community: on which hardware do you achieve the best time per epoch during training?

I tried the following machines (TensorFlow 2.1 as backend):

i7-7700K CPU
  64GB RAM
  -> ~1200s / 20min per epoch

dual Xeon CPU E5-2620 v3 @ 2.40GHz
  42GB RAM
  -> ~4800s / 80min per epoch (custom-built TensorFlow to use all CPU features, but that didn't increase performance much)

AWS ml.p2.xlarge
  Nvidia K80
  61GB RAM
  -> ~1892s per epoch (TensorFlow 2.1, custom Docker image based on tensorflow:2.1.0-gpu-py3)

AWS ml.m5.4xlarge
  16 vCPUs
  64GB RAM
  -> ~7000s per epoch (TensorFlow 1.13, original SageMaker Docker image)

AWS ml.c5.9xlarge
  36 vCPUs
  72GB RAM
  -> ~7000s per epoch (TensorFlow 1.13, original SageMaker Docker image)
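
To make the figures above easier to compare, here is a small sketch that converts each timing to minutes and expresses it as a factor of the fastest machine. The numbers are the approximate values from the list; the dictionary keys and the helper name `relative_to_fastest` are made up for illustration.

```python
# Approximate per-epoch timings in seconds, copied from the list above.
epoch_seconds = {
    "i7-7700K": 1200,
    "dual Xeon E5-2620 v3": 4800,
    "ml.p2.xlarge (K80)": 1892,
    "ml.m5.4xlarge": 7000,
    "ml.c5.9xlarge": 7000,
}


def relative_to_fastest(timings):
    """Map each name to (minutes per epoch, slowdown factor vs. the fastest entry)."""
    fastest = min(timings.values())
    return {
        name: (seconds / 60, seconds / fastest)
        for name, seconds in timings.items()
    }


summary = relative_to_fastest(epoch_seconds)

# Print the machines from fastest to slowest.
for name, (minutes, factor) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    print(f"{name:24s} {minutes:6.1f} min  {factor:4.1f}x")
```

Since the timings come from different TensorFlow versions and Docker images, the factors are only a rough guide, not a controlled benchmark.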

To enable full cuDNN support, I also created a model that uses only tanh as the recurrent activation.
This reduced the time per epoch (ml.p2.xlarge -> ~420s), but the trained model performed far worse.
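
For reference, the TensorFlow 2.x documentation for `tf.keras.layers.LSTM` lists the layer arguments that must hold for the fused cuDNN kernel to be used. The sketch below encodes those argument checks in plain Python so a layer configuration can be inspected before training. The helper name `cudnn_blockers` is hypothetical and not part of TensorFlow. It also does not cover the remaining documented conditions: masked inputs must be strictly right-padded, and eager execution must be enabled in the outermost context.

```python
def cudnn_blockers(
    activation="tanh",
    recurrent_activation="sigmoid",
    recurrent_dropout=0.0,
    unroll=False,
    use_bias=True,
):
    """Return the reasons a tf.keras LSTM config would miss the cuDNN kernel.

    Hypothetical helper: the defaults mirror the TF 2.x LSTM layer defaults,
    and the checks follow the requirements listed in the layer documentation.
    An empty list means none of the checked arguments rule out cuDNN.
    """
    blockers = []
    if activation != "tanh":
        blockers.append("activation must be 'tanh'")
    if recurrent_activation != "sigmoid":
        blockers.append("recurrent_activation must be 'sigmoid'")
    if recurrent_dropout != 0:
        blockers.append("recurrent_dropout must be 0")
    if unroll:
        blockers.append("unroll must be False")
    if not use_bias:
        blockers.append("use_bias must be True")
    return blockers


# Example: a layer with recurrent dropout would fall back to the generic kernel.
print(cudnn_blockers(recurrent_dropout=0.2))
```

If the list is non-empty, the layer should still train, but presumably on the slower non-fused implementation, which may explain large differences in time per epoch on a GPU.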

Am I missing anything? Is there a way to speed up training?