How to achieve a good time per epoch?
Zadagu opened this issue · 0 comments
Zadagu commented
A question to the community: on which hardware do you achieve the best time per epoch during training?
I tried the following machines (TensorFlow 2.1 as backend unless noted otherwise):
- i7-7700K CPU
  64GB RAM
  -> ~1200s / 20min per epoch
- dual Xeon CPU E5-2620 v3 @ 2.40GHz
  42GB RAM
  -> ~4800s / 80min per epoch (custom TensorFlow build to use all CPU features, but that didn't improve performance much)
- AWS ml.p2.xlarge
  Nvidia K80
  61GB RAM
  -> ~1892s per epoch (TensorFlow 2.1, custom Docker image based on tensorflow:2.1.0-gpu-py3)
- AWS ml.m5.4xlarge
  16 vCPUs
  64GB RAM
  -> ~7000s per epoch (TensorFlow 1.13, original SageMaker Docker image)
- AWS ml.c5.9xlarge
  36 vCPUs
  72GB RAM
  -> ~7000s per epoch (TensorFlow 1.13, original SageMaker Docker image)
To enable full cuDNN support, I also created a model that uses only tanh as the recurrent activation.
This reduced the time per epoch considerably (ml.p2.xlarge
-> ~420s, down from ~1892s), but the trained model performed far worse.
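For reference, here is a minimal sketch of the cuDNN conditions as I understand them from the TensorFlow 2.x documentation for `tf.keras.layers.LSTM`. The helper `cudnn_blockers` and the constant `CUDNN_LSTM_REQUIREMENTS` are hypothetical names made up for illustration and are not part of TensorFlow; the sketch only inspects a plain dict shaped like the output of `layer.get_config()`. It does not cover the conditions that are not visible in the layer config (masked inputs must be strictly right-padded, and eager execution must be enabled in the outermost context).

```python
# Settings under which tf.keras.layers.LSTM can use the fused cuDNN kernel,
# as listed in the TensorFlow 2.x documentation. These match the TF 2.x
# layer defaults, so a missing key is treated as "requirement met".
CUDNN_LSTM_REQUIREMENTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(layer_config):
    """Return the settings in an LSTM config that rule out the cuDNN kernel.

    `layer_config` is a dict shaped like the output of `layer.get_config()`.
    Hypothetical helper for illustration, not part of TensorFlow.
    """
    blockers = []
    for key, required in CUDNN_LSTM_REQUIREMENTS.items():
        actual = layer_config.get(key, required)
        if actual != required:
            blockers.append(f"{key}={actual!r} (cuDNN needs {required!r})")
    return blockers


# Made-up example config: a non-tanh activation plus recurrent dropout.
example = {
    "activation": "relu",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.2,
}
for reason in cudnn_blockers(example):
    print(reason)
```

An empty result only means that none of the listed settings forces the slower generic kernel; whether the cuDNN kernel is actually used still depends on the GPU build and the input masking.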
Am I missing anything? Is there a way to speed up training?