trzy/FasterRCNN

Training times too long

mnaranjorion opened this issue · 6 comments

Hello,

first of all, thanks for the implementation. We are running some tests on a machine with two "GeForce RTX 3070, 7982 MiB" GPUs and have some questions, especially about how long each training epoch takes.

  • We're running training with a minimal configuration, with data augmentation disabled and image caching enabled:

python -m tf2.FasterRCNN --train --dataset-dir=./own_dataset/ --epochs=1 --learning-rate=1e-3 --save-best-to=fasterrcnn_tf2_tmp.h5 --no-augment --cache-images

each epoch takes almost 2 hours.

  • When we launch it with data augmentation enabled and without image caching:

python -m tf2.FasterRCNN --train --dataset-dir=./own_dataset/ --epochs=1 --learning-rate=1e-3 --save-best-to=fasterrcnn_tf2_tmp.h5

we get the same times per epoch.

  • If we include options like:

--debug-dir=/tmp/tf_debugger/

the duration increases to more than 8 hours per epoch.

Are we misconfiguring something, or is this simply due to the dataset we are using?
Why don't we see any speedup from disabling data augmentation and enabling image caching?

Thank you very much!

trzy commented

This does indeed sound too long! I recall a similar, though not identical, issue a long time ago with my VGG-16 repo on Windows: the initial epoch was fast, but subsequent epochs became unusably slow. What you're describing sounds different.

A few questions:

  1. What OS?
  2. What version of TensorFlow? Can you do a pip freeze and post the results here?
  3. Do you have access to any other systems with a 30-series GPU that you can test on?
  4. How fast does the PyTorch version run?

I wonder if this could be an issue with either the version of TF you're using or with CUDA. However, I just tried it in a new Conda environment with the latest version of TensorFlow (a fresh pip install -r requirements.txt) on my 3090 in Windows, and 2 epochs plus the final validation pass took 40 minutes (roughly 16 minutes for the first epoch, 8 for the second, 13 for the validation pass, and a few minutes at startup to parse the dataset).
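
If it helps narrow this down, a quick check like the one below (standard TensorFlow calls, nothing specific to this repo) prints the CUDA and cuDNN versions your TF build was compiled against, so we can compare environments:

# Sanity check: report the CUDA/cuDNN versions TensorFlow was built against.
# Uses standard TF APIs only; keys may be absent on CPU-only builds.
import tensorflow as tf

print("TF version :", tf.__version__)
build = tf.sysconfig.get_build_info()
print("CUDA build :", build.get("cuda_version"))
print("cuDNN build:", build.get("cudnn_version"))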

If there is no obvious solution, I recommend filing a bug with TensorFlow. Unfortunately, the last time I did this, the bug was closed after a year, by which point some TF update had already fixed it.

mnaranjorion commented

Answering your questions:

  • The work is being done on Ubuntu 20.04
  • TF version is 2.9.1, and the pip freeze result is:
absl-py==1.2.0
astunparse==1.6.3
cachetools==5.2.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.0
flatbuffers==1.12
gast==0.4.0
google-auth==2.10.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.47.0
h5py==3.7.0
idna==3.3
imageio==2.21.1
importlib-metadata==4.12.0
keras==2.9.0
Keras-Preprocessing==1.1.2
libclang==14.0.6
Markdown==3.4.1
MarkupSafe==2.1.1
numpy==1.21.6
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==21.3
Pillow==9.2.0
protobuf==3.19.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.9
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-estimator==2.9.0
tensorflow-gpu==2.9.1
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
tqdm==4.64.0
typing_extensions==4.3.0
urllib3==1.26.11
Werkzeug==2.2.2
wrapt==1.14.1
zipp==3.8.1
  • We have also tested on a similar system (Ubuntu 20.04 with an RTX 3090 instead of the 3070 mentioned above) and obtained similar times.
  • We get similar times with PyTorch. The PyTorch package versions are as follows:
torch==1.12.1+cu113
torchaudio==0.12.1+cu113
torchvision==0.13.1+cu113

Thank you very much!

trzy commented

That sounds very odd. I guess TF2 and PyTorch can be ruled out as the issue, leaving CUDA or something else to blame. I will give it a try in Ubuntu tonight (need to reboot into it when I'm done working).

Question: Are the repo and data on a network drive rather than a local one? It seems like you might be I/O bound. Make absolutely sure that you are doing this on a local disk (in my case, I'm running on an SSD installed in an M.2 slot). For example, if you are in an academic or professional environment and are doing this in a home directory (e.g., /home/your_username) on both machines, and that directory happens to be served remotely, that could be the problem. Perhaps try making a subdirectory in /tmp (e.g., /tmp/fasterrcnn) and putting the repo and all data there.
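
If you want to check this programmatically, here is a rough sketch (not part of the repo; Linux only, and the path and filesystem list are just examples) that reports which filesystem your dataset directory actually lives on. Running df -T on the directory from the shell gives the same answer.

# Rough sketch (not part of this repo): report which filesystem a path lives on
# by finding its longest matching mount point in /proc/mounts (Linux only).
import os

def mount_of(path):
    path = os.path.realpath(path)
    best = ("", "", "")
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype = line.split()[:3]
            prefix = mountpoint.rstrip("/") + "/"
            if (path == mountpoint or path.startswith(prefix)) and len(mountpoint) > len(best[1]):
                best = (device, mountpoint, fstype)
    return best

device, mountpoint, fstype = mount_of("./own_dataset")
print(f"{mountpoint} ({device}) -> {fstype}")
if fstype in ("nfs", "nfs4", "cifs", "smbfs", "fuse.sshfs"):
    print("Looks like a network share; copy the repo and dataset to a local disk first.")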

mnaranjorion commented

Hello,

sorry for taking so long to reply, but I've been away.

At first I thought it might be that too, since the data lived on a NAS and was accessed via a shared volume from the server that hosts the code and GPUs. We copied all the data onto that server's local SSD, but the times still seem to be the same. In any case, I need to check it again to make sure everything is being loaded correctly.

Have you been able to run the test on Ubuntu?

Thank you

trzy commented

I'm running CUDA 11.3 on Ubuntu and it still works fine. I did a fresh install of a PyTorch environment and also tried it in my old TensorFlow environment, which for some reason uses a Docker image for CUDA (I haven't dared to upgrade that since the beginning of this year).

When you run the TF2 version, is the GPU actually being used? You should see the following output at the beginning:

CUDA Available : yes
GPU Available  : yes
Eager Execution: yes

If "GPU Available" is "no", then you could conceivably see hours-long epoch times. PyTorch is the easiest to get running with GPU. Make sure to follow the last step and use the web site to obtain the exact package list to install. TensorFlow is trickier. I recall having to use an Nvidia docker for CUDA support. Otherwise, if I run it outside of that docker container, TF2 thinks that CUDA is available but has no GPU access.

So I think the potential culprits are:

  • Disk I/O (make sure you are not accidentally fetching the dataset over the network).
  • GPU not being engaged due to some environment configuration issue.

These days I work in Windows but hopefully we can get this issue resolved because I know most people still torture themselves with Linux ;)

trzy commented

Closed due to inactivity.