How long does it take for the training?
moon5756 opened this issue · 2 comments
Hi, thanks for the really helpful work.
I just wonder how long it took for the training.
My desktop has the following specs.
CPU: Intel Core i7-6900K @ 3.2GHz
SSD: Samsung SSD 850 EVO
GPU: NVIDIA GeForce RTX 2080 Ti
I ran the training script and it prints "active GPUs: 0", from which I can tell my GPU is being picked up. I changed the batch size to 50 in config.json because it complained about an OOM issue.
I ran the script for about 23 minutes and it had only completed one epoch.
One concern is that CPU utilization is around 99% while GPU utilization is less than 10%.
Is there any configuration I need to change to fully utilize the GPU?
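(For reference, the utilization numbers above come from watching the GPU with a small pynvml snippet like the one below; it is just my own diagnostic sketch, not anything from this repo.)

```python
# Quick GPU-utilization check (pip install pynvml); purely a diagnostic sketch,
# not part of this repo's code.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the one passed via -g 0
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"gpu {util.gpu}%  mem {mem.used / 2**20:.0f} MiB")
    time.sleep(1)
pynvml.nvmlShutdown()
```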
Following is the command line log.
$ python train.py --config configs/config.json -g 0
=> active GPUs: 0
=> Output folder for this run -- jester_conv6
Using 9 processes for data loader.
Training is getting started...
Training takes 999999 epochs.
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Epoch: [0][0/2371] Loss 3.3603 (3.3603) Prec@1 2.000 (2.000) Prec@5 24.000 (24.000)
Epoch: [0][100/2371] Loss 3.3065 (3.3294) Prec@1 8.000 (5.267) Prec@5 28.000 (21.010)
Epoch: [0][200/2371] Loss 3.4034 (3.3176) Prec@1 6.000 (6.179) Prec@5 16.000 (21.980)
Epoch: [0][300/2371] Loss 3.3358 (3.3123) Prec@1 12.000 (6.698) Prec@5 20.000 (22.213)
Epoch: [0][400/2371] Loss 3.2839 (3.3080) Prec@1 10.000 (7.137) Prec@5 20.000 (22.339)
Epoch: [0][500/2371] Loss 3.2690 (3.3068) Prec@1 12.000 (7.246) Prec@5 28.000 (22.367)
Epoch: [0][600/2371] Loss 3.3679 (3.3045) Prec@1 6.000 (7.384) Prec@5 22.000 (22.326)
Epoch: [0][700/2371] Loss 3.3639 (3.3040) Prec@1 6.000 (7.387) Prec@5 14.000 (22.397)
Epoch: [0][800/2371] Loss 3.2118 (3.3035) Prec@1 8.000 (7.366) Prec@5 36.000 (22.429)
Epoch: [0][900/2371] Loss 3.3153 (3.3017) Prec@1 2.000 (7.478) Prec@5 24.000 (22.562)
Epoch: [0][1000/2371] Loss 3.3295 (3.3003) Prec@1 4.000 (7.538) Prec@5 16.000 (22.691)
Epoch: [0][1100/2371] Loss 3.2486 (3.2990) Prec@1 10.000 (7.599) Prec@5 30.000 (22.874)
Epoch: [0][1200/2371] Loss 3.3112 (3.2973) Prec@1 6.000 (7.607) Prec@5 14.000 (22.981)
Epoch: [0][1300/2371] Loss 3.2315 (3.2960) Prec@1 14.000 (7.631) Prec@5 36.000 (23.148)
Epoch: [0][1400/2371] Loss 3.3065 (3.2944) Prec@1 4.000 (7.659) Prec@5 26.000 (23.269)
Epoch: [0][1500/2371] Loss 3.2688 (3.2931) Prec@1 12.000 (7.695) Prec@5 34.000 (23.387)
Epoch: [0][1600/2371] Loss 3.1971 (3.2921) Prec@1 12.000 (7.734) Prec@5 40.000 (23.492)
Epoch: [0][1700/2371] Loss 3.2873 (3.2908) Prec@1 8.000 (7.790) Prec@5 20.000 (23.588)
Epoch: [0][1800/2371] Loss 3.1563 (3.2894) Prec@1 16.000 (7.842) Prec@5 42.000 (23.719)
Epoch: [0][1900/2371] Loss 3.2181 (3.2875) Prec@1 8.000 (7.883) Prec@5 36.000 (23.916)
Epoch: [0][2000/2371] Loss 3.2744 (3.2859) Prec@1 4.000 (7.929) Prec@5 18.000 (24.034)
Epoch: [0][2100/2371] Loss 3.3153 (3.2836) Prec@1 6.000 (7.952) Prec@5 28.000 (24.207)
Epoch: [0][2200/2371] Loss 3.1725 (3.2810) Prec@1 12.000 (8.038) Prec@5 36.000 (24.462)
Epoch: [0][2300/2371] Loss 3.2124 (3.2788) Prec@1 8.000 (8.044) Prec@5 38.000 (24.708)
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Test: [0/296] Loss 3.2033 (3.2033) Prec@1 14.000 (14.000) Prec@5 32.000 (32.000)
EDIT: Wait a sec... I just checked TensorBoard, and is it supposed to take more than a day?
First, your hardware configuration seems quite good, which matters when working with any 3D network. For the training configuration, I remember using 3 GPUs, so it is normal that the original batch size doesn't fit on a single GPU. My advice is to make sure the batch size stays greater than 32 (so the batch statistics help the model generalize well). Another suggestion is to use the apex library for mixed-precision training (https://github.com/NVIDIA/apex); the library wasn't mature enough when I did this project.
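In case it helps, here is a minimal sketch of where amp plugs into a training step. Nothing in it comes from this repo: the tiny Conv3d model, the fake batch, and the hyperparameters are placeholders (recent PyTorch also ships torch.cuda.amp, which covers the same use case without the extra install):

```python
# Hedged sketch of mixed-precision training with NVIDIA apex (amp).
# Model, data, and hyperparameters are placeholders, not the repo's actual setup.
import torch
import torch.nn as nn
from apex import amp

device = torch.device("cuda")
model = nn.Sequential(                      # stand-in for the repo's 3D conv net
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 27),                      # 27 gesture classes in Jester
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# opt_level="O1" runs most ops in fp16 while keeping fp32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

clips = torch.randn(32, 3, 16, 112, 112, device=device)   # fake batch of clips
labels = torch.randint(0, 27, (32,), device=device)

loss = nn.functional.cross_entropy(model(clips), labels)
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:       # loss scaling for fp16
    scaled_loss.backward()
optimizer.step()
```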
Another important thing is data loading. If you have an SSD, make sure your data is stored on it (it greatly increases loading speed). Slow loading leads to GPU starvation, which slows down the training process.
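A rough sketch of the DataLoader knobs I mean (num_workers and pin_memory); the dataset below is just a stand-in, and the actual loader in this repo may already expose these settings through config.json:

```python
# Hedged sketch of DataLoader settings that usually help when the GPU sits idle
# waiting on data; the random TensorDataset is a placeholder for the video dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(100, 3, 8, 56, 56),          # fake clips
    torch.randint(0, 27, (100,)),            # fake labels
)

loader = DataLoader(
    dataset,
    batch_size=50,
    shuffle=True,
    num_workers=8,       # enough workers to keep decoding/augmentation off the GPU's critical path
    pin_memory=True,     # page-locked host memory speeds up host-to-GPU copies
)

for clips, labels in loader:
    clips = clips.cuda(non_blocking=True)    # overlaps the copy with compute when pinned
    labels = labels.cuda(non_blocking=True)
    break
```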
The training could take several days (depending on your configuration).
If you like this project, feel free to leave a star (it is my only reward ^^).
Thanks for the prompt response.
Based on the TensorBoard logs, it looks like yours also took about a day and 14 hours for 350 epochs. Thanks for the suggestions. I already left a star.