Googolxx/STF

Inquiries on Training Issues

Closed this issue · 2 comments

Dear Googolxx,

I am retraining the PyTorch implementation, and I randomly chose 150k training images and 50k testing images.

I followed the documented training usage: "CUDA_VISIBLE_DEVICES=0,1 python train.py -d /path/to/image/dataset/ -e 1000 --batch-size 16 --save --save_path /path/to/save/ -m stf --cuda --lambda 0.0035"

But there is a problem: training is slow, about four hours per epoch, so "-e 1000" would take roughly 166 days for a single lambda.
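
For diagnosis, it can help to check whether the time goes into data loading or GPU compute. Below is a minimal, generic PyTorch timing sketch; `train_loader`, `model`, `criterion`, and `optimizer` are stand-ins for the objects built in train.py, and the exact loss call there may differ.

```python
import time
import torch

# Hypothetical diagnostic loop: split each step's wall time into
# "waiting on the DataLoader" vs. the full step (data + GPU work).
data_time, step_time, end = 0.0, 0.0, time.time()
for step, images in enumerate(train_loader):    # train_loader: assumed name
    data_time += time.time() - end              # time blocked on data loading
    images = images.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), images)     # exact call differs in train.py
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                    # flush GPU work for honest timing
    step_time += time.time() - end              # full step: data + compute
    end = time.time()
    if step == 49:                              # 50 steps is enough for a trend
        print(f"data: {data_time / 50:.3f}s/iter, total: {step_time / 50:.3f}s/iter")
        break
```

If the data time dominates, the GPU is being starved by the input pipeline rather than by the model itself.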

Furthermore, I am using a 4090 GPU; the training status is shown below:
[screenshots: GPU/training status]
Also, one training process owns many PIDs (maybe this is what causes the slow training speed?):
[screenshot: process list showing multiple PIDs]
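
As an aside: the many PIDs are most likely PyTorch DataLoader worker subprocesses, which are normal and not a slowdown by themselves; a GPU starved by too few workers is the more common culprit. A minimal sketch, with illustrative values that are not necessarily this repo's defaults:

```python
from torch.utils.data import DataLoader

# Each DataLoader worker is a separate OS process, so one training job
# legitimately shows up as num_workers + 1 PIDs. Too FEW workers (not
# too many) is what usually leaves the GPU idle between batches.
train_loader = DataLoader(
    train_dataset,      # stand-in for the dataset constructed in train.py
    batch_size=16,
    shuffle=True,
    num_workers=8,      # illustrative; tune toward your CPU core count
    pin_memory=True,    # pairs with .cuda(non_blocking=True) for faster transfer
)
```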

Am I missing something, or should I consider some tips/tricks? What steps can be taken to increase training speed?

See this issue for training details.
Actually, the number of epochs depends on the training set, so 'iterations' is a more appropriate way to describe the training schedule.
It takes about 1.4-1.8 million iterations (140+ epochs for your training set) to finish training.
On two 2080Ti GPUs, that takes roughly 10-14 days.
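
To make the epoch/iteration conversion above concrete, here is the arithmetic for a 150k-image training set at batch size 16:

```python
train_images = 150_000
batch_size = 16

iters_per_epoch = train_images // batch_size    # 9375 iterations per epoch
target_iters = 1_500_000                        # midpoint of the 1.4-1.8M range
epochs_needed = target_iters / iters_per_epoch  # = 160 epochs, hence "140+"
print(iters_per_epoch, epochs_needed)
```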

Thanks a lot!