Googolxx/STF

Inquiries on Training Issues

Closed this issue · 2 comments

Dear Googolxx,

I am retraining the PyTorch implementation, and I randomly chose 150k training images and 50k testing images.

I followed the documented training usage: "CUDA_VISIBLE_DEVICES=0,1 python train.py -d /path/to/image/dataset/ -e 1000 --batch-size 16 --save --save_path /path/to/save/ -m stf --cuda --lambda 0.0035"

But there is a problem: training is slow, about four hours per epoch, so "-e 1000" would take roughly 166 days for a single lambda.
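
For diagnosis, it can help to check whether the time goes into data loading or GPU compute. Below is a minimal, generic PyTorch timing sketch; `train_loader`, `model`, `criterion`, and `optimizer` are stand-ins for the objects built in train.py, and the exact loss call there may differ.

```python
import time
import torch

# Hypothetical diagnostic loop: split each step's wall time into
# "waiting on the DataLoader" vs. the full step (data + GPU work).
data_time, step_time, end = 0.0, 0.0, time.time()
for step, images in enumerate(train_loader):    # train_loader: assumed name
    data_time += time.time() - end              # time blocked on data loading
    images = images.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), images)     # exact call differs in train.py
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                    # flush GPU work for honest timing
    step_time += time.time() - end              # full step: data + compute
    end = time.time()
    if step == 49:                              # 50 steps is enough for a trend
        print(f"data: {data_time / 50:.3f}s/iter, total: {step_time / 50:.3f}s/iter")
        break
```

If the data time dominates, the GPU is being starved by the input pipeline rather than by the model itself.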

Furthermore, I am using a 4090 GPU; the training status is shown below:
[screenshots: GPU/training status]
Also, one training process owns many PIDs (maybe this is what causes the slow training speed?):
[screenshot: process list showing multiple PIDs]
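
As an aside: the many PIDs are most likely PyTorch DataLoader worker subprocesses, which are normal and not a slowdown by themselves; a GPU starved by too few workers is the more common culprit. A minimal sketch, with illustrative values that are not necessarily this repo's defaults:

```python
from torch.utils.data import DataLoader

# Each DataLoader worker is a separate OS process, so one training job
# legitimately shows up as num_workers + 1 PIDs. Too FEW workers (not
# too many) is what usually leaves the GPU idle between batches.
train_loader = DataLoader(
    train_dataset,      # stand-in for the dataset constructed in train.py
    batch_size=16,
    shuffle=True,
    num_workers=8,      # illustrative; tune toward your CPU core count
    pin_memory=True,    # pairs with .cuda(non_blocking=True) for faster transfer
)
```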

Am I missing something, or should I consider some tips/tricks? What steps can be taken to increase training speed?

See this issue for training details.
Actually, the number of epochs depends on the training set, so 'iterations' is a more appropriate way to describe the training schedule.
It takes about 1.4-1.8 million iterations (140+ epochs for your training set) to finish training.
On two 2080Ti GPUs, that takes roughly 10-14 days.
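
To make the epoch/iteration conversion above concrete, here is the arithmetic for a 150k-image training set at batch size 16:

```python
train_images = 150_000
batch_size = 16

iters_per_epoch = train_images // batch_size    # 9375 iterations per epoch
target_iters = 1_500_000                        # midpoint of the 1.4-1.8M range
epochs_needed = target_iters / iters_per_epoch  # = 160 epochs, hence "140+"
print(iters_per_epoch, epochs_needed)
```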

Thanks a lot!