ETA for training
amughrabi opened this issue · 2 comments
Hi team,
Is there a way to get the estimated time for a lumi train -c config.yaml
run? I created a dataset containing 7809 images (COCO style), and the config.yaml is:
train:
  # Run name for the training session.
  run_name: my_run
  job_dir: jobs
  learning_rate:
    decay_method: piecewise_constant
    # Custom dataset for Luminoth Tutorial
    boundaries: [90000, 160000, 250000]
    values: [0.0003, 0.0001, 0.00003, 0.00001]
dataset:
  type: object_detection
  dir: tf
model:
  type: fasterrcnn
  network:
    num_classes: 13
I am running the training on a CPU (Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz), and here is a sample of the output:
INFO:tensorflow:Saving checkpoints for 20689 into jobs/my_run/model.ckpt.
INFO:tensorflow:step: 20689, file: JPEGImages/cap198.jpg, train_loss: 1.96116948128, in 20.54s
INFO:tensorflow:step: 20690, file: JPEGImages/cap541.jpg, train_loss: 2.02766418457, in 27.07s
INFO:tensorflow:step: 20691, file: JPEGImages/cap220.jpg, train_loss: 1.91760718822, in 24.67s
INFO:tensorflow:step: 20692, file: JPEGImages/cap299.jpg, train_loss: 2.05084943771, in 21.31s
It has been running for 4 days. Is this expected, and how can I find out how much time remains (if possible)?
Thanks a million,
Mughrabi
Instead of using the CPU, I am now training on GCloud, following https://luminoth.readthedocs.io/en/latest/usage/cloud.html. It has been running for 2 days.
I 2019-09-16T19:41:32.707196950Z master-replica-0 [master-0] - step: 138172, file: JPEGImages/cap036.jpg, train_loss: 1.38995575905, in 1.12s master-replica-0
I 2019-09-16T19:41:33.865154027Z master-replica-0 [master-0] - step: 138173, file: JPEGImages/cap624.jpg, train_loss: 1.34886920452, in 1.16s master-replica-0
Can you please advise how to find out the remaining time until the training completes? Am I stuck in an infinite loop?
I found a way to compute it!
- Assume that the batch_size is B, the average time spent on one step is S (in seconds), and the number of epochs is P.
- The initial ETA is (P / B) * S, in seconds.
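For anyone who wants to script this, here is a minimal sketch in Python. It is not part of Luminoth. The helper names (parse_steps, initial_eta_seconds, remaining_seconds) are hypothetical, and the regular expression assumes log lines shaped exactly like the samples posted in this issue. The target step used at the end is also an assumption: it is the last learning-rate boundary from the config, which is not necessarily the step at which training stops.

```python
import re
from statistics import mean

# Assumes lines shaped like: "... step: 20689, file: ..., train_loss: ..., in 20.54s"
STEP_RE = re.compile(r"step: (\d+), .*, in ([\d.]+)s")


def parse_steps(log_lines):
    """Return (step, seconds) pairs for every line that matches STEP_RE."""
    pairs = []
    for line in log_lines:
        match = STEP_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def initial_eta_seconds(p, batch_size, step_seconds):
    """The formula from the comment above: (P / B) * S."""
    return (p / batch_size) * step_seconds


def remaining_seconds(current_step, target_step, step_seconds):
    """Time left to reach target_step at the observed pace (never negative)."""
    return max(target_step - current_step, 0) * step_seconds


# Sample lines copied from the CPU run posted in this issue.
SAMPLE_LOG = [
    "INFO:tensorflow:Saving checkpoints for 20689 into jobs/my_run/model.ckpt.",
    "INFO:tensorflow:step: 20689, file: JPEGImages/cap198.jpg, train_loss: 1.96116948128, in 20.54s",
    "INFO:tensorflow:step: 20690, file: JPEGImages/cap541.jpg, train_loss: 2.02766418457, in 27.07s",
    "INFO:tensorflow:step: 20691, file: JPEGImages/cap220.jpg, train_loss: 1.91760718822, in 24.67s",
    "INFO:tensorflow:step: 20692, file: JPEGImages/cap299.jpg, train_loss: 2.05084943771, in 21.31s",
]

steps = parse_steps(SAMPLE_LOG)
avg_step_seconds = mean(seconds for _, seconds in steps)  # S
last_step = steps[-1][0]

# Hypothetical target: the last learning-rate boundary in the config.
TARGET_STEP = 250000
eta = remaining_seconds(last_step, TARGET_STEP, avg_step_seconds)
print(f"average step time: {avg_step_seconds:.2f}s")
print(f"estimated time to step {TARGET_STEP}: {eta / 86400:.1f} days")
```

Averaging over more log lines than these four should give a steadier value for S, since the per-step time on the CPU run varies by several seconds.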