tryolabs/luminoth

ETA for training

amughrabi opened this issue · 2 comments

Hi team,

I would like to know if there is a way to get the estimated time for the lumi train -c config.yaml command. I created a dataset containing 7809 images (COCO style), and the config.yaml is:

train:
  # Run name for the training session.
  run_name: my_run
  job_dir: jobs
  learning_rate:
    decay_method: piecewise_constant
    # Custom dataset for Luminoth Tutorial
    boundaries: [90000, 160000, 250000]
    values: [0.0003, 0.0001, 0.00003, 0.00001]
dataset:
  type: object_detection
  dir: tf
model:
  type: fasterrcnn
  network:
    num_classes: 13

I am running the training on CPU (Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz), and here is a sample output:

INFO:tensorflow:Saving checkpoints for 20689 into jobs/my_run/model.ckpt.
INFO:tensorflow:step: 20689, file: JPEGImages/cap198.jpg, train_loss: 1.96116948128, in 20.54s
INFO:tensorflow:step: 20690, file: JPEGImages/cap541.jpg, train_loss: 2.02766418457, in 27.07s
INFO:tensorflow:step: 20691, file: JPEGImages/cap220.jpg, train_loss: 1.91760718822, in 24.67s
INFO:tensorflow:step: 20692, file: JPEGImages/cap299.jpg, train_loss: 2.05084943771, in 21.31s

It has been running for 4 days. Is this expected, and how can I find out how much time remains (if possible)?

Thanks a million,
Mughrabi

Instead of using CPU, I switched to GCloud, following https://luminoth.readthedocs.io/en/latest/usage/cloud.html. It has been running for 2 days.

I 2019-09-16T19:41:32.707196950Z master-replica-0 [master-0] - step: 138172, file: JPEGImages/cap036.jpg, train_loss: 1.38995575905, in 1.12s master-replica-0 
I 2019-09-16T19:41:33.865154027Z master-replica-0 [master-0] - step: 138173, file: JPEGImages/cap624.jpg, train_loss: 1.34886920452, in 1.16s master-replica-0 

Can you please advise how to find the remaining time until the training completes? Am I stuck in an infinite loop?

I found a way to compute it!

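  • Assume that the batch_size is B, the average time spent on 1 step is S (in seconds), the number of epochs is P, and the dataset contains N images.
  • One epoch takes N / B steps, so the initial ETA is (P * N / B) * S seconds. The remaining time is (P * N / B - current step) * S seconds.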
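For anyone who wants to automate this, here is a rough sketch that reads the step number and step time from log lines like the ones above and estimates the remaining time. The function names are made up for illustration, and the num_epochs=20 and batch_size=1 values in the usage part are assumptions, not values taken from my run; plug in whatever your config actually uses.

```python
import math
import re

# Matches the "step: N, ... in X.XXs" part of a training log line.
STEP_RE = re.compile(r"step: (\d+),.*? in ([\d.]+)s")


def parse_steps(log_lines):
    """Return (step, seconds) pairs for every line that reports a step."""
    pairs = []
    for line in log_lines:
        match = STEP_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def remaining_seconds(log_lines, num_images, num_epochs, batch_size=1):
    """Estimate the remaining training time in seconds.

    total steps = num_epochs * ceil(num_images / batch_size)
    remaining   = (total steps - last logged step) * average step time
    """
    pairs = parse_steps(log_lines)
    if not pairs:
        raise ValueError("no step lines found in the log")
    avg_step_time = sum(seconds for _, seconds in pairs) / len(pairs)
    total_steps = num_epochs * math.ceil(num_images / batch_size)
    current_step = pairs[-1][0]
    return max(total_steps - current_step, 0) * avg_step_time


if __name__ == "__main__":
    sample = [
        "step: 138172, file: JPEGImages/cap036.jpg, train_loss: 1.38995575905, in 1.12s",
        "step: 138173, file: JPEGImages/cap624.jpg, train_loss: 1.34886920452, in 1.16s",
    ]
    # num_epochs=20 and batch_size=1 are hypothetical values for illustration.
    eta = remaining_seconds(sample, num_images=7809, num_epochs=20, batch_size=1)
    print("estimated remaining time: %.1f hours" % (eta / 3600.0))
```

The average is only as good as the lines you feed it, so passing a few hundred recent lines should give a steadier estimate than two.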