How to resume training if interrupted
imran7778 opened this issue · 4 comments
I need help with how to resume training if the system shuts down or training is interrupted.
My training stopped due to a system shutdown; after that I executed the following command:
bash train_segan.sh
It starts normally and loads the checkpoint successfully, but training starts from zero, not from the previously saved checkpoint.
Please guide me on how to resume training.
Thanks
Hi @imran7778 ,
the latest checkpoint in the dir should load successfully without any further work. Is it possible that the checkpoint is corrupt? Try modifying the 'checkpoint' text file within the directory to change the pointer to the latest-but-one file, thus telling TF to load a prior ckpt version.
I'm not sure I understand, however: what do you mean by "it normally starts and loads checkpoints successfully but starts training from zero"? How do you know it starts from zero? (I understand you've seen the verbose [*] Load SUCCESS message.)
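For reference, this is roughly how TF 1.x resolves the pointer inside that 'checkpoint' text file; a minimal sketch only, where the directory name is illustrative and should match your save_path:

import tensorflow as tf

save_path = 'segan_v1'  # checkpoint directory (should match the save_path argument)

# Read the 'checkpoint' text file; this is the pointer TF follows on restore.
ckpt = tf.train.get_checkpoint_state(save_path)
if ckpt and ckpt.model_checkpoint_path:
    print('model_checkpoint_path      :', ckpt.model_checkpoint_path)
    print('all_model_checkpoint_paths :', list(ckpt.all_model_checkpoint_paths))
    # To force a prior version, either edit the text file by hand or restore
    # from an explicit path, e.g. saver.restore(sess, os.path.join(save_path, 'SEGAN-60'))
else:
    print('No checkpoint found in', save_path)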
Regards
Dear @santi-pdp
Thanks for your reply. Here is a screenshot that may clarify my point.
The first time I start training, it gives me the following output:
bash train_segan.sh
2018-04-10 10:04:24.617249: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-10 10:04:24.617354: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
Parsed arguments: {'z_depth': 256, 'l1_remove_epoch': 150, 'batch_size': 3, 'model': 'gan', 'init_l1_weight': 10.0, 'g_learning_rate': 0.0002, 'seed': 111, 'z_dim': 256, 'save_freq': 10, 'noise_decay': 0.7, 'denoise_epoch': 5, 'synthesis_path': 'dwavegan_samples', 'd_label_smooth': 0.25, 'weights': None, 'denoise_lbound': 0.01, 'epoch': 150, 'd_learning_rate': 0.0002, 'save_path': 'segan_v1', 'beta_1': 0.5, 'init_noise_std': 0.0, 'test_wav': None, 'e2e_dataset': 'data/segan.tfrecords', 'save_clean_path': 'test_clean_results', 'canvas_size': 16384, 'g_nl': 'prelu', 'g_type': 'ae'}
Using device: /cpu:0
Creating GAN model
*** Building Generator ***
Downconv (3, 16384, 1) -> (3, 8192, 16)
Adding skip connection downconv 0
-- Enc: prelu activation --
Downconv (3, 8192, 16) -> (3, 4096, 32)
.
.
.
.
Amount of alpha vectors: 21
Amount of skip connections: 10
Last wave shape: (3, 16384, 1)
num of G returned: 23
*** Discriminator summary ***
D block 0 input shape: (3, 16384, 2) *** downconved shape: (3, 8192, 16) *** Applying VBN *** Applying Lrelu ***
.
.
.
D block 10 input shape: (3, 16, 512) *** downconved shape: (3, 8, 1024) *** Applying VBN *** Applying Lrelu ***
discriminator deconved shape: (3, 8, 1024)
discriminator output shape: (3, 1)
Not clipping D weights
Initializing optimizers...
Initializing variables...
Sampling some wavs to store sample references...
sample noisy shape: (3, 16384)
sample wav shape: (3, 16384)
sample z shape: (3, 8, 1024)
total examples in TFRecords data/segan.tfrecords: 360
Batches per epoch: 120.0
[*] Reading checkpoints...
[!] Load failed
0/18000.0 (epoch 0), d_rl_loss = 1.42159, d_fk_loss = 0.02565, g_adv_loss = 5.51244, g_l1_loss = 6.08547, time/batch = 12.21935, mtime/batch = 12.21935
1/18000.0 (epoch 0), d_rl_loss = 1.40727, d_fk_loss = 10.28780, g_adv_loss = 2.06019, g_l1_loss = 5.75486, time/batch = 11.97167, mtime/batch = 12.09551
2/18000.0 (epoch 0), d_rl_loss = 5.54344, d_fk_loss = 9.00089, g_adv_loss = 5.41440, g_l1_loss = 6.22119, time/batch = 10.84464, mtime/batch = 11.67856
3/18000.0 (epoch 0), d_rl_loss = 2.56064, d_fk_loss = 0.67524, g_adv_loss = 110.04749, g_l1_loss = 5.88563, time/batch = 11.98766, mtime/batch = 11.75583
4/18000.0 (epoch 0), d_rl_loss = 43.09314, d_fk_loss = 32.41562, g_adv_loss = 18.53921, g_l1_loss = 6.13015, time/batch = 11.27476, mtime/batch = 11.65962
.
.
.
9/18000.0 (epoch 0), d_rl_loss = 16.02569, d_fk_loss = 12.40006, g_adv_loss = 8.71034, g_l1_loss = 5.61963, time/batch = 12.91840, mtime/batch = 11.64647
w0 max: 0.06945234537124634 min: -0.06775650382041931
w1 max: 0.051821060478687286 min: -0.04958131164312363
w2 max: 0.0637265294790268 min: -0.061875924468040466
10/18000.0 (epoch 0), d_rl_loss = 10.47512, d_fk_loss = 5.93869, g_adv_loss = 10.88952, g_l1_loss = 5.71833, time/batch = 11.29298, mtime/batch = 11.61434
11/18000.0 (epoch 0), d_rl_loss = 4.90630, d_fk_loss = 1.85100, g_adv_loss = 7.53411, g_l1_loss = 5.72742, time/batch = 11.91929, mtime/batch = 11.63975
12/18000.0 (epoch 0), d_rl_loss = 2.07515, d_fk_loss = 1.90992, g_adv_loss = 7.60952, g_l1_loss = 6.65654, time/batch = 13.04373, mtime/batch = 11.74775
13/18000.0 (epoch 0), d_rl_loss = 3.69959, d_fk_loss = 6.78575, g_adv_loss = 2.97328, g_l1_loss = 5.80335, time/batch = 11.46316, mtime/batch = 11.72742
14/18000.0 (epoch 0), d_rl_loss = 0.48384, d_fk_loss = 1.33486, g_adv_loss = 2.08532, g_l1_loss = 5.95979, time/batch = 12.65085, mtime/batch = 11.78898
.
.
.
.
64/18000.0 (epoch 0), d_rl_loss = 0.14060, d_fk_loss = 0.06874, g_adv_loss = 0.48891, g_l1_loss = 6.13891, time/batch = 10.49836, mtime/batch = 11.53807
65/18000.0 (epoch 0), d_rl_loss = 0.12317, d_fk_loss = 0.05944, g_adv_loss = 1.03536, g_l1_loss = 4.85944, time/batch = 10.57506, mtime/batch = 11.52348
66/18000.0 (epoch 0), d_rl_loss = 0.20725, d_fk_loss = 0.19382, g_adv_loss = 1.22923, g_l1_loss = 4.45759, time/batch = 10.51389, mtime/batch = 11.50841
67/18000.0 (epoch 0), d_rl_loss = 0.06127, d_fk_loss = 0.01148, g_adv_loss = 0.97544, g_l1_loss = 4.50832, time/batch = 10.57977, mtime/batch = 11.49475
68/18000.0 (epoch 0), d_rl_loss = 0.09463, d_fk_loss = 0.06658, g_adv_loss = 0.54611, g_l1_loss = 4.63855, time/batch = 11.85356, mtime/batch = 11.49995
69/18000.0 (epoch 0), d_rl_loss = 0.49186, d_fk_loss = 0.22236, g_adv_loss = 0.57460, g_l1_loss = 3.07398, time/batch = 11.27534, mtime/batch = 11.49674
w0 max: 0.995848536491394 min: 0.0767320990562439
w1 max: 0.9888091087341309 min: 0.008043618872761726
w2 max: 0.9928516149520874 min: 0.041960734874010086
70/18000.0 (epoch 0), d_rl_loss = 0.03219, d_fk_loss = 0.09166, g_adv_loss = 0.54423, g_l1_loss = 6.05527, time/batch = 12.04599, mtime/batch = 11.50448
^C
2018-04-10 10:21:41.574188: W tensorflow/core/kernels/queue_base.cc:294] _2_device_0/input_producer: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
return fn(*args)
File "/home/imran/miniconda2/envs/ten/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1021, in _run_fn
status, run_metadata)
KeyboardInterrupt
After iteration number 70/18000 I interrupted the training myself. My save path looks like...
and the checkpoint txt file looks like this:
model_checkpoint_path: "SEGAN-70"
all_model_checkpoint_paths: "SEGAN-30"
all_model_checkpoint_paths: "SEGAN-40"
all_model_checkpoint_paths: "SEGAN-50"
all_model_checkpoint_paths: "SEGAN-60"
all_model_checkpoint_paths: "SEGAN-70"
Now I have restarted the training and expected it to resume from iteration 70/18000, but it starts from iteration 0/18000, as you can see in the following output:
bash train_segan.sh
2018-04-10 11:26:09.351613: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-10 11:26:09.351762: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
Parsed arguments: {'batch_size': 3, 'epoch': 150, 'd_learning_rate': 0.0002, 'save_clean_path': 'test_clean_results', 'model': 'gan', 'g_type': 'ae', 'denoise_epoch': 5, 'z_dim': 256, 'beta_1': 0.5, 'd_label_smooth': 0.25, 'g_learning_rate': 0.0002, 'canvas_size': 16384, 'weights': None, 'seed': 111, 'z_depth': 256, 'save_path': 'segan_v1', 'l1_remove_epoch': 150, 'e2e_dataset': 'data/segan.tfrecords', 'test_wav': None, 'init_l1_weight': 10.0, 'denoise_lbound': 0.01, 'synthesis_path': 'dwavegan_samples', 'g_nl': 'prelu', 'save_freq': 10, 'noise_decay': 0.7, 'init_noise_std': 0.0}
Using device: /cpu:0
Creating GAN model
*** Building Generator ***
Downconv (3, 16384, 1) -> (3, 8192, 16)
.
.
.
.
Not clipping D weights
Initializing optimizers...
Initializing variables...
Sampling some wavs to store sample references...
sample noisy shape: (3, 16384)
sample wav shape: (3, 16384)
sample z shape: (3, 8, 1024)
total examples in TFRecords data/segan.tfrecords: 360
Batches per epoch: 120.0
[*] Reading checkpoints...
[*] Read SEGAN-70
[*] Load SUCCESS
0/18000.0 (epoch 0), d_rl_loss = 0.02655, d_fk_loss = 0.27790, g_adv_loss = 1.29016, g_l1_loss = 4.26079, time/batch = 12.79920, mtime/batch = 12.79920
1/18000.0 (epoch 0), d_rl_loss = 0.08352, d_fk_loss = 0.03772, g_adv_loss = 0.44378, g_l1_loss = 5.76775, time/batch = 12.14758, mtime/batch = 12.47339
2/18000.0 (epoch 0), d_rl_loss = 0.15646, d_fk_loss = 0.02151, g_adv_loss = 1.40255, g_l1_loss = 3.38811, time/batch = 11.16680, mtime/batch = 12.03786
3/18000.0 (epoch 0), d_rl_loss = 0.04816, d_fk_loss = 0.29102, g_adv_loss = 0.99367, g_l1_loss = 6.05134, time/batch = 11.06146, mtime/batch = 11.79376
4/18000.0 (epoch 0), d_rl_loss = 0.13729, d_fk_loss = 0.17743, g_adv_loss = 1.32933, g_l1_loss = 4.26389, time/batch = 11.02163, mtime/batch = 11.63933
5/18000.0 (epoch 0), d_rl_loss = 0.19347, d_fk_loss = 0.04417, g_adv_loss = 0.68631, g_l1_loss = 4.05842, time/batch = 11.03287, mtime/batch = 11.53826
6/18000.0 (epoch 0), d_rl_loss = 0.10904, d_fk_loss = 0.00201, g_adv_loss = 1.50521, g_l1_loss = 5.40282, time/batch = 11.76548, mtime/batch = 11.57072
.
.
.
.
28/18000.0 (epoch 0), d_rl_loss = 0.09597, d_fk_loss = 0.06830, g_adv_loss = 0.28432, g_l1_loss = 3.43519, time/batch = 18.37957, mtime/batch = 12.22184
29/18000.0 (epoch 0), d_rl_loss = 0.37213, d_fk_loss = 0.05943, g_adv_loss = 0.75423, g_l1_loss = 3.72193, time/batch = 11.14937, mtime/batch = 12.18609
w0 max: 0.5334341526031494 min: -0.27949172258377075
w1 max: 0.867225170135498 min: -0.07362376898527145
w2 max: 0.916520357131958 min: 0.20332744717597961
30/18000.0 (epoch 0), d_rl_loss = 0.17321, d_fk_loss = 0.30142, g_adv_loss = 0.30572, g_l1_loss = 3.94730, time/batch = 11.06777, mtime/batch = 12.15002
31/18000.0 (epoch 0), d_rl_loss = 0.01438, d_fk_loss = 0.06208, g_adv_loss = 0.78712, g_l1_loss = 2.60531, time/batch = 12.31825, mtime/batch = 12.15527
32/18000.0 (epoch 0), d_rl_loss = 0.12803, d_fk_loss = 0.06517, g_adv_loss = 0.78155, g_l1_loss = 4.04289, time/batch = 11.45150, mtime/batch = 12.13395
^C
2018-04-10 11:34:56.431348: W tensorflow/core/kernels/queue_base.cc:294] _2_device_0/input_producer: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
status, run_metadata)
KeyboardInterrupt
After restarting the training, my checkpoint txt file has also changed, as you can see below:
model_checkpoint_path: "SEGAN-30"
all_model_checkpoint_paths: "SEGAN-10"
all_model_checkpoint_paths: "SEGAN-20"
all_model_checkpoint_paths: "SEGAN-30"
This is not my real training run; the actual training uses a big dataset and was stopped at iteration 76000/90000 due to a system shutdown after 3 days of continuous training. I know that when I restart training it will begin from iteration 0/90000. Please help me figure out how I can resume it...
Thanks
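One thing worth noting from the output above: after the restart, SEGAN-70 no longer appears in the checkpoint file, so the restart has overwritten the earlier checkpoints. Before trying another resume, it may help to copy the save directory aside first; a minimal sketch, with an illustrative backup path:

import shutil

# Illustrative paths: copy the whole save dir aside so a restart cannot
# overwrite the checkpoints you still want to resume from.
# (copytree fails if the destination already exists.)
shutil.copytree('segan_v1', 'segan_v1_backup_iter76000')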
I am also facing the same issue. After an interruption, when I tried to retrain the model it showed "LOAD SUCCESSFUL" but started from epoch 0 / iteration 0. Please suggest any possible solution. #46
The trained model is not loaded in the code even though it shows "LOAD SUCCESSFUL".
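From the output earlier in the thread, the losses right after the restart are already small, which suggests the weights themselves are restored and only the displayed iteration counter restarts at 0. One possible workaround (a sketch only, not part of this repo; names are illustrative) is to recover the step from the latest checkpoint name and start the loop counter there:

import os
import re
import tensorflow as tf

def resume_step(ckpt_dir):
    # Return the step encoded in the latest checkpoint name (e.g. 70 for 'SEGAN-70'),
    # or 0 when no checkpoint exists yet.
    ckpt = tf.train.get_checkpoint_state(ckpt_dir)
    if ckpt and ckpt.model_checkpoint_path:
        match = re.search(r'-(\d+)$', os.path.basename(ckpt.model_checkpoint_path))
        if match:
            return int(match.group(1))
    return 0

counter = resume_step('segan_v1')
print('Resuming from step', counter)
# The training loop would then count from `counter` instead of 0,
# so new checkpoints continue the SEGAN-<step> numbering.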
After the final training, what are d_rl_loss, d_fk_loss, g_adv_loss, and g_l1_loss respectively? I found that the loss of the trained discriminator is very small, basically about 0.0005.