kwotsin/transfer_learning_tutorial

Inconsistent Performance with Adam

zongyuange opened this issue · 14 comments

Dear repo owner, thanks for the excellent tutorial.

  1. My issue is that I get relatively low performance (23.2%) on the validation set with exactly your code. However, if I switch from the Adam optimizer to GradientDescent, the problem vanishes (see the optimizer sketch after this list). Do you have any clue why this is happening?
  2. If I instead set is_training=True during testing in:
    logits, end_points = inception.inception_resnet_v2(
    images,
    num_classes = dataset.num_classes,
    is_training = True)
    then the problem also vanishes, but I don't think that is appropriate.
  3. My system is Ubuntu 14.04 with TF 1.1.0
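
For reference, the optimizer swap in point 1 is just a change of constructor (a TF 1.x sketch; the learning-rate variable lr stands in for whatever the training script actually uses):

optimizer = tf.train.AdamOptimizer(learning_rate=lr)               # the tutorial's default
# optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)  # the swap that avoids the issue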

When you mentioned the low performance, do you mean training from scratch with the dataset? Have you restored the pre-trained weights or have you only restored a few layers? What do you mean by 'vanish'? Did the accuracy suddenly become very high?

Some possible reasons:

  1. The model wasn't trained enough to generalize well.

  2. The trained weights were not restored in the validation code. Make sure you restore all the same weights (the tensor shapes should be the same in training and evaluation).

It would be hard to further diagnose the problem unless I can see what went wrong in the code (or otherwise). Can you post the code in a Gist?

I have been using the pretrained weights of "inception_resnet_v2_2016_08_30.ckpt".
The code I am using is exactly the code you provided, for both training and testing.

The training accuracy is good; after two epochs it shows over 90% accuracy. But when I use the test code as follows, the accuracy drops to 20%, even when I swap the 'validation' split for the 'train' data.

I have attached the code in the dropbox.
https://www.dropbox.com/sh/4pw58togdrrvnx1/AADCa67tn5OjC_BlcL-nwySpa?dl=0

@zongyuange Do you have any other checkpoint files besides the ones in the training directory? You can try printing the checkpoint file in the evaluation code to see if you are getting the right checkpoint. There have been many times I left older checkpoints in the same directory and the get_latest_checkpoint function picked up the wrong checkpoint file.
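
A minimal way to check this (TF 1.x; './log' is only an example, use whatever log directory your training script wrote to):

checkpoint_file = tf.train.latest_checkpoint('./log')
print('Will restore from:', checkpoint_file)   # confirm this is the newest checkpoint you expect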

If this doesn't work, you can try:

  1. Setting is_training=True for the model, to see whether batch norm being deactivated is what causes the problem. Also try is_training=True for the preprocessing; in total there are 4 permutations to test (see the sketch after this list).
  2. Setting is_training=True for both the model and the data makes the evaluation almost exactly like the training code, so try testing this on the training data. If this still does not produce the 90+% you saw in training, then we know the problem should be isolated somewhere else in the evaluation code.
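
For clarity, these are the two is_training flags being permuted, roughly as they appear in the evaluation script (a sketch; the exact variable names are assumptions):

# Preprocessing flag: True applies the training-time augmentation pipeline.
image = inception_preprocessing.preprocess_image(raw_image, height, width, is_training=False)

# Model flag: True makes batch norm use mini-batch statistics instead of the moving averages.
logits, end_points = inception.inception_resnet_v2(images, num_classes=dataset.num_classes, is_training=False)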

@kwotsin, like you said, once I switch back to is_training=True during testing, the accuracy is back to normal. So I believe the model parameters are loaded correctly and the issue is inside batch norm. I then found this helpful post that explains why this is the case.
https://github.com/tensorflow/models/issues/391#issuecomment-247392028

Would you like to update your code or add an extra explanation for this? Because if I run your code straight away, it might not produce the result we want.

@zongyuange As mentioned in the thread you posted, the problem could be due to CUDA. Could you show me your CUDA version? I have just tested the code again, and this is my output:

2017-08-21 11:45:40.799001: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-08-21 11:45:40.799382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 860M
major: 5 minor: 0 memoryClockRate (GHz) 1.0195
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.27GiB
2017-08-21 11:45:40.799401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-08-21 11:45:40.799406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-08-21 11:45:40.799424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 860M, pci bus id: 0000:01:00.0)
2017-08-21 11:45:41.840042: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-08-21 11:45:41.840069: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 8 visible devices
2017-08-21 11:45:41.840804: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xcf90630 executing computations on platform Host. Devices:
2017-08-21 11:45:41.840829: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): <undefined>, <undefined>
2017-08-21 11:45:41.840985: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-08-21 11:45:41.840998: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 8 visible devices
2017-08-21 11:45:41.841420: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xc88fc50 executing computations on platform CUDA. Devices:
2017-08-21 11:45:41.841435: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): GeForce GTX 860M, Compute Capability 5.0
INFO:tensorflow:Restoring parameters from ./log/model.ckpt-18280
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Saving checkpoint to path ./log_eval_test/model.ckpt
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Epoch: 1/1
INFO:tensorflow:Current Streaming Accuracy: 0.0000
INFO:tensorflow:Global Step 1: Streaming Accuracy: 0.0000 (6.31 sec/step)
INFO:tensorflow:Global Step 2: Streaming Accuracy: 0.9722 (1.58 sec/step)
INFO:tensorflow:Global Step 3: Streaming Accuracy: 0.9722 (1.55 sec/step)
INFO:tensorflow:Global Step 4: Streaming Accuracy: 0.9722 (1.55 sec/step)

Note that the code was initially developed in TF 0.12, and now I'm testing it with TF 1.2.

So I suspect it might have to do with CUDA or your current version of TF, in which case you can try upgrading. I would have to verify that the issue is not specific to you alone before making code changes; otherwise the change might break the code for others.

@zongyuange I have added your solution to the FAQ in case some other users face the same problem. Thank you so much for bringing this up.

@kwotsin , thanks for doing this for other users. Apologies for not having enough time to try a different CUDA version; once I have any update in the future I will let you know immediately. Thanks again for making this wonderful tutorial.

@zongyuange and @kwotsin : I have the same problem. I get around 99.80% accuracy on the training set but only 64% on validation, even though I set is_training=False as you suggested. I trained the network on another dataset (X-ray images, RGB) with a batch size of 32, and did not apply data augmentation (I set is_training to False in image = inception_preprocessing.preprocess_image(raw_image, height, width, False)). Can you guess what my problem is?

@John1231983 : you can follow the FAQ on the front page; that should solve your problem. I think the problem is that something is wrong with batch norm's train/test switching for this particular network structure. I didn't run into the same problem with other networks.

@zongyuange : I followed it and set is_training=False. I think it may be an overfitting problem, so I set the dropout to 0.5 and the result increased to 75%. Because I have a small dataset (about 2000 images), I think ResNet is more suitable than Inception-ResNet. Am I right?
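
(For reference, "the dropout" here means the dropout_keep_prob argument of the slim model; a sketch, assuming the inception_resnet_v2 signature:)

logits, end_points = inception.inception_resnet_v2(
    images,
    num_classes = dataset.num_classes,
    is_training = True,
    dropout_keep_prob = 0.5)   # slim's default is 0.8; a lower keep probability means stronger dropout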

Hey!
I had exactly the same problem with is_training; it is indeed due to the way the batch_norm parameters are updated. TensorFlow's batch norm requires you to manually update the parameters during training by defining:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
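# update_ops holds the moving-mean/variance update ops that batch norm registers;
# making train_op depend on them ensures they run at every training step.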
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

as mentioned in a note in TF's documentation. Once you do this, setting is_training to false for evaluation will work as it should.

If you forget that, during evaluation TF will use the default (never-updated) values of the batch norm mean and variance, and this will break your activations, making it look like your network didn't train.

@GPhilo Thanks for your answer!

@GPhilo Hi, I have encountered a similar issue when trying to do transfer learning on MobileNetV2. At first I used the optimizer.minimize way. Then I found some suggestions on Stack Overflow stating that slim.learning.create_train_op automatically does this:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

Therefore, I changed from using optimizer.minimize to create_train_op, but the problem still exists.
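
(For completeness, the create_train_op version I switched to looks roughly like this; total_loss stands for whatever loss tensor the training script builds, and slim's create_train_op is supposed to add the UPDATE_OPS dependency itself:)

train_op = slim.learning.create_train_op(total_loss, optimizer)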

I think the problem people are facing with is_training is related to the learning_rate parameter. The code provided by @kwotsin uses an exponential_decay learning rate, whose decay_steps is computed from num_epochs_before_decay and the number of batches per epoch. That makes decay_steps very dependent on the training data size: when your training data is small, decay_steps is small and your learning_rate decays rapidly. This might not affect the weight updates much, but its effect on the batch normalization parameter updates is significant.
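
Roughly, the schedule in question looks like this (a sketch in the spirit of the tutorial's training script; the exact variable names are assumptions):

num_batches_per_epoch = int(dataset.num_samples / batch_size)
decay_steps = int(num_epochs_before_decay * num_batches_per_epoch)   # small dataset -> small decay_steps

lr = tf.train.exponential_decay(
    learning_rate = initial_learning_rate,
    global_step = global_step,
    decay_steps = decay_steps,
    decay_rate = learning_rate_decay_factor,
    staircase = True)
# With few samples, num_batches_per_epoch is small, so the learning rate
# starts decaying after only a few hundred steps.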

As Rui explains here, batch normalization behaves differently during training and testing.
During Training:

  • Normalize layer activations according to mini-batch statistics.
  • During the training step, update population statistics approximation via moving average of mini-batch statistics.

During Testing:

  • Normalize layer activations according to estimated population statistics.
  • Do not update population statistics according to mini-batch statistics from test data.

Therefore, when you set is_training=True during testing, you get results similar to the training values, because mini-batch statistics are used to normalize the layer activations.
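
Conceptually, the two modes look like this (a plain-Python/NumPy sketch of the idea, not slim's actual batch_norm implementation; the decay and eps values are illustrative):

import numpy as np

def batch_norm_sketch(x, moving_mean, moving_var, is_training, decay=0.997, eps=0.001):
    # x: NumPy array of shape [batch, features]
    if is_training:
        mean, var = x.mean(axis=0), x.var(axis=0)        # mini-batch statistics
        # These two assignments correspond to the UPDATE_OPS you must run during training:
        moving_mean = decay * moving_mean + (1 - decay) * mean
        moving_var = decay * moving_var + (1 - decay) * var
    else:
        mean, var = moving_mean, moving_var              # estimated population statistics (not updated)
    return (x - mean) / np.sqrt(var + eps), moving_mean, moving_var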

If you have this problem:
Check whether your training data is small; if so, increase the value of num_epochs_before_decay or just use a constant learning_rate. Increasing the number of training steps also helps the batch normalization statistics get updated and brings the results closer to the training accuracy.

Note: If you are planning to freeze your trained model and use it elsewhere, setting is_training=True will not work.