SciSharp/TensorFlow.NET

[Performance Issue]: Model Convergence issue in Keras with Parallel Execution of the fit() Method ?

AdrienDeverin opened this issue · 7 comments

Brief Description

I've encountered a perplexing issue while utilizing Keras and its fit() function to train a standard CNN.

To illustrate, consider the following code snippet where the model learns and converges successfully:

// Get the data
int epoch = 100;
TensorLoader tensorLoader = new TensorLoader();
(NDArray xData, NDArray yData) = tensorLoader.GetDatasFromZeroFormat(@"C:\GitHub\CSharpCode\Solutions\PNN_Master_Delta\_Datas_\Cnn_Inputs\GSCD_4+1Classes_Tensors_trigger_16_zeroF_v2_f6");

// Model initialization and setup
IModel? model = keras.models.load_model(@"C:\GitHub\CSharpCode\Solutions\PNN_Master_Delta\_Datas_\Cnn_Model\GSCD_16x90_4+1classes_trigger_n00");
IOptimizer optimizer = keras.optimizers.Adam(0.01f);
ILossFunc loss = keras.losses.BinaryCrossentropy();

// Compile (IMPORTANTE LINE)
model.compile(optimizer: optimizer, loss: loss, metrics: new string[1] { "acc" });

// Parallel execution for training
Thread t = new Thread(() =>
{
    List<ICallback> callbacks = new List<ICallback> { new EarlyStoppingCallback(
                    new Tensorflow.Keras.Callbacks.CallbackParams { Model = model, Epochs = epoch, Verbose = -1},
                    monitor: "val_loss", patience: 5) }; // Early Stopping if problem 
    model.fit(xData, yData,
             batch_size: 32,
             epochs: epoch,
             validation_split: 0.1f,
             callbacks: callbacks,
             shuffle: false, verbose: -1);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

However, when the model.compile() and model.fit() functions are executed sequentially in the same thread, the model seemingly learns but fails to converge, as demonstrated below:

IModel? model = null;
string folder = @"C:\GitHub\CSharpCode\Solutions\PNN_Master_Delta\_Datas_\Cnn_Model\GSCD_16x90_4+1classes_trigger_n00"; // a path with a saved_model 

Thread t = new Thread(() =>
{
    // Loading the model
    model = keras.models.load_model(folder);
    IOptimizer optimizer = keras.optimizers.Adam(0.01f);
    ILossFunc loss = keras.losses.BinaryCrossentropy();
    model.compile(optimizer: optimizer, loss: loss, metrics: new string[1] { "acc" });

    // Model training
    List<ICallback> callbacks = new List<ICallback> { new EarlyStoppingCallback(
                    new Tensorflow.Keras.Callbacks.CallbackParams { Model = model, Epochs = epoch, Verbose = -1},
                    monitor: "val_loss", patience: 5) }; // Early Stopping if problem 
    model.fit(xData, yData,
             batch_size: 32,
             epochs: epoch,
             validation_split: 0.1f,
             callbacks: callbacks,
             shuffle: false, verbose: -1);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

Remarkably, the convergence issue appears to be entirely deterministic (used obviously over the same dataset) between these two examples. The logs are send below.
Despite thorough investigation, I've failed to identify any critical differences. All variables remain consistent.

Specifically, it seems that the function model.compile() must not be executed in the same thread as the fit() function for successful convergence. I don't understand why...

Any insights or suggestions on this peculiar behavior would be greatly appreciated. Datas used can be given.

Best regards,
DEVERIN Adrien

Device and Context

Used on CPU

Benchmark

Logs example 1 (convergence working) :

2024-05-13 16:51:59.842255: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-13 16:51:59.847909: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Data Loaded sucessfully ...
Epoch: 001/100
0076/0076 [=====] - 5ms/step - loss: 0,272631 - accuracy: 0,883721 - val_loss: 0,274186 - val_accuracy: 0,902622
Epoch: 002/100
0076/0076 [=====] - 5ms/step - loss: 0,271166 - accuracy: 0,883721 - val_loss: 0,272107 - val_accuracy: 0,902622
Epoch: 003/100
0076/0076 [=====] - 5ms/step - loss: 0,270275 - accuracy: 0,880814 - val_loss: 0,270979 - val_accuracy: 0,902622
Epoch: 004/100
0076/0076 [=====] - 6ms/step - loss: 0,269701 - accuracy: 0,880814 - val_loss: 0,270195 - val_accuracy: 0,902622
Epoch: 005/100
0076/0076 [=====] - 5ms/step - loss: 0,269153 - accuracy: 0,879568 - val_loss: 0,269510 - val_accuracy: 0,902622
Epoch: 006/100
0076/0076 [=====] - 5ms/step - loss: 0,268411 - accuracy: 0,883306 - val_loss: 0,268801 - val_accuracy: 0,902622
Epoch: 007/100
0076/0076 [=====] - 5ms/step - loss: 0,267722 - accuracy: 0,884136 - val_loss: 0,268095 - val_accuracy: 0,902622
Epoch: 008/100
0076/0076 [=====] - 5ms/step - loss: 0,267106 - accuracy: 0,885382 - val_loss: 0,267437 - val_accuracy: 0,902622
Epoch: 009/100
0076/0076 [=====] - 5ms/step - loss: 0,266548 - accuracy: 0,886628 - val_loss: 0,266843 - val_accuracy: 0,902622
Epoch: 010/100
0076/0076 [=====] - 5ms/step - loss: 0,266097 - accuracy: 0,886628 - val_loss: 0,266379 - val_accuracy: 0,902622
Epoch: 011/100
0076/0076 [=====] - 5ms/step - loss: 0,265732 - accuracy: 0,886213 - val_loss: 0,265995 - val_accuracy: 0,902622
Epoch: 012/100
0076/0076 [=====] - 6ms/step - loss: 0,265397 - accuracy: 0,883721 - val_loss: 0,265670 - val_accuracy: 0,902622
Epoch: 013/100
0076/0076 [=====] - 5ms/step - loss: 0,265122 - accuracy: 0,885797 - val_loss: 0,265376 - val_accuracy: 0,902622
Epoch: 014/100
0076/0076 [=====] - 5ms/step - loss: 0,264859 - accuracy: 0,885382 - val_loss: 0,265103 - val_accuracy: 0,902622
Epoch: 015/100
0076/0076 [=====] - 7ms/step - loss: 0,264639 - accuracy: 0,885797 - val_loss: 0,264853 - val_accuracy: 0,902622
Epoch: 016/100
0076/0076 [=====] - 6ms/step - loss: 0,264446 - accuracy: 0,886213 - val_loss: 0,264638 - val_accuracy: 0,902622
Epoch: 017/100
0076/0076 [=====] - 5ms/step - loss: 0,264189 - accuracy: 0,885797 - val_loss: 0,264377 - val_accuracy: 0,898876
Epoch: 018/100
0076/0076 [=====] - 6ms/step - loss: 0,263984 - accuracy: 0,884551 - val_loss: 0,264165 - val_accuracy: 0,898876
Epoch: 019/100
0076/0076 [=====] - 4ms/step - loss: 0,263786 - accuracy: 0,884551 - val_loss: 0,263960 - val_accuracy: 0,898876
Epoch: 020/100
0076/0076 [=====] - 4ms/step - loss: 0,263596 - accuracy: 0,885382 - val_loss: 0,263759 - val_accuracy: 0,898876

ect ...

Logs example 2 (convergence doesn't work) :

2024-05-13 16:54:19.347661: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-13 16:54:19.354123: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Data Loaded sucessfully ...
Epoch: 001/100
0076/0076 [=====] - 5ms/step - loss: 0,269264 - accuracy: 0,886213 - val_loss: 0,274375 - val_accuracy: 0,880150
Epoch: 002/100
0076/0076 [=====] - 5ms/step - loss: 0,268268 - accuracy: 0,887043 - val_loss: 0,268760 - val_accuracy: 0,880150
Epoch: 003/100
0076/0076 [=====] - 5ms/step - loss: 0,267617 - accuracy: 0,884551 - val_loss: 0,271331 - val_accuracy: 0,880150
Epoch: 004/100
0076/0076 [=====] - 7ms/step - loss: 0,266189 - accuracy: 0,886213 - val_loss: 0,273943 - val_accuracy: 0,880150
Epoch: 005/100
0076/0076 [=====] - 5ms/step - loss: 0,266029 - accuracy: 0,884136 - val_loss: 0,275886 - val_accuracy: 0,880150
Epoch: 006/100
0076/0076 [=====] - 5ms/step - loss: 0,264438 - accuracy: 0,885797 - val_loss: 0,278307 - val_accuracy: 0,876405
Epoch: 007/100
0076/0076 [=====] - 5ms/step - loss: 0,263542 - accuracy: 0,884967 - val_loss: 0,277512 - val_accuracy: 0,880150
Epoch: 008/100
0076/0076 [=====] - 5ms/step - loss: 0,264521 - accuracy: 0,884967 - val_loss: 0,280630 - val_accuracy: 0,876405
Restoring model weights from the end of the best epoch...

Alternatives

No response

Hi, the losses in the two logs seem to have little difference. In the second log, only the last step has an increasing loss. What is the loss like after that?

Are you using the debug mode or release mode? If you are in debug mode, will the result differ under release mode?

I have no idea about it yet. model.compile is just to gather all the information as a preparation for graph building. What if put the following code in the main thread

model = keras.models.load_model(folder);
IOptimizer optimizer = keras.optimizers.Adam(0.01f);
ILossFunc loss = keras.losses.BinaryCrossentropy();

but keep model.compile in the separated thread along with model.fit?

I'll add another strange observation. I tried another example, the one from the TensorFlow.NET GitHub page, and got the opposite result!

Below is the code used:

// Load data
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
x_train = x_train / 255.0f;

// Construct Model (could be replace by a the load_model() function, it will have the same impact)
var layers = keras.layers;
var inputs = keras.Input(shape: (32, 32, 3), name: "img");
var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
var block_1_output = layers.MaxPooling2D(3).Apply(x);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
x = layers.GlobalAveragePooling2D().Apply(x);
x = layers.Dense(256, activation: "relu").Apply(x);
x = layers.Dropout(0.5f).Apply(x);
var outputs = layers.Dense(10).Apply(x);
var model = keras.Model(inputs, outputs, name: "toy_resnet");

// Compile
model.compile(optimizer: keras.optimizers.RMSprop(1e-3f), loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true), metrics: new[] { "acc" });

Thread t = new Thread(() =>
{
	model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
          	batch_size: 64,
          	epochs: 30,
          	validation_split: 0.2f);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

This example, similar to example 1 in my thread, DOESN'T CONVERGE!

Contrarily, this code, similar to my example 2, converges:

// Load data
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
x_train = x_train / 255.0f;

Thread t = new Thread(() =>
{
	// Construct Model (could be replace by a the load_model() function, it will have the same impact)
	var layers = keras.layers;
	var inputs = keras.Input(shape: (32, 32, 3), name: "img");
	var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
	x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
	var block_1_output = layers.MaxPooling2D(3).Apply(x);
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
	var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
	x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
	var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
	x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
	x = layers.GlobalAveragePooling2D().Apply(x);
	x = layers.Dense(256, activation: "relu").Apply(x);
	x = layers.Dropout(0.5f).Apply(x);
	var outputs = layers.Dense(10).Apply(x);
	var model = keras.Model(inputs, outputs, name: "toy_resnet");

	// Compile
	model.compile(optimizer: keras.optimizers.RMSprop(1e-3f), loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true), metrics: new[] { "acc" });
	model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
          	batch_size: 64,
          	epochs: 30,
          	validation_split: 0.2f);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

This reveals a convergence issue opposite to the one I initially described in this problem. Furthermore, in this exemple, the more the graph is set in the same thread to the model.fit() code, the more the convergence will occur.

To support this, consider situation 3 (a mix of 1 and 2):

// Load data
var ((x_train, y_train), (x_test, y_test)) = keras.datasets.cifar10.load_data();
x_train = x_train / 255.0f;
// Construct Model (could be replace by a the load_model() function, it will have the same impact)
var layers = keras.layers;
var inputs = keras.Input(shape: (32, 32, 3), name: "img");
var x = layers.Conv2D(32, 3, activation: "relu").Apply(inputs);
x = layers.Conv2D(64, 3, activation: "relu").Apply(x);
var block_1_output = layers.MaxPooling2D(3).Apply(x);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_1_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);
var block_2_output = layers.Add().Apply(new Tensors(x, block_1_output));
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(block_2_output);
x = layers.Conv2D(64, 3, activation: "relu", padding: "same").Apply(x);

Thread t = new Thread(() =>
{
	
	var block_3_output = layers.Add().Apply(new Tensors(x, block_2_output));
	x = layers.Conv2D(64, 3, activation: "relu").Apply(block_3_output);
	x = layers.GlobalAveragePooling2D().Apply(x);
	x = layers.Dense(256, activation: "relu").Apply(x);
	x = layers.Dropout(0.5f).Apply(x);
	var outputs = layers.Dense(10).Apply(x);
	var model = keras.Model(inputs, outputs, name: "toy_resnet");

	// Compile
	model.compile(optimizer: keras.optimizers.RMSprop(1e-3f), loss: keras.losses.SparseCategoricalCrossentropy(from_logits: true), metrics: new[] { "acc" });
	model.fit(x_train[new Slice(0, 2000)], y_train[new Slice(0, 2000)],
          	batch_size: 64,
          	epochs: 30,
          	validation_split: 0.2f);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();

This produces a better loss (acc = 0.3) than code 1 (acc = 0.1) but less than code 2 (acc = 0.45). This behavior is totally deterministic (not just one test has been done).

Edit : But all of this have maybe an impact due to the optimizer used. I replace RSMprop by Adam and this exemple converge in case 1,2,3.

I can't explain this behavior, which is totally opposite to my original problem but also indicates an underlying problem with the threads..

That's so weird! What's your code like after t.Start()?

@AsakusaRinne there are litterally nothing