keras-team/tf-keras

Apple Silicon M2: Loss Goes Up After 1-2 Epochs, but Not on Google Colab

Closed this issue · 12 comments

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Ventura 13.5.1
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): 2.12.0 (Google Colab), 2.12.1 (Mac)
  • Python version: 3.10.12 (Google Colab), 3.11.4 (Mac)
  • GPU model and memory: T4 12GB RAM (Google Colab), Apple M2 Max (Mac)

Describe the problem.

I run this code on Mac (Apple Silicon M2 Max)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Load the data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Normalize pixel values
x_train = x_train / 255.0

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("TensorFlow version: ", tf.__version__)

model.fit(x_train, y_train,
          epochs=10)

The loss starts increasing immediately after the first epoch:

/Users/felixm/anaconda3/envs/tf230808/bin/python /Users/felixm/PycharmProjects/TFExamStudy/Coursera/2309_scratch.py 
TensorFlow version:  2.12.1
2023-09-03 16:53:30.575673: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/10
1875/1875 [==============================] - 7s 3ms/step - loss: 0.3915 - accuracy: 0.8903
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 0.4136 - accuracy: 0.8952
Epoch 3/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.4905 - accuracy: 0.8897
Epoch 4/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.5585 - accuracy: 0.8878
Epoch 5/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.6251 - accuracy: 0.8857
Epoch 6/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.7148 - accuracy: 0.8848
Epoch 7/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.7854 - accuracy: 0.8822
Epoch 8/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.8779 - accuracy: 0.8816
Epoch 9/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.9782 - accuracy: 0.8821
Epoch 10/10
1875/1875 [==============================] - 6s 3ms/step - loss: 1.0523 - accuracy: 0.8798

Process finished with exit code 0

When I run the same code on Google Colab, the loss never increases:


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
TensorFlow version:  2.12.0
Epoch 1/10
1875/1875 [==============================] - 11s 3ms/step - loss: 0.2004 - accuracy: 0.9412
Epoch 2/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0810 - accuracy: 0.9750
Epoch 3/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0533 - accuracy: 0.9833
Epoch 4/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0363 - accuracy: 0.9883
Epoch 5/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0274 - accuracy: 0.9911
Epoch 6/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0204 - accuracy: 0.9937
Epoch 7/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0171 - accuracy: 0.9944
Epoch 8/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0131 - accuracy: 0.9955
Epoch 9/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0123 - accuracy: 0.9961
Epoch 10/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0114 - accuracy: 0.9960
<keras.callbacks.History at 0x780f114f2e60>

I am facing a similar issue with my Apple MacBook Air M2.

@felixm3,
As mentioned, the loss was increasing on both v2.12 and v2.13, whereas when we executed the same code on tf-nightly, the loss decreased and the accuracy increased. Kindly find the gist of it here and the macOS screenshot for reference.

Thank you!

Hi @tilakrayal.

Is your recommendation to fix this issue to install TensorFlow version 2.15 then?

@felixm3,
Yes, it was an issue on TensorFlow v2.12 and v2.13, and it was fixed on tf-nightly; the fix will be included in the upcoming TensorFlow stable release. Thank you!


I tried with tf-nightly@2.15.0.dev20231011; the problem still occurs.

I have the same problem. I just installed the latest TF version ('2.15.0-rc1') for my M2 Mac, and my MNIST accuracy on a generic "Hello World" model architecture is significantly lower than in all the examples with exactly the same architecture.

To test it, I just copied the code from a Coursera course:

from tensorflow import keras

(img_train, label_train), (img_test, label_test) = keras.datasets.fashion_mnist.load_data()

img_train = img_train.astype('float32') / 255.0
img_test = img_test.astype('float32') / 255.0

b_model = keras.Sequential()
b_model.add(keras.layers.Flatten(input_shape=(28, 28)))
b_model.add(keras.layers.Dense(units=512, activation='relu', name='dense_1')) # You will tune this layer later
b_model.add(keras.layers.Dropout(0.2))
b_model.add(keras.layers.Dense(10, activation='softmax'))

b_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                loss=keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'])

NUM_EPOCHS = 10

b_model.fit(img_train, label_train, epochs=NUM_EPOCHS, validation_split=0.2)

My val_accuracy on the last (10th) epoch is 0.8121; the same code in Google Colab yields 0.8798.

And the loss starts to increase after the first epoch, just as described before:

Epoch 1/10
   9/1500 [..............................] - ETA: 9s - loss: 1.8536 - accuracy: 0.3403  
2023-11-10 18:29:27.572235: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
2023-11-10 18:29:27.590041: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Adam/AssignAddVariableOp.
1500/1500 [==============================] - 9s 6ms/step - loss: 0.6242 - accuracy: 0.7904 - val_loss: 0.5325 - val_accuracy: 0.8293
Epoch 2/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.5740 - accuracy: 0.8128 - val_loss: 0.5735 - val_accuracy: 0.8299
Epoch 3/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.5921 - accuracy: 0.8137 - val_loss: 0.5299 - val_accuracy: 0.8384
Epoch 4/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.6266 - accuracy: 0.8121 - val_loss: 0.6594 - val_accuracy: 0.8000
Epoch 5/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.6714 - accuracy: 0.8074 - val_loss: 0.6754 - val_accuracy: 0.8223
Epoch 6/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.6945 - accuracy: 0.8095 - val_loss: 0.7407 - val_accuracy: 0.8025
Epoch 7/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.7269 - accuracy: 0.8083 - val_loss: 0.7297 - val_accuracy: 0.8076
Epoch 8/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.7516 - accuracy: 0.8076 - val_loss: 0.7026 - val_accuracy: 0.8145
Epoch 9/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.7732 - accuracy: 0.8071 - val_loss: 0.6357 - val_accuracy: 0.8290
Epoch 10/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.8255 - accuracy: 0.8070 - val_loss: 0.7302 - val_accuracy: 0.8157

I believe I've found a workaround: changing the hidden layer's activation function from relu to tanh does the trick.
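
For the model above, that's a one-line change (my sketch; everything else stays the same):

b_model.add(keras.layers.Dense(units=512, activation='tanh', name='dense_1'))  # was activation='relu'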

I have a similar issue. TL;DR: the root cause seems to be somewhere in tensorflow-metal.

I'm trying to run book examples.

Distilled code version:
import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]

X_train, X_valid, X_test = X_train / 255., X_valid / 255., X_test / 255.

tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(100, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(10, activation=tf.keras.activations.softmax),
])
model.summary()

model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid))

After some number of epochs, the accuracy drops and the loss explodes (you can check the full logs attached):

Epoch 15/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.9868 - accuracy: 0.7996 - val_loss: 1.0065 - val_accuracy: 0.8128
Epoch 16/30
1719/1719 [==============================] - 8s 5ms/step - loss: 1.1898 - accuracy: 0.7952 - val_loss: 2.3624 - val_accuracy: 0.7086
Epoch 17/30
1719/1719 [==============================] - 8s 5ms/step - loss: 1.6663 - accuracy: 0.7837 - val_loss: 2.4961 - val_accuracy: 0.7822
Epoch 18/30
1719/1719 [==============================] - 8s 5ms/step - loss: 2.4820 - accuracy: 0.7758 - val_loss: 4.9756 - val_accuracy: 0.6454
Epoch 19/30
1719/1719 [==============================] - 9s 5ms/step - loss: 4.7354 - accuracy: 0.7551 - val_loss: 10.3181 - val_accuracy: 0.6346
Epoch 20/30
1719/1719 [==============================] - 8s 5ms/step - loss: 11.1815 - accuracy: 0.7318 - val_loss: 10.7277 - val_accuracy: 0.7696
Epoch 21/30
1719/1719 [==============================] - 8s 5ms/step - loss: 36.0803 - accuracy: 0.7089 - val_loss: 49.7303 - val_accuracy: 0.7086
Epoch 22/30
1719/1719 [==============================] - 8s 5ms/step - loss: 165.2742 - accuracy: 0.6973 - val_loss: 651.5690 - val_accuracy: 0.5884
Epoch 23/30
1719/1719 [==============================] - 8s 5ms/step - loss: nan - accuracy: 0.6632 - val_loss: nan - val_accuracy: 0.1042
Epoch 24/30
1719/1719 [==============================] - 8s 5ms/step - loss: nan - accuracy: 0.0996 - val_loss: nan - val_accuracy: 0.1042

I use a MacBook Air with Apple M2 running Sonoma 14.1.1.

$ python --version
Python 3.11.6

TensorFlow packages installed (I tried the same with 2.14.0):

$ pip list | grep tensorflow   
tensorflow                   2.15.0
tensorflow-estimator         2.15.0
tensorflow-io-gcs-filesystem 0.34.0
tensorflow-macos             2.15.0
tensorflow-metal             1.1.0

I found that the above example works fine if I uninstall tensorflow-metal and move all the workloads to the CPU. (Activity Monitor screenshots, taken with and without the Metal plugin, were attached here.)

Also, in this case the CPU-only running time is even better:

Epoch 8/30
1719/1719 [==============================] - 1s 673us/step - loss: 0.3569 - accuracy: 0.8745 - val_loss: 0.3749 - val_accuracy: 0.8616

But this doesn't work for bigger (different) workloads; I tested on Apple's example and got much worse performance without Metal support.
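
If you want to keep tensorflow-metal installed for the bigger workloads but run this particular model on the CPU, pinning it with a device scope should also work (a sketch, not something I tested here):

import tensorflow as tf

# Build and train on the CPU only, leaving the Metal GPU available for
# other workloads. Assumes X_train/y_train/X_valid/y_valid from the
# example above.
with tf.device('/CPU:0'):
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        tf.keras.layers.Dense(300, activation='relu'),
        tf.keras.layers.Dense(100, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='sgd', metrics=['accuracy'])
    history = model.fit(X_train, y_train, epochs=30,
                        validation_data=(X_valid, y_valid))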

Also, I've tried:

  • Changing relu to tanh - kinda works in my case.
  • Casting X_train to float32 (to match the model parameters' dtype) - doesn't help.

Instead of uninstalling, you can disable the GPU device as suggested in another similar issue:

import tensorflow as tf
# Keep only the first visible device (typically the CPU), hiding the Metal GPU
hw = tf.config.get_visible_devices()
tf.config.set_visible_devices(hw[0])
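
A variant that doesn't depend on device ordering is to hide all GPUs explicitly (my sketch; it has to run before any ops execute, otherwise TensorFlow raises a RuntimeError):

import tensorflow as tf

# Hide every GPU (including the Metal PluggableDevice) from TensorFlow.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices())  # should now list only the CPU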

tf-issue-with-metal.txt
tf-issue-without-metal.txt

Okay, I have run the test in a Python environment with Metal and without Metal, and I can confirm that it is indeed the case: tensorflow-metal results in a much lower validation score.

Looks like the reason is tensorflow-metal.


I had a similar problem, and it was also solved by uninstalling tensorflow-metal.

I believe I've found a workaround: changing the hidden layer's activation function from relu to tanh does the trick.

Also, the softplus activation function does not seem to suffer from this problem.
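
To compare the activations side by side, here's a small harness (my sketch; the seed and epoch count are arbitrary) that trains the same MNIST model with each one and prints the per-epoch loss. On an affected tensorflow-metal setup, relu should show the increasing loss, while tanh and softplus should not:

import tensorflow as tf

# Train the same model once per activation and print the loss curve.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0

for act in ('relu', 'tanh', 'softplus'):
    tf.keras.utils.set_random_seed(42)  # same initialization for each run
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation=act),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=5, verbose=0)
    print(act, ['%.4f' % loss for loss in history.history['loss']])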