Apple Silicon M2 Loss Goes Up After 1-2 (Few) Epochs but not on Google Colab
Closed this issue · 12 comments
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mac OS Ventura 13.5.1
- TensorFlow installed from (source or binary): Source
- TensorFlow version (use command below): 2.12.0 (Google Colab), 2.12.1 (Mac)
- Python version: 3.10.12 (Google Colab), 3.11.4 (Mac)
- GPU model and memory: T4 12GB RAM (Google Colab), Apple M2 Max (Mac)
Describe the problem.
I run this code on Mac (Apple Silicon M2 Max)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
# Load the data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
# Normalize pixel values
x_train = x_train / 255.0
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
print("TensorFlow version: ", tf.__version__)
model.fit(x_train, y_train,
          epochs=10)
Loss immediately starts increasing after 1 epoch.
/Users/felixm/anaconda3/envs/tf230808/bin/python /Users/felixm/PycharmProjects/TFExamStudy/Coursera/2309_scratch.py
TensorFlow version: 2.12.1
2023-09-03 16:53:30.575673: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/10
1875/1875 [==============================] - 7s 3ms/step - loss: 0.3915 - accuracy: 0.8903
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 0.4136 - accuracy: 0.8952
Epoch 3/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.4905 - accuracy: 0.8897
Epoch 4/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.5585 - accuracy: 0.8878
Epoch 5/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.6251 - accuracy: 0.8857
Epoch 6/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.7148 - accuracy: 0.8848
Epoch 7/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.7854 - accuracy: 0.8822
Epoch 8/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.8779 - accuracy: 0.8816
Epoch 9/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.9782 - accuracy: 0.8821
Epoch 10/10
1875/1875 [==============================] - 6s 3ms/step - loss: 1.0523 - accuracy: 0.8798
Process finished with exit code 0
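To pin down where the divergence starts, the per-epoch loss can be pulled out of the Keras progress output. The following is only a minimal sketch: `first_rising_epoch` and `LOSS_RE` are hypothetical names, and it assumes the log contains only the final summary line of each epoch (no intermediate ETA lines).

```python
import re

# Matches the training loss in a per-epoch summary line such as
# "... - loss: 0.3915 - accuracy: 0.8903". It does not match "val_loss".
LOSS_RE = re.compile(r" - loss: ([0-9.]+|nan)")


def first_rising_epoch(log_text):
    """Return the 1-based epoch whose loss first exceeds the previous
    epoch's loss, or None if the loss never increases."""
    losses = [float(m.group(1)) for m in LOSS_RE.finditer(log_text)]
    for i in range(1, len(losses)):
        if losses[i] > losses[i - 1]:
            return i + 1
    return None


# First three epochs of the Mac log above.
mac_log = """Epoch 1/10
1875/1875 [==============================] - 7s 3ms/step - loss: 0.3915 - accuracy: 0.8903
Epoch 2/10
1875/1875 [==============================] - 7s 4ms/step - loss: 0.4136 - accuracy: 0.8952
Epoch 3/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.4905 - accuracy: 0.8897
"""

print(first_rising_epoch(mac_log))  # prints 2
```

Applied to the full Mac log, this reports epoch 2, matching the observation that the loss rises right after the first epoch.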
Running the same code on Google Colab, the loss never increases.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
TensorFlow version: 2.12.0
Epoch 1/10
1875/1875 [==============================] - 11s 3ms/step - loss: 0.2004 - accuracy: 0.9412
Epoch 2/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0810 - accuracy: 0.9750
Epoch 3/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0533 - accuracy: 0.9833
Epoch 4/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0363 - accuracy: 0.9883
Epoch 5/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0274 - accuracy: 0.9911
Epoch 6/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0204 - accuracy: 0.9937
Epoch 7/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0171 - accuracy: 0.9944
Epoch 8/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0131 - accuracy: 0.9955
Epoch 9/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0123 - accuracy: 0.9961
Epoch 10/10
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0114 - accuracy: 0.9960
<keras.callbacks.History at 0x780f114f2e60>
I am facing a similar issue with my Apple MacBook Air M2.
Hi @tilakrayal.
Is your recommendation to fix this issue to install TensorFlow version 2.15 then?
@felixm3,
Yeah. It was an issue in TensorFlow v2.12 and v2.13. It has been fixed in tf-nightly, and the fix will be included in the upcoming stable TensorFlow release. Thank you!
I tried with tf-nightly@2.15.0.dev20231011, and the problem still occurs.
I have the same problem. Just installed the latest TF version ('2.15.0-rc1') for my M2 Mac and my MNIST accuracy on a generic "Hello World" model architecture is significantly lower than in all the examples with exactly the same architecture.
To test it, I just copied the code from a Coursera course:
from tensorflow import keras
(img_train, label_train), (img_test, label_test) = keras.datasets.fashion_mnist.load_data()
img_train = img_train.astype('float32') / 255.0
img_test = img_test.astype('float32') / 255.0
b_model = keras.Sequential()
b_model.add(keras.layers.Flatten(input_shape=(28, 28)))
b_model.add(keras.layers.Dense(units=512, activation='relu', name='dense_1')) # You will tune this layer later
b_model.add(keras.layers.Dropout(0.2))
b_model.add(keras.layers.Dense(10, activation='softmax'))
b_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                loss=keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'])
NUM_EPOCHS = 10
b_model.fit(img_train, label_train, epochs=NUM_EPOCHS, validation_split=0.2)
My val_accuracy on the last (10th) epoch is 0.8121
Same code in Google Colab yields 0.8798
And the loss starts to increase after the first epoch just as described before:
Epoch 1/10
9/1500 [..............................] - ETA: 9s - loss: 1.8536 - accuracy: 0.3403
2023-11-10 18:29:27.572235: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
2023-11-10 18:29:27.590041: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] model_pruner failed: INVALID_ARGUMENT: Graph does not contain terminal node Adam/AssignAddVariableOp.
1500/1500 [==============================] - 9s 6ms/step - loss: 0.6242 - accuracy: 0.7904 - val_loss: 0.5325 - val_accuracy: 0.8293
Epoch 2/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.5740 - accuracy: 0.8128 - val_loss: 0.5735 - val_accuracy: 0.8299
Epoch 3/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.5921 - accuracy: 0.8137 - val_loss: 0.5299 - val_accuracy: 0.8384
Epoch 4/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.6266 - accuracy: 0.8121 - val_loss: 0.6594 - val_accuracy: 0.8000
Epoch 5/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.6714 - accuracy: 0.8074 - val_loss: 0.6754 - val_accuracy: 0.8223
Epoch 6/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.6945 - accuracy: 0.8095 - val_loss: 0.7407 - val_accuracy: 0.8025
Epoch 7/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.7269 - accuracy: 0.8083 - val_loss: 0.7297 - val_accuracy: 0.8076
Epoch 8/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.7516 - accuracy: 0.8076 - val_loss: 0.7026 - val_accuracy: 0.8145
Epoch 9/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.7732 - accuracy: 0.8071 - val_loss: 0.6357 - val_accuracy: 0.8290
Epoch 10/10
1500/1500 [==============================] - 8s 5ms/step - loss: 0.8255 - accuracy: 0.8070 - val_loss: 0.7302 - val_accuracy: 0.8157
I believe I have found a solution: changing the hidden layer's activation function from relu to tanh does the trick.
I have a similar issue. TL;DR: the root cause seems to be somewhere in tensorflow-metal.
I'm trying to run book examples.
Distilled code version:
import tensorflow as tf
fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]
X_train, X_valid, X_test = X_train / 255., X_valid / 255., X_test / 255.
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(100, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(10, activation=tf.keras.activations.softmax),
])
model.summary()
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid))
After some number of epochs, the accuracy drops and the loss explodes (full logs are attached):
Epoch 15/30
1719/1719 [==============================] - 8s 5ms/step - loss: 0.9868 - accuracy: 0.7996 - val_loss: 1.0065 - val_accuracy: 0.8128
Epoch 16/30
1719/1719 [==============================] - 8s 5ms/step - loss: 1.1898 - accuracy: 0.7952 - val_loss: 2.3624 - val_accuracy: 0.7086
Epoch 17/30
1719/1719 [==============================] - 8s 5ms/step - loss: 1.6663 - accuracy: 0.7837 - val_loss: 2.4961 - val_accuracy: 0.7822
Epoch 18/30
1719/1719 [==============================] - 8s 5ms/step - loss: 2.4820 - accuracy: 0.7758 - val_loss: 4.9756 - val_accuracy: 0.6454
Epoch 19/30
1719/1719 [==============================] - 9s 5ms/step - loss: 4.7354 - accuracy: 0.7551 - val_loss: 10.3181 - val_accuracy: 0.6346
Epoch 20/30
1719/1719 [==============================] - 8s 5ms/step - loss: 11.1815 - accuracy: 0.7318 - val_loss: 10.7277 - val_accuracy: 0.7696
Epoch 21/30
1719/1719 [==============================] - 8s 5ms/step - loss: 36.0803 - accuracy: 0.7089 - val_loss: 49.7303 - val_accuracy: 0.7086
Epoch 22/30
1719/1719 [==============================] - 8s 5ms/step - loss: 165.2742 - accuracy: 0.6973 - val_loss: 651.5690 - val_accuracy: 0.5884
Epoch 23/30
1719/1719 [==============================] - 8s 5ms/step - loss: nan - accuracy: 0.6632 - val_loss: nan - val_accuracy: 0.1042
Epoch 24/30
1719/1719 [==============================] - 8s 5ms/step - loss: nan - accuracy: 0.0996 - val_loss: nan - val_accuracy: 0.1042
I use MacBook Air with Apple M2 running Sonoma 14.1.1.
$ python --version
Python 3.11.6
TensorFlow packages installed (I tried the same with 2.14.0):
$ pip list | grep tensorflow
tensorflow 2.15.0
tensorflow-estimator 2.15.0
tensorflow-io-gcs-filesystem 0.34.0
tensorflow-macos 2.15.0
tensorflow-metal 1.1.0
I found that the above example works fine if I uninstall tensorflow-metal and move all the workloads to the CPU.
[Screenshots: Activity Monitor with the Metal plugin, and without the Metal plugin]
Also, in this case the CPU-only running time is even better:
Epoch 8/30
1719/1719 [==============================] - 1s 673us/step - loss: 0.3569 - accuracy: 0.8745 - val_loss: 0.3749 - val_accuracy: 0.8616
But this doesn't work for bigger (different) workloads: I tested Apple's example and got much worse performance without Metal support.
Also, I've tried:
- Changing relu to tanh: kind of works in my case.
- Casting X_train to float32 (to match the model parameter type): doesn't help.
Instead of uninstalling, you can disable the GPU device, as suggested in another similar issue:
hw = tf.config.get_visible_devices()
tf.config.set_visible_devices(hw[0])
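The snippet above relies on the CPU being the first entry in the device list. A variant that filters by device type instead might look like the sketch below. It assumes it runs at the very start of the program, before any tensors or models are created, because the set of visible devices cannot be changed once the runtime has been initialized.

```python
import tensorflow as tf

# Hide every GPU device (the Metal GPU on Apple Silicon) from TensorFlow.
# Passing an empty list for the "GPU" device type leaves the CPU untouched.
tf.config.set_visible_devices([], "GPU")

# Confirm which devices TensorFlow will still use.
print(tf.config.get_visible_devices())
```

With the GPU hidden, training runs on the CPU without having to uninstall tensorflow-metal, so the plugin can be re-enabled later by removing that one line.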
I had a similar problem, and it was also solved by uninstalling tensorflow-metal.
I believe I had found a solution: change the hidden layer's activation function from relu to tanh does the trick.
Also, the softplus activation function does not seem to suffer from this problem.
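To compare the activations reported in this thread under otherwise identical settings, the hidden-layer activation can be made a parameter. This is only a sketch: `build_model` is a hypothetical helper, the layer sizes are taken from the original report, and the training call is left commented out because it needs the dataset to be loaded first.

```python
import tensorflow as tf


def build_model(hidden_activation="relu"):
    """Build the 512-unit MNIST classifier from the original report,
    with a configurable hidden-layer activation (hypothetical helper)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation=hidden_activation),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])


# Compile one model per activation; train each one to compare loss curves.
for name in ("relu", "tanh", "softplus"):
    model = build_model(name)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(x_train, y_train, epochs=10)  # x_train, y_train as loaded above
```

Keeping everything except the activation fixed makes it easier to tell whether a rising loss is tied to relu on the Metal backend or to something else in the setup.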