keras-team/tf-keras

LSTM - different outputs for same weights across CPU and GPU, when using float32 + tf-keras + NVIDIA A100


System information

  • Custom Code: YES
  • OS: SUSE Linux Enterprise High Performance Computing 15 SP5
  • TensorFlow installed from: DOCKER (tensorflow/tensorflow:2.16.1-gpu-jupyter)
  • TensorFlow version: v2.16.1-0-g5bc9d26649c 2.16.1
  • Python version: 3.11
  • GPU model and memory: NVIDIA A100-PCIE-40GB
  • Code to reproduce: see below

Describe the problem
I have a model consisting almost entirely of LSTM layers. If I load the same weights into two copies of the model, one instantiated to run on the CPU and one on the GPU, the results are different.

The issue disappears (the GPU results change to match the CPU results) if I change any of the following:

  • Move from
    • SLES + NVIDIA A100 + Driver Version: 550.54.14 + CUDA Version: 12.4
      to
    • Ubuntu 22.04.4 LTS + NVIDIA V100 + Driver Version: 535.161.07 + CUDA Version: 12.2
  • Set keras.backend.set_floatx('float64')
  • Use keras 3 instead of tf-keras

In all of these cases I'm running the same official Docker image; my only modification has been to install tf-keras==2.16.0 and plotly.

Standalone code to reproduce the issue.

!pip install plotly
!pip install tf-keras==2.16.0
import os
import tensorflow as tf

import numpy as np

USE_TF_KERAS = True

if USE_TF_KERAS:
    import tf_keras as keras
    from tf_keras import layers
    from tf_keras import initializers
    from tf_keras import backend as K
else:
    import keras
    from keras import layers
    from keras import initializers
    from keras import backend as K

# Setting float64 as default dtype removes the discrepancy between CPU and GPU!
# keras.backend.set_floatx('float64')
from plotly import graph_objects as go

ROOT_DIR = os.getcwd()

n_time_steps = 800

theta = np.linspace(0, 2 * np.pi, n_time_steps).reshape(1, -1)

np.random.seed(42)
tf.random.set_seed(42)
dummy_input_dict = {
    "input_a": 800
    * np.stack((np.cos(theta), np.sin(theta)), axis=-1).astype(np.float32),
    "input_b": np.random.rand(1, n_time_steps, 5).astype(np.float32),
}


# Build a stack of eight 1024-unit LSTM layers with fixed-seed initializers,
# followed by a Dense(1) head.
def build_model():
    input_a = layers.Input(shape=(n_time_steps, 2), name="input_a")
    input_b = layers.Input(shape=(n_time_steps, 5), name="input_b")

    x = layers.Concatenate()([input_a, input_b])
    for idx in range(8):
        lstm_layer = layers.LSTM(
                1024,
                kernel_initializer=initializers.RandomNormal(seed=42 + idx),
                recurrent_initializer=initializers.RandomNormal(seed=52 + idx),
                return_sequences=True,
            )
        x = lstm_layer(x)
    y = layers.Dense(1)(x)
    model = keras.Model(inputs=[input_a, input_b], outputs=y)

    return model


# Rebuild the model on the requested device, load the shared weights,
# and run inference on the dummy inputs.
def main(device):
    with tf.device(device):
        model = build_model()
        model.load_weights("my_initial_weights.h5")

        features = ["input_a", "input_b"]
        dummy_input = [dummy_input_dict[k] for k in features]
        preds = model.predict(dummy_input)

    return preds

# Save one set of weights, so that the CPU and GPU copies of the model load identical weights
with tf.device("/device:CPU:0"):
    model = build_model()
    model.save_weights("my_initial_weights.h5")


# List logical devices to confirm that both a CPU and a GPU are visible.
tf.config.list_logical_devices()

cpu_preds = main("/device:CPU:0")
gpu_preds = main("/device:GPU:0")

cpu_output = cpu_preds[0, :, 0]
gpu_output = gpu_preds[0, :, 0]

fig = go.Figure()
fig.add_trace(go.Scatter(y=cpu_output, name="CPU"))
fig.add_trace(go.Scatter(y=gpu_output, name="GPU"))
fig.show()

Resulting plot:

[image: plot of the CPU and GPU predictions]
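
For a quick numerical summary of the discrepancy, something like the following can be appended to the script above (a sketch reusing the cpu_output / gpu_output arrays defined there):

# Quantify the CPU/GPU discrepancy instead of only plotting it.
abs_diff = np.abs(cpu_output - gpu_output)
rel_diff = abs_diff / (np.abs(cpu_output) + 1e-12)  # epsilon guards against division by zero
print("max abs diff:", abs_diff.max())
print("max rel diff:", rel_diff.max())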

As mentioned at the beginning:

  • switching to the V100 host
  • uncommenting # keras.backend.set_floatx('float64')
  • setting USE_TF_KERAS = False

All work around the issue, and the GPU predictions match the CPU predictions.

I also reiterate that all of this was run in the official tensorflow/tensorflow:2.16.1-gpu-jupyter container, on both hosts.

@sachinprasadhs,
I was able to reproduce the issue on TensorFlow v2.15 with tf-keras. Kindly find the gist of it here.

@tilakrayal - the gist shows a very small difference between CPU/GPU predictions, similar to what I see on my V100 host. I wouldn't be surprised if differences that small were in fact expected.

But on my A100 host the difference becomes orders of magnitude larger. Is there a way to replicate my "problematic system" (NVIDIA A100 + Driver Version: 550.54.14 + CUDA Version: 12.4) on Colab, so that hopefully you can also see the extent of the problem beyond the screenshots I can share?

Thanks!

I've updated the V100 system. It now has the exact same driver + CUDA as the A100 system (Driver Version: 550.54.14 + CUDA Version: 12.4), and still does not replicate the issue. So the issue seems specific to execution on the A100. How can we replicate on Colab? Thanks.

Latest update: I got hold of an H200 system, which exhibits the same issue I see on the A100. I've also become aware of the relatively new TF32 format (https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution), which is apparently enabled by default on A100 and newer GPUs!

Indeed, if I modify my example script and set tf.config.experimental.enable_tensor_float_32_execution(False), the numerical issues disappear, and the A100 system produces the same output as the V100 and CPUs.
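
For reference, the change is a single call near the top of the script, right after importing TensorFlow (roughly as follows):

import tensorflow as tf

# Disable TensorFloat-32 execution so that float32 matmuls / convolutions on
# Ampere-and-newer GPUs (A100, H100/H200, ...) run in full float32 precision.
tf.config.experimental.enable_tensor_float_32_execution(False)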

I find it quite concerning that TensorFlow would take such liberties with data types.

In any case, the main question I have at this point is why I don't see the same numerical issues with multi-backend Keras. Is it actually using float32 rather than TF32? Which Keras implementation is doing the right thing?
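
A quick way to check, under either Keras import, is to query the TensorFlow runtime directly, roughly like this:

import tensorflow as tf

# Report whether the TensorFlow runtime will use TF32 for float32 matmuls / convolutions,
# regardless of which Keras front end (tf_keras or keras 3) is imported on top of it.
print("TF32 enabled:", tf.config.experimental.tensor_float_32_execution_enabled())

That should at least confirm whether TF32 is active in a given session.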