tensorflow/io

S3 filesystem pure virtual method called; terminate called without an active exception

rivershah opened this issue · 12 comments

I am getting a core dump during interpreter teardown when using the S3 filesystem. Can I please be given guidance on how to handle this issue? Please see the script below to reproduce inside Docker:

FROM tensorflow/tensorflow:2.14.0-gpu

The following environment variables are set:

"AWS_ACCESS_KEY_ID": xxx,
"AWS_SECRET_ACCESS_KEY": xxx,
"AWS_ENDPOINT_URL_S3": xxx,
"AWS_REGION": "us-east-1",
"S3_USE_HTTPS": "1",
"S3_VERIFY_SSL": "1",
"S3_DISABLE_MULTI_PART_DOWNLOAD": "0",
"S3_ENDPOINT": xxx,
import os

import tensorflow as tf
import tensorflow_io as tfio

def illustrate_core_dump():
    print(f"tf version: {tf.__version__}")
    print(f"tfio version: {tfio.__version__}")
    filename = f"{os.environ['CLOUD_MOUNT']}/tmp/test_tfrecord.tfrecord"
    assert filename.startswith("s3://"), "problem appears to be for the s3 filesystem only"
    ds = tf.data.TFRecordDataset(filename, "GZIP")

    for i in ds:
        print(f"i.shape: {i.shape}")


if __name__ == "__main__":
    illustrate_core_dump()
    print("reaches here successfully")
    print("something broken during destruction and tf")

    # during interpreter teardown if s3 filesystem used we will get
    # pure virtual method called
    # terminate called without an active exception
    # Aborted (core dumped)

    # gs:// and file:// paths, which don't rely on tfio, do not exhibit this issue
TF_CPP_MIN_LOG_LEVEL=0 python notebooks/illustrate_core_dump.py 
2024-01-01 18:07:11.253238: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-01 18:07:11.253287: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-01 18:07:11.253323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-01 18:07:11.262384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf version: 2.14.0
tfio version: 0.35.0
2024-01-01 18:07:14.402239: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.413303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.416545: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.421598: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.423868: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:14.426098: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.494277: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.496519: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.498484: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-01 18:07:15.500342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13589 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
i.shape: ()
reaches here successfully
something broken during destruction and tf
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

tensorflow-io==0.34.0 # works
tensorflow-io==0.35.0 # crashing
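
Until the root cause is found, one possible stopgap (a sketch only, untested, and based on the assumption that the abort comes from C++ static destructors running during normal interpreter teardown rather than from the read itself) is to skip normal teardown at the very end of the job:

import os
import sys

if __name__ == "__main__":
    illustrate_core_dump()
    # Flush Python-level buffers ourselves, then terminate without running
    # atexit handlers or C++ static destructors; os._exit() skips the normal
    # teardown path in which the "pure virtual method called" abort is observed.
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(0)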

Can we please verify why the latest release is exhibiting this issue? Thank you.

I had the same issue and it was driving me insane. I have some unrelated custom C++ ops and wasted a day digging into those. I am using S3, and going back to 0.34.0 fixed it.

Facing the same issue, but for tensorflow==2.13 with tensorflow-io==0.34.0 (and with tensorflow-io==0.35.0). There is no straightforward root cause, and reverting to tensorflow-io==0.33.0 fixes it.

I've also faced the same error with tensorflow==2.14 and tensorflow-io==0.35.0, which is the only version that supports TF 2.14 per the compatibility chart in the README.md. But reverting to tensorflow-io==0.33.0 seems to fix it.

As an update, I followed the build instructions for tensorflow-io (Ubuntu 22.04 and then Python Wheels) and discovered that this particular "pure virtual method called" error does not occur when I use a locally built wheel for tensorflow-io.
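
A quick way to confirm which build is actually being imported (a minimal sketch):

import tensorflow_io as tfio

# The path should point at the locally built wheel's install location,
# not at a tensorflow-io package pulled from PyPI.
print(tfio.__version__)
print(tfio.__file__)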

Note: The link in the docker build instructions is broken - https://github.com/tensorflow/io/blob/master/docs/development.md#docker - and the latest image in tfsigio/tfio is about 2 years old.

@saimi Is there any chance you can please post the steps you took to build? I tried to build but was thwarted by the issues you mentioned.

@rivershah I pulled the ubuntu:22.04 image from Docker Hub:

docker run --name tfio_builder -itd ubuntu:22.04 bash
docker exec -it tfio_builder bash

and installed all the packages and Bazel as instructed in https://github.com/tensorflow/io/blob/master/docs/development.md#ubuntu-2204 (without the sudo):

apt-get -y -qq update
apt-get -y -qq install gcc g++ git unzip curl python3-pip python-is-python3 libntirpc-dev
curl -sSOL https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 /usr/local/bin/bazel
chmod +x /usr/local/bin/bazel

python3 --version  # made sure I had python version>=3.9
python3 -m pip install -U pip
git clone https://github.com/tensorflow/io
cd io/
git checkout v0.35.0
pip install "tensorflow==2.14.1"
./configure.sh
export TF_PYTHON_VERSION=3.10
bazel build -s --verbose_failures --copt="-Wno-error=array-parameter=" --copt="-I/usr/include/tirpc" //tensorflow_io/... //tensorflow_io_gcs_filesystem/...

I then followed the instructions at https://github.com/tensorflow/io/blob/master/docs/development.md#python-wheels:

python3 setup.py bdist_wheel --data bazel-bin

Then, within the same container, I was able to validate tf-io's S3 filesystem functionality by trying to checkpoint a model to S3.
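
A minimal sketch of that kind of validation (the bucket path below is a placeholder, not a real one):

import tensorflow as tf
import tensorflow_io as tfio  # noqa: F401 -- importing registers the s3:// filesystem plugin

# Placeholder bucket/prefix -- substitute a writable S3 location.
ckpt_prefix = "s3://my-bucket/tmp/tfio-smoke-test/ckpt"

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.save_weights(ckpt_prefix)             # writes checkpoint files through the tf-io S3 filesystem
print(tf.io.gfile.glob(ckpt_prefix + "*"))  # the checkpoint files should now be listed on S3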

I'll need to do some additional work to reproduce the failure I got when copying the generated tf-io wheel out into a different container, since I've terminated all of that setup now.

Bumping this issue. It needs looking at to ensure the build process is being handled correctly.

This problem persists in tensorflow-io==0.37.0. Please fix this; it is rendering S3-based IO unusable without resorting to old versions.

skye commented

@yongtang would you be able to help here? Sounds like this is a pretty serious issue, so it would be much appreciated!!

This is blocking us from upgrading the tensorstore version. A quick fix will be much appreciated!

+1, also running into this issue


@yongtang per the comment #1912 (comment) above, assuming my PR #2005 passes, can you please consider a minor release (0.37.1 maybe?) to address the S3 issues discussed above. Thanks!