Sample quantize_inception_v3 runs into a segmentation fault

Question

YanfeiXu opened this issue a year ago · 4 comments

OS: Ubuntu 23.04 (Docker container)
Hardware: Xeon Ice Lake
Components installed: oneAPI Base Toolkit, Python 3.9, pip, conda
Problem: Following the steps of the quantize_inception_v3 sample, I ran into a segmentation fault in Jupyter. I then converted quantize_inception_v3.ipynb into a .py file and ran it with ipython, and saw the same segmentation fault.

(env_itex) (base) root@610430605d50:/intel-extension-for-tensorflow/examples/quantize_inception_v3# pip list |grep tensor
intel-extension-for-tensorflow     2.13.0.0
intel-extension-for-tensorflow-lib 2.13.0.0.0
tensorboard                        2.13.0
tensorboard-data-server            0.7.1
tensorflow                         2.13.0
tensorflow-estimator               2.13.0
tensorflow-io-gcs-filesystem       0.33.0

ipython quantize_inception_v3.py
........
23/23 [==============================] - 2s 79ms/step - loss: 0.3944 - accuracy: 0.8638
INFO:tensorflow:Assets written to: model_keras.fp32/assets
2023-08-09 12:24:52,637 - tensorflow - INFO - Assets written to: model_keras.fp32/assets
Save model to model_keras.fp32
version: 1.0

model:
  name: inception_v3
  framework: tensorflow_itex                         # possible values are tensorflow, mxnet and pytorch

evaluation:
  accuracy:
    metric:
      topk: 1                               # built-in metrics are topk, map, f1, allow user to register new metric.

tuning:
  accuracy_criterion:
    relative: 0.01                             # the tuning target of accuracy loss percentage: 2%
  exit_policy:
    timeout: 0                                   # tuning timeout (seconds)
  random_seed: 100                               # random seed
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Using 734 files for validation.
2023-08-09 12:24:53.768187: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type CPU is enabled.
2023-08-09 12:24:53.777478: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type CPU is enabled.
2023-08-09 12:25:07 [WARNING] Output tensor names should not be empty.
2023-08-09 12:25:07 [WARNING] Input tensor names is empty.
INFO:tensorflow:Assets written to: /tmp/tmps4ia3yd9/assets
2023-08-09 12:25:25,183 - tensorflow - INFO - Assets written to: /tmp/tmps4ia3yd9/assets
2023-08-09 12:25:31.852167: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-09 12:25:31.852320: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2023-08-09 12:25:34.215201: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-09 12:25:34.215403: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2023-08-09 12:25:36.411005: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-09 12:25:36.411114: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2023-08-09 12:25:56 [INFO] ConvertLayoutOptimizer elapsed time: 0.41 ms
2023-08-09 12:25:56 [INFO] Pass ConvertPlaceholderToConst elapsed time: 33.94 ms
2023-08-09 12:25:56 [INFO] Pass SwitchOptimizer elapsed time: 32.33 ms
Segmentation fault (core dumped)
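
The log above ends at the crash without a Python traceback. A minimal sketch of one way to capture one, using the standard-library faulthandler module, is shown here. The helper name and the log file name are hypothetical, and this is only a diagnostic aid, not a fix for the crash.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_trace.log"):
    """Ask Python to dump the stack of every thread to log_path on a fatal signal.

    The helper name and default file name are hypothetical. The returned file
    object must stay open while the workload runs, because faulthandler writes
    to its file descriptor at crash time.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage sketch: call this at the top of the converted script, before any
# TensorFlow or neural_compressor code runs:
#
#     log_file = enable_crash_traceback()
#
# After a crash, segfault_trace.log should show which Python frame was active.
```

For a plain Python run, `python -X faulthandler quantize_inception_v3.py` enables the same handler and writes to stderr, assuming the converted script contains no IPython-specific calls.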

Answer 1 · 2023-08-10T00:40:26.000Z

@YanfeiXu could you try this on Ubuntu 22 to check whether the issue comes from the OS side or the ITEX side? Thanks.

Answer 2 · 2023-08-13T02:13:45.000Z

Hi, I checked. The segmentation fault is also easy to reproduce on Ubuntu 22.

evaluation:
  accuracy:
    metric:
      topk: 1                               # built-in metrics are topk, map, f1, allow user to register new metric.

tuning:
  accuracy_criterion:
    relative: 0.01                             # the tuning target of accuracy loss percentage: 2%
  exit_policy:
    timeout: 0                                   # tuning timeout (seconds)
  random_seed: 100                               # random seed
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Using 734 files for validation.
2023-08-13 02:06:37.926874: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type CPU is enabled.
2023-08-13 02:06:37.936141: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type CPU is enabled.
2023-08-13 02:06:50 [WARNING] Output tensor names should not be empty.
2023-08-13 02:06:50 [WARNING] Input tensor names is empty.
INFO:tensorflow:Assets written to: /tmp/tmp9rt961oq/assets
2023-08-13 02:07:08,099 - tensorflow - INFO - Assets written to: /tmp/tmp9rt961oq/assets
2023-08-13 02:07:14.739487: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-13 02:07:14.739638: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2023-08-13 02:07:16.971706: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-13 02:07:16.971927: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2023-08-13 02:07:19.531589: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-13 02:07:19.531719: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2023-08-13 02:07:39 [INFO] ConvertLayoutOptimizer elapsed time: 0.41 ms
2023-08-13 02:07:39 [INFO] Pass ConvertPlaceholderToConst elapsed time: 34.35 ms
2023-08-13 02:07:39 [INFO] Pass SwitchOptimizer elapsed time: 30.84 ms
Segmentation fault (core dumped)
(itex_build) root@98026ad8adbc:/intel-extension-for-tensorflow/examples/quantize_inception_v3# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
(itex_build) root@98026ad8adbc:/intel-extension-for-tensorflow/examples/quantize_inception_v3#

Answer 3 · 2023-08-28T02:33:51.000Z

@YanfeiXu The "Segmentation fault (core dumped)" issue cannot be reproduced with the latest packages on Ubuntu 22.04.

$ pip list | grep tensorflow
intel-extension-for-tensorflow     2.13.0.0
intel-extension-for-tensorflow-lib 2.13.0.0.0
tensorflow                         2.13.0
tensorflow-estimator               2.13.0
tensorflow-io-gcs-filesystem       0.33.0

$ pip list | grep neural
neural-compressor                  2.2.1

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
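
When comparing environments like the two in this thread, it can help to collect the package versions from inside the same interpreter that runs the sample. The following is a minimal sketch using the standard-library importlib.metadata; the function name is hypothetical and the package list is simply the set of names that appear in the pip output above.

```python
from importlib import metadata

# Package names taken from the `pip list` output in this thread.
PACKAGES = [
    "tensorflow",
    "intel-extension-for-tensorflow",
    "intel-extension-for-tensorflow-lib",
    "neural-compressor",
]


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:36} {version or 'not installed'}")
```

Running this in both the failing and the working environment gives two lists that can be compared directly, including the neural-compressor version.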

This example still fails with an AssertionError raised by neural_compressor, because it is based on the old neural_compressor APIs (v1.x), which are not applicable to the latest neural_compressor (v2.x). @NeoZhangJianyu will update this example in our upcoming 2.14.0 release.
AssertionError:

  File "/home/zhefengq/WORKSPACE/intel-extension-for-tensorflow-master/examples/quantize_inception_v3/env_itex/lib/python3.9/site-packages/neural_compressor/metric/metric.py", line 924, in update
    preds, labels = _topk_shape_validate(preds, labels)
  File "/home/zhefengq/WORKSPACE/intel-extension-for-tensorflow-master/examples/quantize_inception_v3/env_itex/lib/python3.9/site-packages/neural_compressor/metric/metric.py", line 458, in _topk_shape_validate
    assert label_N == N, 'labels batch size should same with preds'
AssertionError: labels batch size should same with preds
2023-08-28 10:04:48 [ERROR] Specified timeout or max trials is reached! Not found any quantized model which meet accuracy goal. Exit.
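
The assertion message says that the number of labels passed to the top-k metric did not match the number of predictions. The sketch below illustrates that kind of check in plain Python. It is not neural_compressor's implementation; the function name and the list-based inputs are assumptions made only for illustration.

```python
def top1_accuracy(preds, labels):
    """Compute top-1 accuracy after checking that batch sizes agree.

    preds:  list of per-sample score lists, one score per class
    labels: list of integer class ids, one per sample
    (Hypothetical helper, not part of neural_compressor.)
    """
    if len(preds) != len(labels):
        raise ValueError(
            f"batch size mismatch: {len(preds)} predictions vs {len(labels)} labels"
        )
    if not preds:
        return 0.0
    correct = 0
    for scores, label in zip(preds, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(preds)


if __name__ == "__main__":
    # Two samples, three classes: the first is classified correctly, the second is not.
    print(top1_accuracy([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]], [1, 2]))
```

A mismatch like the one in the traceback would surface here as a ValueError before any accuracy is computed, which makes it easier to see whether the evaluation data or the model output has the unexpected batch dimension.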

Answer 4 · 2023-08-29T08:55:28.000Z

@Dboyqiao Got it, thanks for the clarification.