microsoft/Olive

[Bug] Optimization step for unet fails after 'Protobuf parsing failed'

mhalhamdan opened this issue · 10 comments

Describe the bug
'Protobuf parsing failed' when running the optimization from the example at https://github.com/microsoft/Olive/tree/main/examples/directml/stable_diffusion_xl. The exact steps I took are listed in the reproduce section below.

The error is followed by:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\.\\.\\.\\Olive\\examples\\directml\\stable_diffusion_xl\\footprints\\unet_gpu-dml_footprints.json'

This may be a downstream consequence of the earlier parsing failure.
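For what it's worth, the FileNotFoundError looks like a symptom rather than a separate bug: the example script opens the footprints JSON unconditionally, and a run that fails in the optimization pass never writes one. A minimal sketch of a guard (the path here is illustrative, not the script's actual logic):

```python
import json
from pathlib import Path

# Hypothetical path matching the error message; adjust to your checkout.
footprints_file = Path("footprints/unet_gpu-dml_footprints.json")

if footprints_file.is_file():
    footprints = json.loads(footprints_file.read_text())
else:
    # The earlier INVALID_PROTOBUF failure means no footprint was written,
    # so report that instead of crashing with FileNotFoundError.
    footprints = None
    print(f"No footprints at {footprints_file}; the optimization pass likely failed.")
```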

To Reproduce
System specs: Windows 10, AMD Radeon 7900XT

# Install at the repository root
conda create -n olive-env python=3.11.7
conda activate olive-env
git clone https://github.com/microsoft/Olive.git
cd Olive
python -m pip install .

# Go to the specific example
cd examples/directml/stable_diffusion_xl
pip install -r requirements.txt
pip install -r requirements-common.txt

python stable_diffusion_xl.py --model_id=stabilityai/stable-diffusion-xl-base-1.0 --optimize

Expected behavior
For the optimization to finish without errors; the error I hit is not listed in the README's issues section: https://github.com/microsoft/Olive/tree/main/examples/directml/stable_diffusion_xl#issues

Olive config
From examples/directml/stable_diffusion_xl/config.py

vae_sample_size = 1024
unet_sample_size = 128
cross_attention_dim = 2048
time_ids_size = 6

Olive logs

[2024-02-02 13:11:41,425] [INFO] [engine.py:934:_run_pass] Pass optimize:OrtTransformersOptimization finished in 390.3980829715729 seconds
[2024-02-02 13:11:43,678] [WARNING] [engine.py:361:run_accelerator] Failed to run Olive on gpu-dml: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from C:\Users\malhamdan\thisonewillwork\Olive\examples\directml\stable_diffusion_xl\cache\models\5_OrtTransformersOptimization-4-6159264963b26d83715d2a73c3b9bf1d-gpu-dml\output_model\model.onnx failed:Protobuf parsing failed.
Traceback (most recent call last):
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\engine\engine.py", line 350, in run_accelerator
    return self.run_search(
           ^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\engine\engine.py", line 521, in run_search
    should_prune, signal, model_ids = self._run_passes(
                                      ^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\engine\engine.py", line 834, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\engine\engine.py", line 1024, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\systems\local.py", line 49, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\evaluator\olive_evaluator.py", line 215, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\evaluator\olive_evaluator.py", line 133, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\evaluator\olive_evaluator.py", line 797, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\evaluator\olive_evaluator.py", line 539, in _evaluate_onnx_latency
    session = model.prepare_session(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\model\handler\onnx.py", line 110, in prepare_session
    session = get_ort_inference_session(self.model_path, inference_settings, self.use_ort_extensions)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\olive\common\ort_inference.py", line 73, in get_ort_inference_session
    return ort.InferenceSession(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 472, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from C:\Users\malhamdan\thisonewillwork\Olive\examples\directml\stable_diffusion_xl\cache\models\5_OrtTransformersOptimization-4-6159264963b26d83715d2a73c3b9bf1d-gpu-dml\output_model\model.onnx failed:Protobuf parsing failed.
[2024-02-02 13:11:43,682] [INFO] [engine.py:284:run] Run history for gpu-dml:
[2024-02-02 13:11:43,684] [INFO] [engine.py:559:dump_run_history] run history:
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| model_id                                                                           | parent_model_id                                                                    | from_pass                   |   duration_sec | metrics   |
+====================================================================================+====================================================================================+=============================+================+===========+
| 77fd9520c41098b645a32fa4a0be48c6                                                   |                                                                                    |                             |                |           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 4_OnnxConversion-77fd9520c41098b645a32fa4a0be48c6-e451b14a9eea094ea8ccf94792609d0b | 77fd9520c41098b645a32fa4a0be48c6                                                   | OnnxConversion              |        176.204 |           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 5_OrtTransformersOptimization-4-6159264963b26d83715d2a73c3b9bf1d-gpu-dml           | 4_OnnxConversion-77fd9520c41098b645a32fa4a0be48c6-e451b14a9eea094ea8ccf94792609d0b | OrtTransformersOptimization |        390.398 |           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
[2024-02-02 13:11:43,685] [INFO] [engine.py:298:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
  File "C:\Users\malhamdan\thisonewillwork\Olive\examples\directml\stable_diffusion_xl\stable_diffusion_xl.py", line 635, in <module>
    main()
  File "C:\Users\malhamdan\thisonewillwork\Olive\examples\directml\stable_diffusion_xl\stable_diffusion_xl.py", line 601, in main
    optimize(
  File "C:\Users\malhamdan\thisonewillwork\Olive\examples\directml\stable_diffusion_xl\stable_diffusion_xl.py", line 371, in optimize
    with footprints_file_path.open("r") as footprint_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\malhamdan\miniconda3\envs\thisonewillwork\Lib\pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\malhamdan\\thisonewillwork\\Olive\\examples\\directml\\stable_diffusion_xl\\footprints\\unet_gpu-dml_footprints.json'

Other information

  • OS: Windows 10
  • Olive version: main
  • ONNXRuntime package and version: onnx==1.15.0
    onnxruntime-directml==1.17.0

Additional context
I also tried passing --tempdir . but still got the same error log. I retried with multiple fresh conda environments and installs, deleted the cache, and cloned a fresh copy, but the problem persists.

What's your protobuf version?

BTW, could you please check whether the model file is valid? Can it be opened by https://netron.app/?

@guotuofeng

Protobuf version: protobuf==3.20.3

When I tried to load the same model file in Netron, it also failed with: Error loading ONNX model. File format is not onnx.ModelProto (Array buffer allocation failed).

Does this mean the file is corrupted?

The rest of the pip freeze from the conda environment:

accelerate==0.26.1
aiohttp==3.9.3
aiosignal==1.3.1
alembic==1.13.1
annotated-types==0.6.0
attrs==23.2.0
certifi==2023.11.17
charset-normalizer==3.3.2
colorama==0.4.6
coloredlogs==15.0.1
colorlog==6.8.2
datasets==2.16.1
diffusers==0.26.0
dill==0.3.7
filelock==3.13.1
flatbuffers==23.5.26
frozenlist==1.4.1
fsspec==2023.10.0
greenlet==3.0.3
huggingface-hub==0.20.3
humanfriendly==10.0
idna==3.6
importlib-metadata==7.0.1
invisible-watermark==0.2.0
Jinja2==3.1.3
lightning-utilities==0.10.1
Mako==1.3.2
MarkupSafe==2.1.4
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.2.1
numpy==1.26.3
olive-ai @ file:///C:/Users/malhamdan/thisonewillwork/Olive
onnx==1.15.0
onnxruntime-directml==1.17.0
opencv-python==4.9.0.80
optimum==1.16.2
optuna==3.5.0
packaging==23.2
pandas==2.2.0
pillow==10.2.0
protobuf==3.20.3
psutil==5.9.8
pyarrow==15.0.0
pyarrow-hotfix==0.6
pydantic==2.6.0
pydantic_core==2.16.1
pyreadline3==3.4.1
python-dateutil==2.8.2
pytz==2023.4
PyWavelets==1.5.0
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
safetensors==0.4.2
sentencepiece==0.1.99
six==1.16.0
SQLAlchemy==2.0.25
sympy==1.12
tabulate==0.9.0
tokenizers==0.15.1
torch==2.2.0
torchmetrics==1.3.0.post0
tqdm==4.66.1
transformers==4.37.2
typing_extensions==4.9.0
tzdata==2023.4
urllib3==2.2.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.17.0

I also tried downloading https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0 locally with huggingface-cli to rule out corruption, and modified stable_diffusion_xl.py to point to the local model, but I still get the same protobuf error.

Does this mean the file is corrupted?

If the model file cannot be loaded by the netron app, it means the model file is corrupted.

I downloaded SDXL multiple times using huggingface-cli, and also via the automatic download in the Olive Python code, to rule out prior corruption; it all leads to the same error. Does this mean Olive is not converting the model to a valid .onnx model correctly? What can be done?

Thank you.

It seems similar to microsoft/onnxruntime#10892. What's the model size?
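For context, the linked ONNX Runtime issue involves models that exceed protobuf's 2 GB single-message limit, which produces exactly this INVALID_PROTOBUF error when the file is saved without external data. A stdlib-only sketch for checking whether a cached model file crosses that limit (the function name and path handling are illustrative):

```python
import os

# protobuf refuses to (de)serialize a single message larger than 2 GiB,
# so an .onnx file near or above this size must store weights externally.
PROTOBUF_LIMIT_BYTES = 2 * 1024**3

def exceeds_protobuf_limit(path: str) -> bool:
    """Return True if the file is too large to be one protobuf message."""
    return os.path.getsize(path) > PROTOBUF_LIMIT_BYTES
```

If model.onnx in the cache directory is around 2 GB or larger, its weights would need to be saved as external data; a much smaller file that still fails to parse points instead at truncation during saving.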

I faced the same issue. How do I solve it?

What's your model size?

PR #1069 should fix this issue, which was specific to Python 3.11 when saving large models.

Please reopen the issue if the problem persists.