[Bug]: Optimization of Unet fails - AMD RDNA3.5 Strix Point Processor
woonyee28 opened this issue · 2 comments
Describe the bug
For context, I am using an AMD RDNA3.5 (Strix Point) processor.
Under Olive\examples\stable_diffusion, I ran the following command:
python stable_diffusion.py --model_id stabilityai/stable-diffusion-2-1 --optimize --clean_cache
I encountered an error while optimizing the unet.
The same error occurred when I ran python stable_diffusion_xl.py --model_id stabilityai/sdxl-turbo --optimize --clean_cache
under Olive\examples\directml\stable_diffusion_xl
The error is: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
I have scrolled through the issue list, and multiple approaches have been suggested. For example:
- Downgrading Python to 3.10 (I was previously on Python 3.12). (Suggested in issue #1023)
- Setting "save_as_external_data": true (Suggested in the same issue: #1023)
- Setting --tempdir . (Suggested in the same issue: #1023)
None of these worked.
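For reference, this is roughly how I applied the save_as_external_data workaround in the unet pass config (a sketch only; the exact file and pass names in the example's config may differ):

```json
{
  "passes": {
    "optimize": {
      "type": "OrtTransformersOptimization",
      "config": {
        "save_as_external_data": true
      }
    }
  }
}
```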
I found the exact same problem faced by this issue: #517
Other information
- OS: Windows
- Olive version: 0.6.0 (git clone main branch on 21 May Singapore Time)
- ONNXRuntime package and version: onnxruntime-directml 1.18.0
Full error log:
Optimizing unet
[2024-05-22 15:44:40,729] [INFO] [run.py:279:run] Loading Olive module configuration from: C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\olive_config.json
[2024-05-22 15:44:40,734] [DEBUG] [olive_evaluator.py:1153:validate_metrics] No priority is specified, but only one sub type metric is specified. Use rank 1 for single for this metric.
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OnnxConversion
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OrtTransformersOptimization
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:130:_fill_accelerators] The accelerator device and execution providers are specified, skipping deduce.
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:169:_check_execution_providers] Supported execution providers for device gpu: ['DmlExecutionProvider', 'CPUExecutionProvider']
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:199:create_accelerators] Initial accelerators and execution providers: {'gpu': ['DmlExecutionProvider']}
[2024-05-22 15:44:40,734] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OnnxConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OpenVINOConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [INFO] [engine.py:107:initialize] Using cache directory: cache
[2024-05-22 15:44:40,734] [INFO] [engine.py:263:run] Running Olive on accelerator: gpu-dml
[2024-05-22 15:44:40,734] [INFO] [engine.py:1075:_create_system] Creating target system ...
[2024-05-22 15:44:40,734] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1078:_create_system] Target system created in 0.007994 seconds
[2024-05-22 15:44:40,742] [INFO] [engine.py:1087:_create_system] Creating host system ...
[2024-05-22 15:44:40,742] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1090:_create_system] Host system created in 0.000000 seconds
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:709:_cache_model] Cached model 9c464b7b to cache\models\9c464b7b.json
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:336:run_accelerator] Running Olive in no-search mode ...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:428:run_no_search] Running ['convert', 'optimize'] with no search ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass convert:OnnxConversion
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 6_OnnxConversion-9c464b7b-89c11e05 from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 12_OrtTransformersOptimization-6-b768c232-gpu-dml from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:843:_run_passes] Run model evaluation for the final model...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:1016:_evaluate_model] Evaluating model ...
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,779] [DEBUG] [olive_evaluator.py:238:generate_metric_user_config_with_model_io] Model input shapes are not static. Cannot use inferred input shapes for creating dummy data. This will cause an error when creating dummy data for tuning.
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:72:get_ort_inference_session] inference_settings: {'execution_provider': ['DmlExecutionProvider'], 'provider_options': None}
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:111:get_ort_inference_session] Normalized providers: ['DmlExecutionProvider'], provider_options: [{}]
[2024-05-22 15:44:57,498] [WARNING] [engine.py:358:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 337, in run_accelerator
output_footprint = self.run_no_search(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 429, in run_no_search
should_prune, signal, model_ids = self._run_passes(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 844, in _run_passes
signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 1042, in _evaluate_model
signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\systems\local.py", line 47, in evaluate_model
return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 205, in evaluate
metrics_res[metric.name] = self._evaluate_latency(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 123, in _evaluate_latency
latencies = self._evaluate_raw_latency(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 762, in _evaluate_raw_latency
return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 543, in _evaluate_onnx_latency
latencies = session.time_run(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\common\ort_inference.py", line 334, in time_run
self.session.run(input_feed=input_feed, output_names=None)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
[2024-05-22 15:44:58,009] [INFO] [engine.py:280:run] Run history for gpu-dml:
[2024-05-22 15:44:58,009] [INFO] [engine.py:570:dump_run_history] Please install tabulate for better run history output
[2024-05-22 15:44:58,009] [INFO] [engine.py:295:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 433, in <module>
main()
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 370, in main
optimize(common_args.model_id, common_args.provider, unoptimized_model_dir, optimized_model_dir)
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 253, in optimize
save_optimized_onnx_submodel(submodel_name, provider, model_info)
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\sd_utils\ort.py", line 59, in save_optimized_onnx_submodel
with footprints_file_path.open("r") as footprint_file:
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\wy-te\\OneDrive\\Desktop\\Projects\\Olive\\examples\\stable_diffusion\\footprints\\unet_gpu-dml_footprints.json'
Could this error be related to DXGI_ERROR_DEVICE_HUNG?
@jstoecker @guotuofeng Would love to hear your insights, thanks!
@PatriceVignola, do you have any idea?
Set the registry value TdrLevel = 0.
SDXL needs a paging file of around 150 GB,
and the optimization may trigger a TDR timeout event.
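If this really is a TDR timeout (consistent with DXGI_ERROR_DEVICE_HUNG), the TdrLevel suggestion above corresponds to Microsoft's documented GraphicsDrivers registry key. A minimal .reg sketch, assuming you want to disable TDR entirely (requires admin rights and a reboot; note that with TDR off, a genuinely hung GPU can freeze the desktop):

```ini
Windows Registry Editor Version 5.00

; TdrLevel = 0 (TdrLevelOff) disables GPU Timeout Detection and Recovery
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000
```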