[Bug]: Optimization of Unet fails - AMD RDNA3.5 Strix Point Processor
woonyee28 opened this issue · 2 comments
Describe the bug
For context, I am using an AMD RDNA3.5 (Strix Point) processor.
Under Olive\examples\stable_diffusion, I ran the following command:
python stable_diffusion.py --model_id stabilityai/stable-diffusion-2-1 --optimize --clean_cache
I encountered an error while optimizing the unet.
The same error occurred when I ran python stable_diffusion_xl.py --model_id stabilityai/sdxl-turbo --optimize --clean_cache
under Olive\examples\directml\stable_diffusion_xl
The error is: onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
I have scrolled through the issue list, and multiple approaches have been suggested. For example:
- Downgrading Python to 3.10 (I was previously on Python 3.12). (Suggested in issue #1023)
- Setting "save_as_external_data": true (Suggested in the same issue: #1023)
- Setting --tempdir . (Suggested in the same issue: #1023)
None of these worked.
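For reference, this is roughly how I applied the save_as_external_data workaround in the unet pass config (a sketch only; the exact file and pass names in the example's config may differ):

```json
{
  "passes": {
    "optimize": {
      "type": "OrtTransformersOptimization",
      "config": {
        "save_as_external_data": true
      }
    }
  }
}
```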
I found the exact same problem faced by this issue: #517
Other information
- OS: Windows
- Olive version: 0.6.0 (git clone main branch on 21 May Singapore Time)
- ONNXRuntime package and version: onnxruntime-directml 1.18.0
Full error log:
Optimizing unet
[2024-05-22 15:44:40,729] [INFO] [run.py:279:run] Loading Olive module configuration from: C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\olive_config.json
[2024-05-22 15:44:40,734] [DEBUG] [olive_evaluator.py:1153:validate_metrics] No priority is specified, but only one sub type metric is specified. Use rank 1 for single for this metric.
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OnnxConversion
[2024-05-22 15:44:40,734] [DEBUG] [run.py:173:run_engine] Registering pass OrtTransformersOptimization
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:130:_fill_accelerators] The accelerator device and execution providers are specified, skipping deduce.
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:169:_check_execution_providers] Supported execution providers for device gpu: ['DmlExecutionProvider', 'CPUExecutionProvider']
[2024-05-22 15:44:40,734] [DEBUG] [accelerator_creator.py:199:create_accelerators] Initial accelerators and execution providers: {'gpu': ['DmlExecutionProvider']}
[2024-05-22 15:44:40,734] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OnnxConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OpenVINOConversion already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [DEBUG] [run.py:229:run_engine] Pass OrtTransformersOptimization already registered
[2024-05-22 15:44:40,734] [INFO] [engine.py:107:initialize] Using cache directory: cache
[2024-05-22 15:44:40,734] [INFO] [engine.py:263:run] Running Olive on accelerator: gpu-dml
[2024-05-22 15:44:40,734] [INFO] [engine.py:1075:_create_system] Creating target system ...
[2024-05-22 15:44:40,734] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1078:_create_system] Target system created in 0.007994 seconds
[2024-05-22 15:44:40,742] [INFO] [engine.py:1087:_create_system] Creating host system ...
[2024-05-22 15:44:40,742] [DEBUG] [engine.py:1071:create_system] create native OliveSystem SystemType.Local
[2024-05-22 15:44:40,742] [INFO] [engine.py:1090:_create_system] Host system created in 0.000000 seconds
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:709:_cache_model] Cached model 9c464b7b to cache\models\9c464b7b.json
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:336:run_accelerator] Running Olive in no-search mode ...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:428:run_no_search] Running ['convert', 'optimize'] with no search ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass convert:OnnxConversion
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 6_OnnxConversion-9c464b7b-89c11e05 from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:865:_run_pass] Running pass optimize:OrtTransformersOptimization
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:884:_run_pass] Loading model from cache ...
[2024-05-22 15:44:40,764] [INFO] [engine.py:899:_run_pass] Loaded model from cache: 12_OrtTransformersOptimization-6-b768c232-gpu-dml from cache\runs
[2024-05-22 15:44:40,764] [INFO] [engine.py:843:_run_passes] Run model evaluation for the final model...
[2024-05-22 15:44:40,764] [DEBUG] [engine.py:1016:_evaluate_model] Evaluating model ...
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,764] [DEBUG] [resource_path.py:156:create_resource_path] Resource path C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\cache\models\12_OrtTransformersOptimization-6-b768c232-gpu-dml\output_model is inferred to be of type folder.
[2024-05-22 15:44:40,779] [DEBUG] [olive_evaluator.py:238:generate_metric_user_config_with_model_io] Model input shapes are not static. Cannot use inferred input shapes for creating dummy data. This will cause an error when creating dummy data for tuning.
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:72:get_ort_inference_session] inference_settings: {'execution_provider': ['DmlExecutionProvider'], 'provider_options': None}
[2024-05-22 15:44:40,779] [DEBUG] [ort_inference.py:111:get_ort_inference_session] Normalized providers: ['DmlExecutionProvider'], provider_options: [{}]
[2024-05-22 15:44:57,498] [WARNING] [engine.py:358:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 337, in run_accelerator
output_footprint = self.run_no_search(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 429, in run_no_search
should_prune, signal, model_ids = self._run_passes(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 844, in _run_passes
signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\engine\engine.py", line 1042, in _evaluate_model
signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\systems\local.py", line 47, in evaluate_model
return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 205, in evaluate
metrics_res[metric.name] = self._evaluate_latency(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 123, in _evaluate_latency
latencies = self._evaluate_raw_latency(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 762, in _evaluate_raw_latency
return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\evaluator\olive_evaluator.py", line 543, in _evaluate_onnx_latency
latencies = session.time_run(
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\olive\common\ort_inference.py", line 334, in time_run
self.session.run(input_feed=input_feed, output_names=None)
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFA29BD2C5E: (caller: 00007FFA29BB9864) Exception(1) tid(35c4) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
[2024-05-22 15:44:58,009] [INFO] [engine.py:280:run] Run history for gpu-dml:
[2024-05-22 15:44:58,009] [INFO] [engine.py:570:dump_run_history] Please install tabulate for better run history output
[2024-05-22 15:44:58,009] [INFO] [engine.py:295:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 433, in <module>
main()
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 370, in main
optimize(common_args.model_id, common_args.provider, unoptimized_model_dir, optimized_model_dir)
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\stable_diffusion.py", line 253, in optimize
save_optimized_onnx_submodel(submodel_name, provider, model_info)
File "C:\Users\wy-te\OneDrive\Desktop\Projects\Olive\examples\stable_diffusion\sd_utils\ort.py", line 59, in save_optimized_onnx_submodel
with footprints_file_path.open("r") as footprint_file:
File "C:\Users\wy-te\AppData\Local\Programs\Python\Python310\lib\pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\wy-te\\OneDrive\\Desktop\\Projects\\Olive\\examples\\stable_diffusion\\footprints\\unet_gpu-dml_footprints.json'
Could this error be related to DXGI_ERROR_DEVICE_HUNG?
@jstoecker @guotuofeng Would love to hear your insights, thanks!
@PatriceVignola, do you have any idea?
Set the registry value TdrLevel = 0.
SDXL needs a paging file of around 150 GB,
and the optimization may trigger a TDR timeout event.
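If this really is a TDR timeout (consistent with DXGI_ERROR_DEVICE_HUNG), the TdrLevel suggestion above corresponds to Microsoft's documented GraphicsDrivers registry key. A minimal .reg sketch, assuming you want to disable TDR entirely (requires admin rights and a reboot; note that with TDR off, a genuinely hung GPU can freeze the desktop):

```ini
Windows Registry Editor Version 5.00

; TdrLevel = 0 (TdrLevelOff) disables GPU Timeout Detection and Recovery
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000
```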