Failed to run symbolic shape inference when doing LLM Optimization with DirectML
jojo1899 opened this issue · 11 comments
Describe the bug
I am trying to run the code in LLM Optimization with DirectML. The requirements.txt file says `onnxruntime-directml>=1.17.4`. Is there a typo in that? The latest version seems to be onnxruntime-directml 1.17.3. Executing `pip install -r requirements.txt` results in the following error.
ERROR: Could not find a version that satisfies the requirement onnxruntime-directml>=1.17.4 (from versions: 1.9.0, 1.10.0, 1.11.0, 1.11.1, 1.12.0, 1.12.1, 1.13.1, 1.14.0, 1.14.1, 1.15.0, 1.15.1, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.17.0, 1.17.1, 1.17.3)
ERROR: No matching distribution found for onnxruntime-directml>=1.17.4
I continued running the code with onnxruntime-directml 1.17.3. However, LLM Optimization with DirectML does not run as expected when executing `python llm.py --model_type=mistral-7b-chat`: it fails to run symbolic shape inference and then fails to run Olive on gpu-dml. The traceback is pasted in the Olive logs below.
To Reproduce
python llm.py --model_type=mistral-7b-chat
Expected behavior
Expected the code to run without any errors
Olive config
Add Olive configurations here.
Olive logs
>python llm.py --model_type=mistral-7b-chat
Optimizing mistralai/Mistral-7B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.40s/it]
[2024-04-18 17:21:47,163] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 17:21:47,322] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 17:21:47,322] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 17:21:47,322] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 17:21:47,343] [INFO] [engine.py:864:_run_pass] Running pass convert:OnnxConversion
[2024-04-18 17:28:25,784] [INFO] [engine.py:951:_run_pass] Pass convert:OnnxConversion finished in 398.437406 seconds
[2024-04-18 17:28:25,784] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
failed in shape inference <class 'AssertionError'>
Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
failed in shape inference <class 'AssertionError'>
[2024-04-18 17:44:22,031] [INFO] [transformer_optimization.py:420:_replace_mha_with_gqa] Replaced 32 MultiHeadAttention nodes with GroupQueryAttention
[2024-04-18 17:44:34,625] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 968.824505 seconds
[2024-04-18 17:44:34,647] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-18 17:44:35,300] [WARNING] [engine.py:357:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
File "C:\Olive\olive\engine\engine.py", line 346, in run_accelerator
output_footprint = self.run_search(
File "C:\Olive\olive\engine\engine.py", line 531, in run_search
should_prune, signal, model_ids = self._run_passes(
File "C:\Olive\olive\engine\engine.py", line 843, in _run_passes
signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
File "C:\Olive\olive\engine\engine.py", line 1041, in _evaluate_model
signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
File "C:\Olive\olive\systems\local.py", line 46, in evaluate_model
return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
File "C:\Olive\olive\evaluator\olive_evaluator.py", line 214, in evaluate
metrics_res[metric.name] = self._evaluate_latency(
File "C:\Olive\olive\evaluator\olive_evaluator.py", line 132, in _evaluate_latency
latencies = self._evaluate_raw_latency(
File "C:\Olive\olive\evaluator\olive_evaluator.py", line 767, in _evaluate_raw_latency
return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
File "C:\Olive\olive\evaluator\olive_evaluator.py", line 540, in _evaluate_onnx_latency
session, inference_settings = OnnxEvaluator.get_session_wrapper(
File "C:\Olive\olive\evaluator\olive_evaluator.py", line 435, in get_session_wrapper
session = model.prepare_session(
File "C:\Olive\olive\model\handler\onnx.py", line 114, in prepare_session
return get_ort_inference_session(
File "C:\Olive\olive\common\ort_inference.py", line 118, in get_ort_inference_session
session = ort.InferenceSession(
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Type Error: Type 'tensor(float)' of input parameter (InsertedPrecisionFreeCast_/model/layers.0/self_attn/rotary_embedding/Add_output_0) of operator (GroupQueryAttention) in node (GroupQueryAttention_0) is invalid.
[2024-04-18 17:44:35,380] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 17:44:35,459] [INFO] [engine.py:567:dump_run_history] run history:
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+====================================================================================+====================================================================================+=============================+================+===========+
| ce39a7112b2825df5404fbb628c489ab | | | | |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-dfaff1da61d127bb5e9dc2f31a708897 | ce39a7112b2825df5404fbb628c489ab | OnnxConversion | 398.437 | |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml | 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-dfaff1da61d127bb5e9dc2f31a708897 | OrtTransformersOptimization | 968.825 | |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
[2024-04-18 17:44:35,459] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
File "C:\Olive\examples\directml\llm\llm.py", line 390, in <module>
main()
File "C:\Olive\examples\directml\llm\llm.py", line 350, in main
optimize(
File "C:\Olive\examples\directml\llm\llm.py", line 237, in optimize
with footprints_file_path.open("r") as footprint_file:
File "C:\Anaconda\envs\myolive\lib\pathlib.py", line 1252, in open
return io.open(self, mode, buffering, encoding, errors, newline,
File "C:\Anaconda\envs\myolive\lib\pathlib.py", line 1120, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Olive\\examples\\directml\\llm\\footprints\\mistralai_Mistral-7B-Instruct-v0.1_gpu-dml_footprints.json'
Other information
- OS: Windows 11
- Olive version: olive-ai 0.6.0
- ONNXRuntime package and version: onnxruntime-gpu 1.17.1
Additional context
Add any other context about the problem here.
Hi @jojo1899,
This sample requires a future version of onnxruntime-directml (tentatively named 1.17.4, as you've seen in the requirements) to run. This new version should be out very soon; at the very least, you should be able to use a nightly build shortly to run this sample.
@PatriceVignola Thanks for the information.
I tried executing the code again, twice, with different nightly builds: ort-nightly-directml 1.18.0.dev20240117005 (January build) and ort-nightly-directml 1.18.0.dev20240417007 (April build). I get the same error as with onnxruntime-directml 1.17.3. Is that strange or expected?
@jojo1899 Yes, this is expected. You can keep an eye on the following two PRs, which are required to run this sample:
microsoft/onnxruntime#20308
microsoft/onnxruntime#20327
Once they are merged in (which will 100% be today), it will take one or two days for the changes to make it into a nightly build. I expect the next nightly build to have them. I will update the requirements once that build has been generated.
Hi @jojo1899, we just updated the LLM sample to add the correct version of onnxruntime DirectML to use. You can simply run:
pip install ort-nightly-directml==1.18.0.dev20240419003 --extra-index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
Note that when converting Mistral, you will still see the `failed in shape inference <class 'AssertionError'>` error, but it is a false positive (there's a full explanation in the README). The optimization process should still complete successfully unless you run out of memory.
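As a quick sanity check after installing the nightly package (a minimal snippet, not part of the sample itself), you can confirm the installed version and that the DirectML execution provider is registered:

```python
import onnxruntime as ort

# Expect the nightly version string, e.g. 1.18.0.devYYYYMMDDNNN
print(ort.__version__)

# 'DmlExecutionProvider' should appear in this list for the DirectML sample to work
print(ort.get_available_providers())
```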
I tried running the code again. I get the following error when quantizing the model using AWQ:
2024-04-22 11:11:21 [INFO] Quantize the model with default config.
Progress: [ ] 0.78%Running model for sample 0
Running model for sample 1
2024-04-22 11:11:55 [ERROR] Unexpected exception Fail('[ONNXRuntimeError] : 1 : FAIL : C:\\a\\_work\\1\\s\\onnxruntime\\core\\providers\\dml\\DmlExecutionProvider\\src\\DmlCommandRecorder.cpp(371)\\onnxruntime_pybind11_state.pyd!00007FFE81F92BFE: (caller: 00007FFE81F79804) Exception(1) tid(3004c) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.\r\n') happened during tuning.
The following are some details:
- GPU: NVIDIA RTX 4070 Ti Super
- Installed PyTorch using: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
I haven't tried running the model without quantizing it, but I will do that in a while and give an update.
I have a question about the following warning in the log:
[WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
Isn't DirectML EP supposed to work with GPUs? Why does it require an NPU?
Here is the log
C:\Olive\examples\directml\llm>python llm.py --model_type=mistral-7b-chat --quant_strategy=awq
Optimizing mistralai/Mistral-7B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.20s/it]
[2024-04-22 10:47:03,568] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Anaconda\envs\myolive\lib\site-packages\olive\olive_config.json
[2024-04-22 10:47:03,727] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-22 10:47:03,727] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-22 10:47:03,727] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-22 10:47:06,491] [INFO] [engine.py:864:_run_pass] Running pass convert:OnnxConversion
[2024-04-22 10:53:13,805] [INFO] [engine.py:951:_run_pass] Pass convert:OnnxConversion finished in 367.314774 seconds
[2024-04-22 10:53:13,817] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
failed in shape inference <class 'AssertionError'>
Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
failed in shape inference <class 'AssertionError'>
[2024-04-22 11:06:46,649] [INFO] [transformer_optimization.py:420:_replace_mha_with_gqa] Replaced 32 MultiHeadAttention nodes with GroupQueryAttention
[2024-04-22 11:06:58,912] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 825.090703 seconds
[2024-04-22 11:06:58,928] [INFO] [engine.py:864:_run_pass] Running pass quantize:IncStaticQuantization
[2024-04-22 11:07:02,840] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-22 11:11:08 [INFO] Start auto tuning.
2024-04-22 11:11:08 [INFO] Quantize model without tuning!
2024-04-22 11:11:08 [INFO] Quantize the model with default configuration without evaluating the model. To perform the tuning process, please either provide an eval_func or provide an eval_dataloader an eval_metric.
2024-04-22 11:11:08 [INFO] Adaptor has 5 recipes.
2024-04-22 11:11:08 [INFO] 0 recipes specified by user.
2024-04-22 11:11:08 [INFO] 3 recipes require future tuning.
2024-04-22 11:11:08 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-04-22 11:11:08 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
2024-04-22 11:11:08 [INFO] {
Traceback (most recent call last):
File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-22 11:11:08 [INFO] 'PostTrainingQuantConfig': {
2024-04-22 11:11:08 [INFO] 'AccuracyCriterion': {
2024-04-22 11:11:08 [INFO] 'criterion': 'relative',
2024-04-22 11:11:08 [INFO] 'higher_is_better': True,
2024-04-22 11:11:08 [INFO] 'tolerable_loss': 0.01,
2024-04-22 11:11:08 [INFO] 'absolute': None,
2024-04-22 11:11:08 [INFO] 'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000002D685301C40>>,
2024-04-22 11:11:08 [INFO] 'relative': 0.01
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'approach': 'post_training_weight_only',
2024-04-22 11:11:08 [INFO] 'backend': 'onnxrt_dml_ep',
2024-04-22 11:11:08 [INFO] 'calibration_sampling_size': [
2024-04-22 11:11:08 [INFO] 8
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'device': 'gpu',
2024-04-22 11:11:08 [INFO] 'diagnosis': False,
2024-04-22 11:11:08 [INFO] 'domain': 'auto',
2024-04-22 11:11:08 [INFO] 'example_inputs': 'Not printed here due to large size tensors...',
2024-04-22 11:11:08 [INFO] 'excluded_precisions': [
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'framework': 'onnxruntime',
2024-04-22 11:11:08 [INFO] 'inputs': [
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'model_name': '',
2024-04-22 11:11:08 [INFO] 'ni_workload_name': 'quantization',
2024-04-22 11:11:08 [INFO] 'op_name_dict': None,
2024-04-22 11:11:08 [INFO] 'op_type_dict': {
2024-04-22 11:11:08 [INFO] '.*': {
2024-04-22 11:11:08 [INFO] 'weight': {
2024-04-22 11:11:08 [INFO] 'bits': [
2024-04-22 11:11:08 [INFO] 4
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'group_size': [
2024-04-22 11:11:08 [INFO] 32
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'scheme': [
2024-04-22 11:11:08 [INFO] 'asym'
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'algorithm': [
2024-04-22 11:11:08 [INFO] 'AWQ'
2024-04-22 11:11:08 [INFO] ]
2024-04-22 11:11:08 [INFO] }
2024-04-22 11:11:08 [INFO] }
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'outputs': [
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'quant_format': 'QOperator',
2024-04-22 11:11:08 [INFO] 'quant_level': 'auto',
2024-04-22 11:11:08 [INFO] 'recipes': {
2024-04-22 11:11:08 [INFO] 'smooth_quant': False,
2024-04-22 11:11:08 [INFO] 'smooth_quant_args': {
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'layer_wise_quant': False,
2024-04-22 11:11:08 [INFO] 'layer_wise_quant_args': {
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'fast_bias_correction': False,
2024-04-22 11:11:08 [INFO] 'weight_correction': False,
2024-04-22 11:11:08 [INFO] 'gemm_to_matmul': True,
2024-04-22 11:11:08 [INFO] 'graph_optimization_level': None,
2024-04-22 11:11:08 [INFO] 'first_conv_or_matmul_quantization': True,
2024-04-22 11:11:08 [INFO] 'last_conv_or_matmul_quantization': True,
2024-04-22 11:11:08 [INFO] 'pre_post_process_quantization': True,
2024-04-22 11:11:08 [INFO] 'add_qdq_pair_to_weight': False,
2024-04-22 11:11:08 [INFO] 'optypes_to_exclude_output_quant': [
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'dedicated_qdq_pair': False,
2024-04-22 11:11:08 [INFO] 'rtn_args': {
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'awq_args': {
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'gptq_args': {
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'teq_args': {
2024-04-22 11:11:08 [INFO] }
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'reduce_range': False,
2024-04-22 11:11:08 [INFO] 'TuningCriterion': {
2024-04-22 11:11:08 [INFO] 'max_trials': 100,
2024-04-22 11:11:08 [INFO] 'objective': [
2024-04-22 11:11:08 [INFO] 'performance'
2024-04-22 11:11:08 [INFO] ],
2024-04-22 11:11:08 [INFO] 'strategy': 'basic',
2024-04-22 11:11:08 [INFO] 'strategy_kwargs': None,
2024-04-22 11:11:08 [INFO] 'timeout': 0
2024-04-22 11:11:08 [INFO] },
2024-04-22 11:11:08 [INFO] 'use_bf16': True
2024-04-22 11:11:08 [INFO] }
2024-04-22 11:11:08 [INFO] }
2024-04-22 11:11:08 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-22 11:11:08 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-22 11:11:08 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
self.run()
File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
self.finished.wait(self.interval)
File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
signaled = self._cond.wait(timeout)
File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
gotit = waiter.acquire(True, timeout)
OverflowError: timeout value is too large
2024-04-22 11:11:21 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-22 11:11:21 [INFO] Quantize the model with default config.
Progress: [ ] 0.78%Running model for sample 0
Running model for sample 1
2024-04-22 11:11:55 [ERROR] Unexpected exception Fail('[ONNXRuntimeError] : 1 : FAIL : C:\\a\\_work\\1\\s\\onnxruntime\\core\\providers\\dml\\DmlExecutionProvider\\src\\DmlCommandRecorder.cpp(371)\\onnxruntime_pybind11_state.pyd!00007FFE81F92BFE: (caller: 00007FFE81F79804) Exception(1) tid(3004c) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.\r\n') happened during tuning.
Traceback (most recent call last):
File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
strategy.traverse()
File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
super().traverse()
File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\strategy\strategy.py", line 508, in traverse
q_model = self.adaptor.quantize(copy.deepcopy(tune_cfg), self.model, self.calib_dataloader, self.q_func)
File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\utils\utility.py", line 304, in fi
res = func(*args, **kwargs)
File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1965, in quantize
tmp_model = awq_quantize(
File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\adaptor\ox_utils\weight_only.py", line 844, in awq_quantize
output = session.run([input_name], inp)
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFE81F92BFE: (caller: 00007FFE81F79804) Exception(1) tid(3004c) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.
2024-04-22 11:11:55 [ERROR] Specified timeout or max trials is reached! Not found any quantized model which meet accuracy goal. Exit.
Traceback (most recent call last):
File "C:\Olive\examples\directml\llm\llm.py", line 391, in <module>
main()
File "C:\Olive\examples\directml\llm\llm.py", line 349, in main
optimize(
File "C:\Olive\examples\directml\llm\llm.py", line 231, in optimize
olive_run(olive_config)
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\workflows\run\run.py", line 283, in run
return run_engine(package_config, run_config, data_root)
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\workflows\run\run.py", line 237, in run_engine
engine.run(
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 264, in run
run_result = self.run_accelerator(
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 346, in run_accelerator
output_footprint = self.run_search(
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 531, in run_search
should_prune, signal, model_ids = self._run_passes(
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 826, in _run_passes
model_config, model_id = self._run_pass(
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 934, in _run_pass
output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\systems\local.py", line 31, in run_pass
output_model = the_pass.run(model, data_root, output_model_path, point)
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\passes\olive_pass.py", line 221, in run
output_model = self._run_for_config(model, data_root, config, output_model_path)
File "C:\Anaconda\envs\myolive\lib\site-packages\olive\passes\onnx\inc_quantization.py", line 588, in _run_for_config
if q_model.is_large_model:
AttributeError: 'NoneType' object has no attribute 'is_large_model'
I'm not sure what this warning is about (it comes from INC), but you definitely don't need an NPU for the quantization. I think it's likely that your device is running out of memory here, since 16 GB of VRAM is barely enough to run the fp16 model normally, and quantization is more demanding. We have only confirmed that the quantization works with RTX 4090 cards.
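For rough context on that memory claim (a back-of-envelope estimate, not a measured figure): the fp16 weights of Mistral-7B alone are around 13.5 GiB, consistent with the ~13.5 GB optimized ONNX model reported later in this thread, which leaves very little of a 16 GB card for the KV cache, activations, and the calibration buffers AWQ needs.

```python
# Back-of-envelope fp16 weight footprint for Mistral-7B (estimate only)
params = 7.24e9          # approximate parameter count of Mistral-7B-Instruct-v0.1
bytes_per_param = 2      # fp16
weights_gib = params * bytes_per_param / 1024**3
print(f"fp16 weights alone: ~{weights_gib:.1f} GiB")  # ~13.5 GiB, before KV cache and activations
```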
We are looking at different quantization options, since many of the AWQ implementations out there are hard to use on consumer hardware and generally require powerful server machines or GPUs to complete in a timely manner.
If all you're interested in is converting to 4 bit to test the performance of the model, you can play around with the script and change the quantization strategy to RTN here:
Olive/examples/directml/llm/llm.py, line 150 (commit 4e23c4c)
It's not something that we have tested, though, since RTN is generally bad for LLMs.
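For illustration only, a hypothetical sketch of that change: the INC weight-only settings printed in the log above (bits=4, group_size=32, scheme=asym, algorithm=AWQ) would switch the algorithm to RTN, which quantizes weights directly without running calibration samples on the GPU. The actual dictionary in llm.py around line 150 may be named and structured differently.

```python
# Hypothetical sketch; the real structure in llm.py (around line 150) may differ.
weight_only_config = {
    "bits": 4,
    "group_size": 32,
    "scheme": "asym",
    "algorithm": "RTN",  # was "AWQ"; RTN is data-free, so no calibration run on the GPU
}
```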
I was able to quantize Mistral-7B on the same hardware using examples/mistral/mistral_int4_optimize.json, but I could not run inference on the quantized model using the DML EP (see this issue for more details). I will try using that quantized model with examples/directml/llm/run_llm_io_binding.py for inference.
Regarding the code in LLM Optimization with DirectML: although I could not quantize using AWQ, I could convert Mistral successfully using the following.
python llm.py --model_type=mistral-7b-chat
The log shows that it also successfully carried out inference using the prompt "What is the lightest element?". However, when I try to run inference using python run_llm_io_binding.py --model_type=mistral-7b-chat --prompt="What is the lightest element?", it does not work most of the time.
Here is the successful log from the Mistral conversion to ONNX format:
C:\Olive\examples\directml\llm>python llm.py --model_type=mistral-7b-chat
Optimizing mistralai/Mistral-7B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.95s/it]
[2024-04-22 11:45:40,473] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Anaconda\envs\myolive\lib\site-packages\olive\olive_config.json
[2024-04-22 11:45:40,489] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-22 11:45:40,489] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-22 11:45:40,489] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-22 11:45:40,504] [INFO] [engine.py:864:_run_pass] Running pass convert:OnnxConversion
[2024-04-22 11:49:28,442] [INFO] [engine.py:951:_run_pass] Pass convert:OnnxConversion finished in 227.921767 seconds
[2024-04-22 11:49:28,457] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
failed in shape inference <class 'AssertionError'>
Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
failed in shape inference <class 'AssertionError'>
[2024-04-22 12:02:21,910] [INFO] [transformer_optimization.py:420:_replace_mha_with_gqa] Replaced 32 MultiHeadAttention nodes with GroupQueryAttention
[2024-04-22 12:02:32,629] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 784.171654 seconds
[2024-04-22 12:02:32,654] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-22 12:02:42,560] [INFO] [footprint.py:101:create_pareto_frontier] Output all 3 models
[2024-04-22 12:02:42,560] [INFO] [footprint.py:120:_create_pareto_frontier_from_nodes] pareto frontier points: 1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml
{
"latency-avg": 86.65992
}
[2024-04-22 12:02:42,560] [INFO] [engine.py:361:run_accelerator] Save footprint to footprints\mistralai_Mistral-7B-Instruct-v0.1_gpu-dml_footprints.json.
[2024-04-22 12:02:42,576] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-22 12:02:42,629] [INFO] [engine.py:567:dump_run_history] run history:
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+====================================================================================+====================================================================================+=============================+================+===========================+
| ce39a7112b2825df5404fbb628c489ab | | | | |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
| 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-46a1dd3a2459690b350e4070c8e2c14a | ce39a7112b2825df5404fbb628c489ab | OnnxConversion | 227.922 | |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
| 1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml | 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-46a1dd3a2459690b350e4070c8e2c14a | OrtTransformersOptimization | 784.172 | { |
| | | | | "latency-avg": 86.65992 |
| | | | | } |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
[2024-04-22 12:02:42,639] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Optimized Model : C:\Olive\examples\directml\llm\cache\models\1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml\output_model\model.onnx
Copying optimized model...
The optimized pipeline is located here: C:\Olive\examples\directml\llm\models\optimized\mistralai_Mistral-7B-Instruct-v0.1
The lightest element is hydrogen with an atomic number of 1 and atomic weight of approximately 1.008 g/mol.
Here are logs from eight inference attempts, of which only two worked.
**RUN 1 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="The world in 2099 is"
**RUN 2 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
2024-04-22 14:20:54.6846266 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31a68) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
Traceback (most recent call last):
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
run_llm_io_binding(
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
llm_session = onnxruntime.InferenceSession(
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31a68) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
**RUN 3 (WORKED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
The lightest element is hydrogen with an atomic number of 1 and atomic weight of approximately 1.008 g/mol.
**RUN 4 (WORKED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
The lightest element is hydrogen with an atomic number of 1 and atomic weight of approximately 1.008 g/mol.
**RUN 5 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="The world in 2099 is"
2024-04-22 14:28:01.8389389 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31bf0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
Traceback (most recent call last):
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
run_llm_io_binding(
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
llm_session = onnxruntime.InferenceSession(
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31bf0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
**RUN 6 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="How is the world in 2099?"
2024-04-22 14:28:20.0418613 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3c0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
Traceback (most recent call last):
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
run_llm_io_binding(
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
llm_session = onnxruntime.InferenceSession(
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3c0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
**RUN 7 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the heaviest element?"
**RUN 8 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
2024-04-22 14:39:18.5897298 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3e0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
Traceback (most recent call last):
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
run_llm_io_binding(
File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
llm_session = onnxruntime.InferenceSession(
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3e0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.
Any tips on what is happening here?
UPDATE: I started using onnxruntime-genai-directml 0.2.0rc3 and it finally worked!!
I tried so many different things that it is hard to summarize them. Anyway, here are a couple of things I tried:
- Converting, optimizing, and quantizing the mistralai/Mistral-7B-Instruct-v0.1 Hugging Face model with the DmlExecutionProvider using the code in LLM Optimization with DirectML. I was able to convert and optimize the model, but not quantize it. Check my error log from the above conversation.
- Quantizing the optimized ONNX model. The mistralai/Mistral-7B-Instruct-v0.1 Hugging Face model converted to ONNX format was in the cache/models/output_model/0_OnnxConversion-... directory and was 27 GB on disk. The optimized ONNX model was in the cache/models/output_model/1_OrtTransformersOptimization-... directory and was 13.5 GB on disk. I investigated ways to quantize the optimized ONNX model to INT4, but nothing seemed to work; every method kept giving errors. Also, inference using the optimized ONNX model didn't work for me. I believe the issues with quantization and inference were arising from something specific to 'optimizing' the ONNX model, but I am not sure about it.
Finally, I quantized and performed inference using onnxruntime-genai-directml 0.2.0rc3.
Quantization:
python -m onnxruntime_genai.models.builder -m mistralai/Mistral-7B-Instruct-v0.1 -e dml -p int4 -o ./models/mistral-int4
Inference:
python model-qa.py -m ./models/mistral-int4
The quantization was surprisingly fast (it took me only 1-2 minutes), yet the quality of the quantized model is really good and I saw no weird responses from the LLM. The size of this INT4 quantized model on disk is 3.97 GB.
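For reference, the model-qa.py inference above boils down to a short generation call with the onnxruntime-genai API. A minimal sketch, based on the 0.2.0-era examples (the exact API surface may differ between releases):

```python
import onnxruntime_genai as og

# Assumes the INT4 model built above with onnxruntime_genai.models.builder
model = og.Model("./models/mistral-int4")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("What is the lightest element?")

# generate() returns one token sequence per prompt; decode the first one
output_tokens = model.generate(params)[0]
print(tokenizer.decode(output_tokens))
```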