bug: TensorRT - Switching between models causes error "satisfyProfile Runtime dimension does not satisfy any optimization profile"
Van-QA opened this issue · 2 comments
Van-QA commented
Describe the bug
Switching between TensorRT models causes the error "satisfyProfile Runtime dimension does not satisfy any optimization profile".
Steps to reproduce
- Attempt to switch back and forth between multiple models.
- Observe that the transition is not smooth and the application becomes unresponsive.
- Repeat the process.
- Something is amiss in the UI, and app.log records an error.
- Users are forced to close and reopen the application to resolve the issue.
Expected behavior
Switching between TensorRT models in Jan should be a seamless transition, without the application freezing or becoming unresponsive. Users should be able to switch between models smoothly, without disruptions or glitches.
Environment details
- Operating System: Windows 11
- Jan Version: Jan v0.4.8-325
[1710424192] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1585][llama_server_context::update_slots] slot 0 released (1638 tokens in cache)
2024-03-14T13:49:52.993Z [NITRO]::Debug: 20240314 13:49:52.991000 UTC 26240 INFO Wait for task to be released:6 - llamaCPP.cc:405
20240314 13:49:52.991000 UTC 43012 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535
[1710424192] [D:\a\nitro\nitro\controllers\llamaCPP.h: 882][llama_server_context::launch_slot_with_data] slot 0 is processing [task id: 6]
2024-03-14T13:49:53.000Z [NITRO]::Debug: [1710424192] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1722][llama_server_context::update_slots] slot 0 : kv cache rm - [0, end)
2024-03-14T13:50:10.996Z [NITRO]::Debug: [1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 475][llama_client_slot::print_timings]
[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 480][llama_client_slot::print_timings] print_timings: prompt eval time = 14552.11 ms / 1682 tokens ( 8.65 ms per token, 115.58 tokens per second)
[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 485][llama_client_slot::print_timings] print_timings: eval time = 3452.00 ms / 94 runs ( 36.72 ms per token, 27.23 tokens per second)
[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 487][llama_client_slot::print_timings] print_timings: total time = 18004.10 ms
[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1585][llama_server_context::update_slots] slot 0 released (1777 tokens in cache)
2024-03-14T13:50:23.932Z [NITRO]::Debug: Request to kill Nitro
2024-03-14T13:50:23.935Z [NITRO]::Debug: 20240314 13:50:10.993000 UTC 43012 INFO reached result stop - llamaCPP.cc:365
20240314 13:50:10.993000 UTC 43012 INFO End of result - llamaCPP.cc:338
20240314 13:50:11.068000 UTC 26240 INFO Task completed, release it - llamaCPP.cc:408
20240314 13:50:23.934000 UTC 2088 INFO Program is exitting, goodbye! - processManager.cc:8
20240314 13:50:23.934000 UTC 2088 INFO changed to false - llamaCPP.cc:680
[1710424223] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1585][llama_server_context::update_slots] slot 0 released (1777 tokens in cache)
2024-03-14T13:50:24.953Z [TENSORRT_LLM_NITRO]::Debug:Request to kill engine
2024-03-14T13:50:27.489Z [NITRO]::Debug: Nitro process is terminated
2024-03-14T13:50:27.490Z [TENSORRT_LLM_NITRO]::Debug:Engine process is terminated
2024-03-14T13:50:27.490Z [TENSORRT_LLM_NITRO]::Debug:Spawning engine subprocess...
2024-03-14T13:50:27.490Z [TENSORRT_LLM_NITRO]::Debug:Spawn nitro at path: C:\Users\dan\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin\nitro.exe, and args: 1,127.0.0.1,3928
2024-03-14T13:50:27.495Z [NITRO]::Debug: Nitro exited with code: 3221226505
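For triage, the Nitro exit code above can be decoded: 3221226505 is 0xC0000409, the Windows NTSTATUS value STATUS_STACK_BUFFER_OVERRUN (also raised by `__fastfail`), which suggests the llama.cpp Nitro process was terminated abnormally rather than shutting down cleanly. A quick check:

```python
# Decode the Windows exit code reported by Nitro in the log above.
code = 3221226505
print(hex(code))  # 0xc0000409 -> NTSTATUS STATUS_STACK_BUFFER_OVERRUN / __fastfail
```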
2024-03-14T13:50:27.682Z [TENSORRT_LLM_NITRO]::Debug: [ANSI-colored ASCII art startup banner omitted]
2024-03-14T13:50:27.808Z [TENSORRT_LLM_NITRO]::Debug:Engine is ready
2024-03-14T13:50:27.809Z [TENSORRT_LLM_NITRO]::Debug:Loading model with params {"engine_path":"C:\\Users\\dan\\jan\\models\\llamacorn-1.1b-chat-fp16","ctx_len":2048}
2024-03-14T13:50:28.731Z [TENSORRT_LLM_NITRO]::Debug: [ANSI-colored ASCII art startup banner omitted]
20240314 13:50:27.681000 UTC 10304 INFO Nitro version: undefined - main.cc:57
20240314 13:50:27.681000 UTC 10304 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:59
20240314 13:50:27.681000 UTC 10304 INFO Please load your model - main.cc:60
20240314 13:50:27.681000 UTC 10304 INFO Number of thread is:1 - main.cc:68
[TensorRT-LLM][INFO] Set logger level by INFO
2024-03-14T13:50:28.746Z [TENSORRT_LLM_NITRO]::Debug:20240314 13:50:28.743000 UTC 12044 INFO Successully loaded the tokenizer - tensorrtllm.h:53
20240314 13:50:28.743000 UTC 12044 INFO Loaded tokenizer - tensorrtllm.cc:354
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
2024-03-14T13:50:28.753Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] MPI size: 1, rank: 0
2024-03-14T13:50:31.055Z [TENSORRT_LLM_NITRO]::Debug:20240314 13:50:28.753000 UTC 12044 INFO Engine Path : C:\Users\dan\jan\models\llamacorn-1.1b-chat-fp16\rank0.engine - tensorrtllm.cc:361
[TensorRT-LLM][INFO] Loaded engine size: 2100 MiB
2024-03-14T13:50:31.063Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
2024-03-14T13:50:31.537Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20760, GPU 3251 (MiB)
2024-03-14T13:50:31.550Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +6, GPU +8, now: CPU 20766, GPU 3259 (MiB)
2024-03-14T13:50:31.567Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2098, now: CPU 0, GPU 2098 (MiB)
2024-03-14T13:50:31.574Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20813, GPU 3353 (MiB)
2024-03-14T13:50:31.576Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +6, GPU +8, now: CPU 20819, GPU 3361 (MiB)
2024-03-14T13:50:31.688Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
2024-03-14T13:50:31.694Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] Allocate 255328256 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 201984 tokens in paged KV cache.
2024-03-14T13:50:31.822Z [TENSORRT_LLM_NITRO]::Debug:Load model success with response {}
2024-03-14T13:50:31.916Z [TENSORRT_LLM_NITRO]::Debug:20240314 13:50:31.856000 UTC 12044 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::nvinfer1::rt::ExecutionContext::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::nvinfer1::rt::ExecutionContext::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)
job aborted:
[ranks] message
[0] application aborted
aborting MPI_COMM_WORLD (comm=0x44000000), error 1, comm rank 0
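For anyone triaging: the `setInputShape` failure means the runtime input shape falls outside every optimization profile the engine was built with. A TensorRT engine is compiled with fixed [min, opt, max] bounds per dynamic dimension, so if a request after the model switch has dimensions (e.g. prompt length) beyond the new engine's max, this exact error is raised. The sketch below illustrates the check with made-up bounds; the profile values and the 1682-token prompt are illustrative (the token count comes from the timing log above, the bounds are hypothetical, not read from the real engine config):

```python
# Hypothetical sketch of TensorRT's profile check: a runtime shape must fit
# the [min, max] bounds of at least one optimization profile, per dimension.

def satisfies_profile(shape, profile):
    """Check each runtime dimension against the profile's (min, opt, max) bounds."""
    return all(lo <= dim <= hi for dim, (lo, _opt, hi) in zip(shape, profile))

# Suppose the engine was built for batch 1 and at most 1024 input tokens:
profiles = [
    [(1, 1, 1), (1, 512, 1024)],  # (min, opt, max) per dimension
]

runtime_shape = (1, 1682)  # e.g. the 1682-token prompt from the log above
ok = any(satisfies_profile(runtime_shape, p) for p in profiles)
print(ok)  # False: 1682 > 1024, so no profile is satisfied -> satisfyProfile error
```

If this matches what is happening, the fix would be either rebuilding the engine with wider profile bounds or clamping/validating request dimensions before dispatching to the newly loaded engine.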
Originally posted by @dan-jan in janhq/jan#2358 (comment)
github-actions commented
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.