EricLBuehler/mistral.rs

CUDA_ERROR_UNSUPPORTED_PTX_VERSION on Jetson AGX Orin

bmgxyz opened this issue · 2 comments

Describe the bug

When I run this command:

cargo run --bin mistralrs-server --release --features "cuda" -- -i gguf -m /external/bradley/llama.cpp/models -f llama-31-70B-Q4-K-M.gguf

I get the following error:

Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading dequantize_block_q4_K_f32

I have used this same model file with llama.cpp on the same platform, so I don't think the file is the problem.

Click to see full output
    Finished `release` profile [optimized] target(s) in 0.36s
     Running `target/release/mistralrs-server -i gguf -m /external/bradley/llama.cpp/models -f llama-31-70B-Q4-K-M.gguf`
2024-10-19T20:08:24.006809Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-19T20:08:24.007090Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-19T20:08:24.007133Z  INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2024-10-19T20:08:24.007232Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/bradley/.cache/huggingface/token", using no HF token.
2024-10-19T20:08:24.007350Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/bradley/.cache/huggingface/token", using no HF token.
2024-10-19T20:08:24.007379Z  INFO mistralrs_core::pipeline::paths: Loading `llama-31-70B-Q4-K-M.gguf` locally at `/external/bradley/llama.cpp/models/llama-31-70B-Q4-K-M.gguf`
2024-10-19T20:08:24.007642Z  INFO mistralrs_core::pipeline::gguf: Loading model `/external/bradley/llama.cpp/models` on cuda[0].
2024-10-19T20:08:24.573480Z  INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.base_model.0.name: Meta Llama 3.1 70B
general.base_model.0.organization: Meta Llama
general.base_model.0.repo_url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B
general.base_model.count: 1
general.file_type: 15
general.finetune: 33101ce6ccc08fa6249c10a543ebfcac65173393
general.languages: en, de, fr, it, pt, hi, es, th
general.license: llama3.1
general.name: 33101ce6ccc08fa6249c10a543ebfcac65173393
general.quantization_version: 2
general.size_label: 71B
general.tags: facebook, meta, pytorch, llama, llama-3, text-generation
general.type: model
llama.attention.head_count: 64
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 80
llama.context_length: 131072
llama.embedding_length: 8192
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-10-19T20:08:24.982454Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-10-19T20:08:24.993793Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = "26 Jul 2024" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- "Environment: ipython\n" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}\n{%- endif %}\n{{- "Cutting Knowledge Date: December 2023\n" }}\n{{- "Today Date: " + date_string + "\n\n" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}\n    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}\n    {{- "Do not use variables.\n\n" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- "\n\n" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- "<|eot_id|>" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}\n    {{- "Given the following functions, please respond with a JSON for a function call " }}\n    {{- "with its proper arguments that best answers the given prompt.\n\n" }}\n    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}\n    {{- "Do not use variables.\n\n" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- "\n\n" }}\n    {%- endfor %}\n    {{- first_user_message + "<|eot_id|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception("This model only supports single tool-calls at once!") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}\n            {{- "<|python_tag|>" + tool_call.name + ".call(" }}\n            {%- for arg_name, arg_val in tool_call.arguments | items %}\n                {{- arg_name + '="' + arg_val + '"' }}\n                {%- if not loop.last %}\n                    {{- ", " }}\n                {%- endif %}\n                {%- endfor %}\n            {{- ")" }}\n        {%- else  %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}\n            {{- '{"name": "' + tool_call.name + '", ' }}\n            {{- '"parameters": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- "}" }}\n        {%- endif %}\n        {%- if builtin_tools is defined %}\n            {#- This means we're in ipython mode #}\n            {{- "<|eom_id|>" }}\n        {%- else %}\n            {{- "<|eot_id|>" }}\n        {%- endif %}\n    {%- elif message.role == "tool" or message.role == "ipython" %}\n        {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- "<|eot_id|>" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}\n{%- endif %}\n`
Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading dequantize_block_q4_K_f32

My system is an Nvidia Jetson AGX Orin 64 GB Developer Kit.

Click to show output of deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.2 / 12.4
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 62841 MBytes (65893945344 bytes)
  (016) Multiprocessors, (128) CUDA Cores/MP:    2048 CUDA Cores
  GPU Max Clock rate:                            1300 MHz (1.30 GHz)
  Memory Clock rate:                             1300 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.4, NumDevs = 1
Result = PASS

Could this be a problem with ARM support? Or maybe a build.rs in some dependency is using the wrong version of nvcc somehow? Also possible that this is a usage error on my part.

Latest commit or version

32e8945, current master as of writing. Also tried v0.3.1 with the same result.

yeah, cuda does not work for me neither for NVIDIA L4 #850