CUDA_ERROR_UNSUPPORTED_PTX_VERSION on Jetson AGX Orin
bmgxyz opened this issue · 2 comments
bmgxyz commented
Describe the bug
When I run this command:
cargo run --bin mistralrs-server --release --features "cuda" -- -i gguf -m /external/bradley/llama.cpp/models -f llama-31-70B-Q4-K-M.gguf
I get the following error:
Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading dequantize_block_q4_K_f32
I have used this same model file with llama.cpp on the same platform, so I don't think the file is the problem.
Full output:
Finished `release` profile [optimized] target(s) in 0.36s
Running `target/release/mistralrs-server -i gguf -m /external/bradley/llama.cpp/models -f llama-31-70B-Q4-K-M.gguf`
2024-10-19T20:08:24.006809Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2024-10-19T20:08:24.007090Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-10-19T20:08:24.007133Z INFO mistralrs_server: Model kind is: gguf quantized from gguf (no adapters)
2024-10-19T20:08:24.007232Z INFO mistralrs_core::utils::tokens: Could not load token at "/home/bradley/.cache/huggingface/token", using no HF token.
2024-10-19T20:08:24.007350Z INFO mistralrs_core::utils::tokens: Could not load token at "/home/bradley/.cache/huggingface/token", using no HF token.
2024-10-19T20:08:24.007379Z INFO mistralrs_core::pipeline::paths: Loading `llama-31-70B-Q4-K-M.gguf` locally at `/external/bradley/llama.cpp/models/llama-31-70B-Q4-K-M.gguf`
2024-10-19T20:08:24.007642Z INFO mistralrs_core::pipeline::gguf: Loading model `/external/bradley/llama.cpp/models` on cuda[0].
2024-10-19T20:08:24.573480Z INFO mistralrs_core::gguf::content: Model config:
general.architecture: llama
general.base_model.0.name: Meta Llama 3.1 70B
general.base_model.0.organization: Meta Llama
general.base_model.0.repo_url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B
general.base_model.count: 1
general.file_type: 15
general.finetune: 33101ce6ccc08fa6249c10a543ebfcac65173393
general.languages: en, de, fr, it, pt, hi, es, th
general.license: llama3.1
general.name: 33101ce6ccc08fa6249c10a543ebfcac65173393
general.quantization_version: 2
general.size_label: 71B
general.tags: facebook, meta, pytorch, llama, llama-3, text-generation
general.type: model
llama.attention.head_count: 64
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 80
llama.context_length: 131072
llama.embedding_length: 8192
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 500000
llama.vocab_size: 128256
2024-10-19T20:08:24.982454Z INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `gpt2`, kind: `Bpe`, num tokens: 128256, num added tokens: 0, num merges: 280147, num scores: 0
2024-10-19T20:08:24.993793Z INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = "26 Jul 2024" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- "Environment: ipython\n" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}\n{%- endif %}\n{{- "Cutting Knowledge Date: December 2023\n" }}\n{{- "Today Date: " + date_string + "\n\n" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}\n {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- "<|eot_id|>" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}\n {{- "Given the following functions, please respond with a JSON for a function call " }}\n {{- "with its proper arguments that best answers the given prompt.\n\n" }}\n {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' 
}}\n {{- "Do not use variables.\n\n" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- "\n\n" }}\n {%- endfor %}\n {{- first_user_message + "<|eot_id|>"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception("This model only supports single tool-calls at once!") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}\n {{- "<|python_tag|>" + tool_call.name + ".call(" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '="' + arg_val + '"' }}\n {%- if not loop.last %}\n {{- ", " }}\n {%- endif %}\n {%- endfor %}\n {{- ")" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}\n {{- '{"name": "' + tool_call.name + '", ' }}\n {{- '"parameters": ' }}\n {{- tool_call.arguments | tojson }}\n {{- "}" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- "<|eom_id|>" }}\n {%- else %}\n {{- "<|eot_id|>" }}\n {%- endif %}\n {%- elif message.role == "tool" or message.role == "ipython" %}\n {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- "<|eot_id|>" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}\n{%- endif %}\n`
Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading dequantize_block_q4_K_f32
My system is an Nvidia Jetson AGX Orin 64 GB Developer Kit.
Output of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Orin"
CUDA Driver Version / Runtime Version 12.2 / 12.4
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 62841 MBytes (65893945344 bytes)
(016) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.4, NumDevs = 1
Result = PASS
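For what it's worth, my (possibly wrong) reading of this error is that CUDA_ERROR_UNSUPPORTED_PTX_VERSION usually means the PTX was generated by a CUDA toolkit newer than what the installed driver's JIT compiler accepts, and the deviceQuery output above (driver 12.2 vs. runtime/toolkit 12.4) looks consistent with that. Here is a minimal standalone check I used to confirm the version skew; it is not part of mistral.rs and just links directly against the CUDA runtime (the library path is an example, adjust as needed):

```rust
// check_cuda.rs - standalone sanity check, not part of mistral.rs.
// Compares the CUDA version the driver supports with the version of the
// runtime/toolkit this binary was built against.
// Build with e.g.:
//   rustc check_cuda.rs -L /usr/local/cuda/lib64 -l cudart
#[link(name = "cudart")]
extern "C" {
    // cudaError_t cudaDriverGetVersion(int *driverVersion);  0 == cudaSuccess
    fn cudaDriverGetVersion(driver_version: *mut i32) -> i32;
    // cudaError_t cudaRuntimeGetVersion(int *runtimeVersion);
    fn cudaRuntimeGetVersion(runtime_version: *mut i32) -> i32;
}

fn main() {
    let (mut driver, mut runtime) = (0i32, 0i32);
    unsafe {
        assert_eq!(cudaDriverGetVersion(&mut driver), 0);
        assert_eq!(cudaRuntimeGetVersion(&mut runtime), 0);
    }
    // Versions are encoded as major * 1000 + minor * 10 (e.g. 12020 == 12.2).
    println!("driver supports CUDA {}.{}", driver / 1000, (driver % 100) / 10);
    println!("runtime/toolkit is CUDA {}.{}", runtime / 1000, (runtime % 100) / 10);
    if runtime > driver {
        println!(
            "toolkit is newer than the driver; PTX built with it may fail to \
             JIT-load with CUDA_ERROR_UNSUPPORTED_PTX_VERSION"
        );
    }
}
```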
Could this be a problem with ARM support? Or maybe a build.rs in some dependency is using the wrong version of nvcc somehow? It's also possible that this is a usage error on my part.
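To illustrate what I mean by the build.rs theory, here is a rough sketch of how a dependency's build script typically compiles its kernels to PTX. This is not the actual candle/mistral.rs build script, and the environment variable names (`NVCC`, `CUDA_COMPUTE_CAP`) are just illustrative; the point is only that whichever nvcc is found at build time determines the PTX ISA version:

```rust
// build.rs sketch (hypothetical; the real candle/mistral.rs build scripts differ).
// Kernel crates generally shell out to `nvcc` at build time and emit PTX that the
// driver JIT-compiles when the kernel is first loaded. If the nvcc found on PATH
// belongs to a newer toolkit than the installed driver (12.4 vs. 12.2 here), the
// PTX ISA is too new and loading fails with CUDA_ERROR_UNSUPPORTED_PTX_VERSION.
use std::process::Command;

fn main() {
    // Illustrative overrides only; the real dependency may honour different
    // (or no) environment variables.
    let nvcc = std::env::var("NVCC").unwrap_or_else(|_| "nvcc".to_string());
    let cap = std::env::var("CUDA_COMPUTE_CAP").unwrap_or_else(|_| "87".to_string());

    let status = Command::new(&nvcc)
        .arg("--ptx")                   // emit PTX rather than a device binary
        .arg(format!("-arch=sm_{cap}")) // Orin is compute capability 8.7
        .arg("src/dequantize.cu")       // hypothetical kernel source
        .arg("-o")
        .arg("dequantize.ptx")
        .status()
        .expect("failed to spawn nvcc");
    assert!(status.success(), "nvcc failed");

    println!("cargo:rerun-if-changed=src/dequantize.cu");
}
```

If that diagnosis is right, the usual fix would be to build against an nvcc that matches the driver, or to update the driver/JetPack so it supports the newer toolkit, but I'm not certain that's what is happening here.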
Latest commit or version
32e8945, current master as of writing. Also tried v0.3.1 with the same result.
nikolaydubina commented
yeah, CUDA does not work for me either, on an NVIDIA L4 (see #850)
nikolaydubina commented
btw, verify your environment with this: http://github.com/nikolaydubina/basic-openai-pytorch-server