Getting reasonable performance on dual RTX 3090 and 128GB RAM
trilog-inc opened this issue · 7 comments
Hi,
First off thanks for all the work you guys have put into this.
I am trying to run DeepSeek-Coder-V2-Instruct-0724-GGUF Q4_K_M with reasonable performance but cannot figure it out. When I use the default configuration of the "DeepSeek-V2-Chat-multi-gpu.yaml" optimize file, I get about 0.7 t/s. I have tried to load some of the expert layers onto cuda:0 and cuda:1, but I hit OOM errors when more than one layer is used. Example YAML match:
- match:
    name: "^model\\.layers\\.(0|[1])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op: "KExpertsTorch" # remember to use the correct backend; KExpertsCPU is only runnable on CPU.
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.(0|[2-9]|[12][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
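As a quick sanity check on the two `name` patterns above, here is a minimal sketch that lists which layer indices each regex selects. It assumes module names of the form `model.layers.<i>.mlp.experts` and a 60-layer model; `NUM_LAYERS` and the helper `matched_layers` are assumptions for illustration and are not part of ktransformers.

```python
import re

# Patterns from the two YAML rules above, written as the regex YAML produces
# after unescaping "\\." to "\.".
GPU_RULE = r"^model\.layers\.(0|[1])\.mlp\.experts$"
CPU_RULE = r"^model\.layers\.(0|[2-9]|[12][0-9])\.mlp\.experts$"

NUM_LAYERS = 60  # assumption: adjust to the layer count of your model


def matched_layers(pattern: str, num_layers: int = NUM_LAYERS) -> list[int]:
    """Return the layer indices whose expert module name matches the pattern."""
    rx = re.compile(pattern)
    return [i for i in range(num_layers)
            if rx.match(f"model.layers.{i}.mlp.experts")]


gpu_layers = matched_layers(GPU_RULE)
cpu_layers = matched_layers(CPU_RULE)
overlap = sorted(set(gpu_layers) & set(cpu_layers))

print("GPU rule matches layers:", gpu_layers)
print("CPU rule matches layers:", cpu_layers)
print("Matched by both rules:  ", overlap)
```

With these two patterns, layer 0 is selected by both rules. How the injection framework resolves such an overlap is not covered in this thread, so keeping the patterns disjoint is probably the safer choice.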
GPU Usage:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 71956 C ...onda3/envs/ktransformers/bin/python 5156MiB |
| 1 N/A N/A 71956 C ...onda3/envs/ktransformers/bin/python 6864MiB |
+-----------------------------------------------------------------------------------------+
Has anyone been able to achieve reasonable results with this sort of setup?
System:
13th Gen Intel(R) Core(TM) i5-13600K
128GB DDR4 3200 ( 4 x 32GB )
2x RTX 3090
Hi, thanks for your interest in ktransformers.
DeepSeek-V2 in Q4_K_M requires 136 GB of RAM. With only 128 GB, data is frequently swapped in and out of RAM, which slashes your generation speed. My advice is to increase your RAM or use the IQ4_XS format model (125 GB).
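The sizing argument can be written out as a small check. This is only a sketch using the figures quoted in this reply (136 GB for Q4_K_M, 125 GB for IQ4_XS, 128 GB of system RAM); the function name `ram_margin_gb` and the `reserve_gb` parameter (memory set aside for the OS and other processes) are hypothetical.

```python
def ram_margin_gb(model_gb: float, ram_gb: float, reserve_gb: float = 0.0) -> float:
    """Positive result: headroom left in RAM. Negative result: shortfall (swapping likely)."""
    return ram_gb - reserve_gb - model_gb


SYSTEM_RAM_GB = 128  # from the system description in the first post
MODEL_SIZES_GB = {"Q4_K_M": 136, "IQ4_XS": 125}  # sizes quoted in this reply

margins = {name: ram_margin_gb(size, SYSTEM_RAM_GB)
           for name, size in MODEL_SIZES_GB.items()}

for name, margin in margins.items():
    status = "fits" if margin >= 0 else "does not fit"
    print(f"{name}: {status} in {SYSTEM_RAM_GB} GB RAM (margin {margin:+.0f} GB)")
```

Note that the IQ4_XS margin is small, so anything else resident in RAM reduces it further; passing a non-zero `reserve_gb` models that.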
Hi Azure, thanks for the reply.
Unfortunately, I am using a consumer motherboard in this setup, and its RAM is maxed out at 128GB.
However, I tried the IQ4_XS format with no optimize config, and the results are better:
prompt eval count: 26 token(s)
prompt eval duration: 1.7585856914520264s
prompt eval rate: 14.784607953071857 tokens/s
eval count: 921 token(s)
eval duration: 135.37466645240784s
eval rate: 6.803340862330292 tokens/s
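The reported rates follow directly from the counts and durations above. A small sketch of the arithmetic (the function name `tokens_per_second` is just illustrative):

```python
def tokens_per_second(token_count: int, duration_s: float) -> float:
    """Throughput as tokens divided by wall-clock seconds."""
    return token_count / duration_s


# Counts and durations copied from the log above.
prompt_rate = tokens_per_second(26, 1.7585856914520264)
eval_rate = tokens_per_second(921, 135.37466645240784)

print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate:        {eval_rate:.2f} tokens/s")

# Rough comparison against the ~0.7 t/s reported earlier for Q4_K_M.
print(f"speedup over 0.7 t/s: {eval_rate / 0.7:.1f}x")
```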
When I try to load it with the default DeepSeek-V2-Chat-multi-gpu.yaml, I get the following CUDA error as it starts loading onto the second GPU:
...
loading blk.29.ffn_norm.weight to cuda:0
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/mnt/data/myfrienderic/ktransformers/ktransformers/local_chat.py", line 159, in <module>
fire.Fire(local_chat)
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/myfrienderic/ktransformers/ktransformers/local_chat.py", line 106, in local_chat
optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
File "/mnt/data/myfrienderic/ktransformers/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
load_weights(module, gguf_loader)
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
module.load()
File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
module.load()
File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
module.load()
File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/linear.py", line 422, in load
self.generate_linear.load(w=w)
File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/linear.py", line 207, in load
w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
^^^^^^^^^^^^^^^^
File "/mnt/data/myfrienderic/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
w_ref, q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/myfrienderic/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 61, in quantize_weights
w = w.reshape((group_size, -1))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Loading the Q4_K_M with the same config completes correctly, but suffers from the poor performance described above.
Would it be possible to eventually leverage the extra 24GB of VRAM (plus the 12GB unused on the first GPU) to load a larger model than the system RAM can handle? In other words, is there a way to configure the optimize config to offload more of the model to the GPUs to compensate?
This was a bug; I have just fixed it.
Regarding your question:
"Would it be possible to eventually leverage the extra 24GB of VRAM ( + 12GB unused on the first GPU ) to load a larger model than the system ram can handle? As in is there a way to configure the optimize config to offload more of the model on the GPU to compensate"
You could modify your YAML to offload some of the experts from CPU to GPU and make use of your extra VRAM. You can find a detailed tutorial here.
Thanks for the update!
I will test this throughout the weekend.
Do you have an intuition about which parameters I should try to load first? I tried the "ktransformers.operators.experts.KTransformersExperts" class but triggered an OOM with one layer. I am not sure where to go next and would love your input.
Which backend are you using for ktransformers.operators.experts.KTransformersExperts?
I am using the following modification to the YAML:
- match:
    name: "^model\\.layers\\.(0|1)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsMarlin"
      generate_device: "cuda:0"
      generate_op: "KExpertsTorch" # remember to use the correct backend; KExpertsCPU is only runnable on CPU.
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.([2-9]|[12][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts # custom MoE Kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
If I use the Marlin backend, VRAM usage on the first GPU peaks at ~22GB during loading, then settles down to ~12GB once loading has finished.
During Loading:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 61% 55C P2 176W / 370W | 22343MiB / 24576MiB | 42% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 0% 52C P8 19W / 420W | 3MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10202 C ...onda3/envs/ktransformers/bin/python 22334MiB |
+-----------------------------------------------------------------------------------------+
After Loading:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 0% 44C P8 48W / 370W | 12877MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 0% 54C P8 18W / 420W | 7059MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10202 C ...onda3/envs/ktransformers/bin/python 12868MiB |
| 1 N/A N/A 10202 C ...onda3/envs/ktransformers/bin/python 7050MiB |
+-----------------------------------------------------------------------------------------+
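To put the two snapshots side by side, here is a minimal sketch of the headroom arithmetic. All figures are copied from the Memory-Usage column of the nvidia-smi output above; the variable and function names are illustrative only.

```python
TOTAL_MIB = 24576  # per-GPU capacity reported by nvidia-smi

# Memory-Usage values from the tables above, in MiB.
gpu0_peak_loading = 22343
gpu0_after_loading = 12877
gpu1_after_loading = 7059


def headroom_mib(used_mib: int, total_mib: int = TOTAL_MIB) -> int:
    """Free VRAM remaining on a GPU at a given usage level."""
    return total_mib - used_mib


# Extra memory needed only while loading, released afterwards.
transient_mib = gpu0_peak_loading - gpu0_after_loading

print("GPU 0 headroom at loading peak:", headroom_mib(gpu0_peak_loading), "MiB")
print("GPU 0 headroom after loading:  ", headroom_mib(gpu0_after_loading), "MiB")
print("GPU 1 headroom after loading:  ", headroom_mib(gpu1_after_loading), "MiB")
print("Transient loading overhead:    ", transient_mib, "MiB")
```

On these numbers, the loading peak rather than the steady-state usage is what leaves the least headroom on GPU 0, which would be consistent with the OOM errors appearing as soon as more expert layers are placed on that device.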
When I try to generate anything with the web UI, I get the following error in the command line:
/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/contextlib.py:105: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 257, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
await func()
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
message = await receive()
^^^^^^^^^^^^^^^
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 534, in receive
await self.message_event.wait()
File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/asyncio/locks.py", line 213, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f3b31cbcd50
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/applications.py", line 113, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
| raise exc
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
| await self.app(scope, receive, _send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 93, in __call__
| await self.simple_response(scope, receive, send, request_headers=headers)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 144, in simple_response
| await self.app(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
| raise exc
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
| await app(scope, receive, sender)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 715, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 735, in app
| await route.handle(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle
| await self.app(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 76, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
| raise exc
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
| await app(scope, receive, sender)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
| await response(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 250, in __call__
| async with anyio.create_task_group() as task_group:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 41, in capture
| logits=model(inputs_embeds=inputs_embeds,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
| outputs = self.model(
| ^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/models.py", line 719, in forward
| layer_outputs = decoder_layer(
| ^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1254, in forward
| hidden_states = self.mlp(hidden_states)
| ^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 652, in forward
| y = self.moe_on_cpuinfer(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
| return func(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 674, in moe_on_cpuinfer
| outs = self.experts(x, topk_ids, topk_weight)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 503, in forward
| return self.generate_experts.forward(input_tensor, expert_ids, weights)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 424, in forward
| idx, top_x = torch.where(expert_mask[expert_idx])
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| RuntimeError: CUDA error: operation not permitted when stream is capturing
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
|
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
| await func()
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 242, in stream_response
| async for chunk in self.body_iterator:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 101, in filter_api_event
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/runs.py", line 28, in inner
| async for event in ctx.work():
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/base.py", line 145, in work
| async for token in self.interface.inference(local_messages,self.thread.id):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 330, in inference
| for t in self.generate():
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
| response = gen.send(None)
| ^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 290, in generate
| next_token = self.decode_one_tokens()
| ^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 58, in decode_one_tokens
| self.cuda_graph_runner.capture(self.model, self.current_ids, self.active_cache_position.unsqueeze(0), self.active_cache_position, self.cache, main_device=torch_device, return_dict=False, use_cache=True)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 40, in capture
| with torch.cuda.graph(self.graph, stream = capture_stream):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 185, in __exit__
| self.cuda_graph.capture_end()
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 83, in capture_end
| super().capture_end()
| RuntimeError: CUDA error: operation failed due to a previous error during capture
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
|
+------------------------------------
The same error occurs if I load it with the Torch expert:
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
| return await self.app(scope, receive, send)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/applications.py", line 113, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
| raise exc
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
| await self.app(scope, receive, _send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 93, in __call__
| await self.simple_response(scope, receive, send, request_headers=headers)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 144, in simple_response
| await self.app(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
| raise exc
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
| await app(scope, receive, sender)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 715, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 735, in app
| await route.handle(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle
| await self.app(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 76, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
| raise exc
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
| await app(scope, receive, sender)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
| await response(scope, receive, send)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 250, in __call__
| async with anyio.create_task_group() as task_group:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 41, in capture
| logits=model(inputs_embeds=inputs_embeds,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
| outputs = self.model(
| ^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/models.py", line 719, in forward
| layer_outputs = decoder_layer(
| ^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1254, in forward
| hidden_states = self.mlp(hidden_states)
| ^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 652, in forward
| y = self.moe_on_cpuinfer(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
| return func(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 674, in moe_on_cpuinfer
| outs = self.experts(x, topk_ids, topk_weight)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
| return self._call_impl(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
| return forward_call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 503, in forward
| return self.generate_experts.forward(input_tensor, expert_ids, weights)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 424, in forward
| idx, top_x = torch.where(expert_mask[expert_idx])
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| RuntimeError: CUDA error: operation not permitted when stream is capturing
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
|
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
| await func()
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 242, in stream_response
| async for chunk in self.body_iterator:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 101, in filter_api_event
| async for event in async_events:
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/runs.py", line 28, in inner
| async for event in ctx.work():
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/base.py", line 145, in work
| async for token in self.interface.inference(local_messages,self.thread.id):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 330, in inference
| for t in self.generate():
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
| response = gen.send(None)
| ^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 290, in generate
| next_token = self.decode_one_tokens()
| ^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 58, in decode_one_tokens
| self.cuda_graph_runner.capture(self.model, self.current_ids, self.active_cache_position.unsqueeze(0), self.active_cache_position, self.cache, main_device=torch_device, return_dict=False, use_cache=True)
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 40, in capture
| with torch.cuda.graph(self.graph, stream = capture_stream):
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 185, in __exit__
| self.cuda_graph.capture_end()
| File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 83, in capture_end
| super().capture_end()
| RuntimeError: CUDA error: operation failed due to a previous error during capture
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
|
+------------------------------------
Any ideas on how to debug this?
+1 to this, I am also affected. I have an RTX 3090. When I configure the experts of layers 0 and 1 with
prefill_op: "KExpertsMarlin" and generate_op: "KExpertsTorch", VRAM usage rises to ~17.5GB and the model loads fine, but when I submit a prompt in the UI I get the same error that @myfrienderic shared above.
@Azure-Tang, please let us know if there is a fix.
Thanks!
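One thing that might be worth trying while waiting for a maintainer, offered as a guess from the traceback rather than a confirmed fix: the failure happens inside `self.generate_experts.forward` at `torch.where(expert_mask[expert_idx])`, while `cuda_graph_runner.capture` is recording the decode step. The single-argument form of `torch.where` returns a tensor whose size depends on the data, which needs a host sync, and that is not permitted during CUDA graph capture. Both reports use `generate_op: "KExpertsTorch"`, so the sketch below swaps only the generate path of the GPU-resident expert layers to `KExpertsMarlin`. Whether the installed version accepts `KExpertsMarlin` as a `generate_op`, and whether it fits in VRAM, are assumptions I have not verified.

```yaml
# Sketch only: same rule as in the original post, with the generate path changed.
- match:
    name: "^model\\.layers\\.(0|1)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      # Assumption: KExpertsMarlin is valid here and avoids the data-dependent
      # torch.where call that breaks CUDA graph capture in KExpertsTorch.
      generate_op: "KExpertsMarlin"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
```

If the error persists with this rule, the remaining layers that use `generate_op: "KExpertsCPU"` should be unaffected, so comparing a run with only layer 0 on the GPU against a run with none would help narrow it down.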