vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

C++ · Apache-2.0

Pinned issues

ScaleLLM Roadmap

#84 opened 6 months ago by guocuimi

Open3

Issues

Will the result callback be called in a thread-safe/coroutine-safe way? #322
#323 opened a month ago by tp-nan
7
Mistral large GPTQ model inference problem
#308 opened a month ago by drdaliang
3
RuntimeError: Timed out
#310 opened a month ago by spongxin
1
The process terminated before reaching the specified max_tokens after setting ignore_eos=True and max_tokens.
#304 opened a month ago by HowardChenRV
3
[Issue] Qwen-14B-Chat init failure and performance issue.
#275 opened 2 months ago by liutongxuan
2
Deployment of glm-4-9b-chat model fails with SentencePiece tokenizer error
#291 opened 2 months ago by dengyingxu
4
Are there any plans to support Int8 weight quantization?
#276 opened 2 months ago by sitabulaixizawaluduo
2
pytest core dump in workflow
#258 opened 3 months ago by guocuimi
0
ScaleLLM vs vLLM in performance
#144 opened 5 months ago by WangErXiao
20
CUDA graph capture may occasionally become stuck with multiple GPUs.
#131 opened 3 months ago by guocuimi
0
pip install scalellm failure.
#212 opened 4 months ago by liutongxuan
1
install cpython shared lib in manylinux docker image
#215 opened 4 months ago by guocuimi
0
ScaleLLM Roadmap
#84 opened 6 months ago by guocuimi
3
[Core] Core dump on the chatglm3 model using scalellm.
#221 opened 4 months ago by liutongxuan
1
[Correctness] Incorrect output on the baichuan2 model using scalellm.
#222 opened 4 months ago by liutongxuan
1
[Correctness] Using llama-2-7b-hf, scalellm's output differs from vllm's output.
#220 opened 4 months ago by liutongxuan
0
Developing Python wrapper for easier integration
#161 opened 4 months ago by guocuimi
0
Adding more benchmarks and unit tests for kernels and dependencies
#157 opened 5 months ago by guocuimi
0
Does the current OpenAI-compatible API support function calls?
#121 opened 6 months ago by lyj555
1
LoRA: QLoRA/S-LoRA: Serving thousands of LoRA adapters
#166 opened 5 months ago by guocuimi
0
Introducing the Mamba model
#165 opened 5 months ago by guocuimi
0
Introducing a ring attention mechanism for handling long contexts
#164 opened 5 months ago by guocuimi
0
Quantization: Supporting FP8 for both models and KV caches
#163 opened 5 months ago by guocuimi
0
Enhancing documentation for improved usability
#162 opened 5 months ago by guocuimi
0
Exploring other chips such as TPU, etc.
#160 opened 5 months ago by guocuimi
0
Loosening coupling with PyTorch for easy deployment
#159 opened 5 months ago by guocuimi
0
Adding more Prometheus metrics and creating a Grafana dashboard for monitoring.
#158 opened 5 months ago by guocuimi
0
Extending support to macOS and Windows platforms
#156 opened 5 months ago by guocuimi
0
Structural Decoding: Function Calling
#155 opened 5 months ago by guocuimi
0
Structural Decoding: Json format
#154 opened 5 months ago by guocuimi
0
Structural Decoding: Json format
#153 opened 5 months ago by guocuimi
0
GPU Arch: Turing architecture (sm75)
#152 opened 5 months ago by guocuimi
0
Adding support for Apple chips
#151 opened 5 months ago by guocuimi
0
Introducing multi-modal models (LLaVA model)
#150 opened 5 months ago by guocuimi
0
Implementing MoE (Mixture of Experts) kernels
#149 opened 5 months ago by guocuimi
0
Implementing fused FFN (Feed-Forward Network) to enhance efficiency
#148 opened 5 months ago by guocuimi
0
Exploring the feasibility of adopting the flashinfer library
#147 opened 5 months ago by guocuimi
0
Exploring lookahead decoding support
#146 opened 5 months ago by guocuimi
0
Support for Visual Models (i.e. LLaVA)
#75 opened 6 months ago by omarmhaimdat
4
[baichuan2-7b] random core dump in offline batched inference.
#83 opened 6 months ago by liutongxuan
2
Is there any way to use flash_attn 1.x to support more GPUs?
#69 opened 7 months ago by dalamudx
2
The output from the API lacks "usage" content, which is causing compatibility issues when trying to use the API with other tools.
#34 opened 7 months ago by BUJIDAOVS
2
Driver Version: 535.54.03 CUDA Version: 12.2, fails at runtime with the error "OpenAI API returned an error 503: {"error":{"code":14,"message":"connection error: desc = \"transport: Error while dialing: dial tcp: lookup scalellm on 127.0.0.11:53: server misbehaving\""}}"
#37 opened 7 months ago by Missliuff
3
Tensor Cores are not activated when the input tensor is not a multiple of 8
#48 opened 7 months ago by guocuimi
1
Can Mac M1 be supported?
#49 opened 8 months ago by zyxcambridge
1
How to Change gRPC server IP in REST API Server docker?
#30 opened 9 months ago by gunpal5
1
How to implement response stream in custom UI?
#39 opened 9 months ago by latheesan-k
1
scalellm exited with code 137
#38 opened 9 months ago by yisiliang
2
grpc server connection error
#32 opened 9 months ago by Arcmoon-Hu
9
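When using yi-34b, the model does not stop generating on its own and keeps producing low-quality, repetitive content. How should this be adjusted?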
#31 opened 9 months ago by BUJIDAOVS
3