triton-inference-server/server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Python · BSD-3-Clause
Issues
[feature request] Real-time streaming inference load generation by `perf_analyzer`
#8059 opened by vadimkantorov - 0
OpenAI Frontend Batch Support
#8058 opened by Loc8888 - 1
RFE: Function calling in OpenAI Frontend
#8048 opened by thehumit - 3
Significant performance degradation when using OpenAI Frontend + streaming
#8045 opened by jolyons123 - 0
Performance Discrepancy Between NVIDIA Triton and Direct Faster-Whisper Inference
#8016 opened by YuBeomGon - 0
CUDA Race Condition in TensorRT GEMM Kernel when Triton Inference Server Loads a TensorRT Model
#8057 opened by neezeeyee - 2
InferenceServerException: [408] an exception occurred in the client while decoding the response: Parse error at offset 0: Invalid value.
#8051 opened by TopAgrume - 5
Segmentation fault crash due to a race condition in request cancellation (with fix proposal)
#8034 opened by lunwang-ttd - 0
Bazel support and tagged releases for individual repos
#8049 opened by arpit15 - 4
[Question] How can I limit the length of the input context and the number of tokens to generate?
#8029 opened by ArtemBiliksin - 3
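How to cap generation depends on the backend in use. For the vLLM backend, per-request sampling options such as max_tokens go in the optional `sampling_parameters` JSON input, while the input context length is typically capped by an engine setting (e.g. vLLM's max_model_len) rather than per request. A minimal client sketch, assuming the vllm_backend example model's tensor names (`text_input`, `sampling_parameters`, `text_output`) and a model named `vllm_model`:

```python
# Sketch only: tensor names follow the vllm_backend example model and
# may differ for other backends or custom configs.
import json
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

text = np.array(["What is Triton?"], dtype=object)
# max_tokens caps the number of generated tokens for this request.
params = np.array([json.dumps({"max_tokens": 128})], dtype=object)

inputs = [
    grpcclient.InferInput("text_input", text.shape, "BYTES"),
    grpcclient.InferInput("sampling_parameters", params.shape, "BYTES"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(params)

result = client.infer(model_name="vllm_model", inputs=inputs)
print(result.as_numpy("text_output"))
```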
`build.py` sets Docker build args for secrets even when the build-secret flag is not present
#7992 opened by BenjaminBraunDev - 0
How to Send FP16 Input Tensors Using gRPC in C# for NVIDIA Triton Inference Server?
#8044 opened by Madihaa-Shaikh - 0
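The question targets C#, but the mechanics are language-independent: declare the input datatype as "FP16" and ship the raw little-endian IEEE 754 half-precision bytes. A Python sketch of the wire-level idea, with hypothetical model and tensor names:

```python
# Sketch only: "my_fp16_model", "INPUT__0", and "OUTPUT__0" are placeholders.
# A C# client would do the same thing, packing Half/ushort values into the
# raw input contents and declaring the datatype as "FP16".
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

data = np.random.rand(1, 3, 224, 224).astype(np.float16)  # half precision
inp = grpcclient.InferInput("INPUT__0", data.shape, "FP16")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_fp16_model", inputs=[inp])
print(result.as_numpy("OUTPUT__0").dtype)  # float16
```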
Method 'forward' is not defined error!
#7968 opened by MHmi1 - 2
`k8s-onprem` Chart doesn't work with OpenShift's default security posture
#8004 opened by jharmison-redhat - 1
Python Backend on Windows
#8012 opened by mhbassel - 0
Triton LLM OpenAI LangGraph tool call
#8033 opened by GGN1994 - 0
Python backend without GIL
#8032 opened by zeruniverse - 0
Request Cancellation
#8030 opened by MichalPogodski - 1
First value replicated over entire input array
#8025 opened by FCollaPi - 4
Got runtime error `0 active drivers ([]). There should only be one.` when using PipelineModule through Ray and DeepSpeed
#8007 opened by consciousgaze - 0
Infinite pending status for 3 days after launching server
#8028 opened by nbowon - 1
Memory leak
#8026 opened by aTunass - 0
"output tensor shape does not match size of output" when using python backend and providing a custom environment
#8019 opened by Isuxiz - 0
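This error typically means the array returned by model.py disagrees with the dims or datatype declared for the output in config.pbtxt; custom execution environments make it easy to pick up a different numpy and silently change dtypes. A minimal Python-backend sketch with hypothetical tensor names:

```python
# model.py sketch: the numpy array passed to pb_utils.Tensor must match the
# dims and datatype declared for "OUTPUT0" in config.pbtxt, or Triton reports
# a shape/size mismatch at response time.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Same shape as the input; cast explicitly so the dtype is stable
            # regardless of which numpy the custom environment provides.
            out = in0.as_numpy().astype(np.float32) * 2.0
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT0", out)]
                )
            )
        return responses
```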
Inconsistent HF token requirements for cached gated models: Triton vs vLLM deployments
#8020 opened by haka-qylis - 4
Performance issue - High queue times in perf_analyzer
#7986 opened by asaff1 - 1
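High queue times usually point at the model configuration rather than at perf_analyzer itself: too few model instances, or dynamic batching holding requests while waiting to fill a batch. A config.pbtxt sketch of the two knobs commonly tuned first (values are illustrative, not recommendations):

```
# Sketch only: run more instances per GPU and cap how long the scheduler
# may hold a request waiting for a fuller batch.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```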
Unable to load model from S3 bucket
#8008 opened by jmlaubach - 1
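For s3:// model repositories, Triton reads the standard AWS credential environment variables; a sketch with a hypothetical bucket name:

```sh
# Sketch only: credentials and bucket are placeholders.
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
tritonserver --model-repository=s3://my-bucket/model_repository
```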
Got CMake error "CMAKE_CUDA_ARCHITECTURES must be non-empty if set" when building without Docker
#8003 opened by simonzgx - 1
ONNX Model IR Version 10 Support
#8001 opened by RohanAdwankar - 2
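Models exported with a newer onnx package declare IR version 10, which older ONNX Runtime builds can reject at load time. A workaround some users report, sketched below, is to lower the declared IR version and re-save, assuming the graph uses no IR-10-only features; re-validate outputs afterwards:

```python
# Sketch of a reported workaround, not a guaranteed fix: lowering ir_version
# only changes the declared container version, so verify the model still
# checks out and produces identical outputs.
import onnx

model = onnx.load("model.onnx")
print("IR version:", model.ir_version)

model.ir_version = 9  # assumption: no IR-10-only features in the graph
onnx.checker.check_model(model)
onnx.save(model, "model_ir9.onnx")
```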
Can't build r25.01 (r24.12 builds okay) on `ubuntu-22.04` (unclear build errors). Also can't build `r24.12` on `ubuntu-24.04` (C++ errors)
#7997 opened by vadimkantorov - 0
The systems look identical, but errors occur on some machines and the cause is unknown
#7996 opened by coder-2014 - 0
Batching
#7994 opened by riyajatar37003 - 0
ERROR: failed to solve: nvcr.io/nvidia/tritonserver:24.08-py3: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://nvcr.io/proxy_auth?scope=repository%3Anvidia%2Ftritonserver%3Apull: 401
#7988 opened by monajalal - 1
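A 401 from nvcr.io usually means the pull is unauthenticated or cached credentials are stale. NGC expects the literal username $oauthtoken with an NGC API key as the password; a sketch:

```sh
# Sketch: the username is the literal string $oauthtoken (quoted so the
# shell does not expand it), not your NGC account name.
docker login nvcr.io --username '$oauthtoken'
# Password prompt: paste an NGC API key generated at ngc.nvidia.com
docker pull nvcr.io/nvidia/tritonserver:24.08-py3
```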
Unable to get response from BLS async call
#7982 opened by riyajatar37003 - 0
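In the Python backend, a BLS request issued with async_exec() returns an awaitable, so it must be awaited from a coroutine execute(); calling it from a synchronous execute() never yields a response. A minimal sketch with hypothetical model and tensor names:

```python
# model.py sketch: "downstream_model", "INPUT0", and "OUTPUT0" are
# placeholders. execute() is declared async so async_exec() can be awaited.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            bls_request = pb_utils.InferenceRequest(
                model_name="downstream_model",
                requested_output_names=["OUTPUT0"],
                inputs=[in0],
            )
            bls_response = await bls_request.async_exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(
                    bls_response.error().message()
                )
            out = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0")
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT0", out.as_numpy())]
                )
            )
        return responses
```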
[Question] triton-client numpy 2 support
#7979 opened by john-pixforce - 0
PyTorch backend: Model is run in no_grad mode even with INFERENCE_MODE=false
#7974 opened by hakanardo - 0
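For reference, the libtorch backend reads this switch from config.pbtxt; per the backend docs it selects between torch InferenceMode and NoGrad execution, which would explain gradients staying disabled either way (an assumption worth confirming against the backend source):

```
# Sketch: per the pytorch_backend docs this toggles InferenceMode vs NoGrad
# execution, so gradients appear disabled in both settings.
parameters: {
  key: "INFERENCE_MODE"
  value: { string_value: "false" }
}
```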
TypeError: object of type 'int' has no len()
#7967 opened by ProgramerSalar