Issues
vLLM run script: pre-capture prefill + decode traces to avoid unexpectedly high or stalled TTFT on first completions (see the warm-up sketch after this list)
#56 opened by tstescoTT - 0
Provide example chat template usage (see the chat-template sketch after this list)
#36 opened by tstescoTT - 2
Add handling for RAG context
#9 opened by tstescoTT - 1
Missing `--max_prompt_length` argument when running example_requests_client_alpaca_eval.py (see the argparse sketch after this list)
#51 opened by milank94 - 1
Very slow speed
#40 opened by changh95 - 8
Add status messaging and a status endpoint so client-side users can reason about model initialization and lifecycle (see the polling sketch after this list)
#17 opened by tstescoTT - 3
Simple E2E quick-start.sh
#10 opened by tt-mjudge - 1
Add linting / formatting checks on PRs
#7 opened by tstescoTT - 0
Support for Llama 3.1 8B
#11 opened by tt-mjudge - 3
Llama 3.1 70B T3K inference server: batch_top_pk_logits_efficient worst-case latency ~20 ms (see the generic sampling sketch after this list)
#5 opened by tstescoTT - 2
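
For #56, the goal is to capture prefill and decode traces before the first real request arrives. A client-side warm-up pass is one way to trigger that capture; this is only a sketch, assuming the run script stands up the standard vLLM OpenAI-compatible server, and the URL, model id, and prompt lengths below are assumptions rather than the repo's actual mechanism.

```python
import requests

VLLM_URL = "http://localhost:8000/v1/completions"   # assumed local vLLM server
MODEL = "meta-llama/Llama-3.1-70B-Instruct"          # assumed model id

def warm_up(prompt_lengths=(128, 2048)):
    """Send throwaway completions so prefill/decode traces are captured
    before the first real user request, avoiding the initial TTFT spike."""
    for n in prompt_lengths:
        payload = {
            "model": MODEL,
            "prompt": "hello " * n,   # rough proxy for a prompt of ~n tokens
            "max_tokens": 8,
            "temperature": 0.0,
        }
        resp = requests.post(VLLM_URL, json=payload, timeout=600)
        resp.raise_for_status()

if __name__ == "__main__":
    warm_up()
```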
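For #36, one way to document example chat template usage is via Hugging Face's `tokenizer.apply_chat_template`; the model id here is an assumption, and any chat-tuned checkpoint that ships a chat template would work the same way.

```python
from transformers import AutoTokenizer

# Model id is an assumption; substitute the checkpoint the server actually loads.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is TTFT?"},
]

# Render the conversation into the prompt string the model expects.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```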
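For #51, the report is that `example_requests_client_alpaca_eval.py` is invoked with `--max_prompt_length` but never defines it. A sketch of the missing argparse wiring, where the default value and help text are assumptions:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="alpaca_eval request client")
    # Flag name taken from the issue; default value is an assumption.
    parser.add_argument(
        "--max_prompt_length",
        type=int,
        default=2048,
        help="Truncate prompts to at most this many tokens before sending.",
    )
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"max_prompt_length = {args.max_prompt_length}")
```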
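For #17, a status endpoint lets clients distinguish "weights still loading" from "server down". A sketch of the client side, where the `/status` path and the JSON schema (a `status` field that eventually reads `ready`) are purely hypothetical:

```python
import time
import requests

# Hypothetical endpoint; path and response schema are assumptions.
STATUS_URL = "http://localhost:8000/status"

def wait_until_ready(timeout_s=1800, poll_s=10):
    """Poll the status endpoint until the model reports it is ready to serve."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            state = requests.get(STATUS_URL, timeout=5).json().get("status")
        except requests.RequestException:
            state = "unreachable"
        if state == "ready":
            return True
        print(f"model not ready yet (status={state}), retrying in {poll_s}s")
        time.sleep(poll_s)
    return False
```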
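For #5, the repo's `batch_top_pk_logits_efficient` internals are not reproduced here; as background only, this is a generic batched top-k then top-p sampling step with a crude timing harness, a reference point for what a ~20 ms worst case is measured against, not the actual implementation.

```python
import time
import torch

def top_k_top_p_sample(logits, top_k=40, top_p=0.9, temperature=1.0):
    """Generic per-row top-k then top-p sampling over a batch of logits.
    Illustration only; not the repo's batch_top_pk_logits_efficient."""
    logits = logits / temperature
    # Keep only the top_k logits per row.
    values, indices = torch.topk(logits, top_k, dim=-1)
    probs = torch.softmax(values, dim=-1)
    # Sort descending and drop tokens beyond the top_p cumulative mass.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    remove = cumulative - sorted_probs > top_p   # always keeps the first token
    sorted_probs[remove] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    # Sample within the filtered set, then map back to vocabulary ids.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    picked = sorted_idx.gather(-1, choice)
    return indices.gather(-1, picked)

batch, vocab = 32, 128256
logits = torch.randn(batch, vocab)
start = time.perf_counter()
tokens = top_k_top_p_sample(logits)
print(f"sampled {tuple(tokens.shape)} in {(time.perf_counter() - start) * 1e3:.1f} ms")
```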