Pinned Repositories
truss
The simplest way to serve AI/ML models in production
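Truss packages a model as a Python class with load and predict hooks in model/model.py; a minimal sketch of that interface, assuming the layout `truss init` generates (the pipeline and payload shape here are illustrative):

```python
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Called once when the server starts; load weights here.
        from transformers import pipeline
        self._model = pipeline("sentiment-analysis")

    def predict(self, model_input):
        # Called per request with the deserialized JSON body.
        return self._model(model_input["text"])
```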
flux_fp8
Flux diffusion model implementation that runs its matmuls in quantized fp8 and the remaining layers in half precision with fast accumulation, making it ~2x faster on consumer GPUs.
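The speedup comes from doing the large matmuls in fp8 with per-tensor scales and reduced-precision (fast) accumulation. A hedged sketch of the idea using PyTorch's private `torch._scaled_mm`, not the repo's actual kernels; the API varies across PyTorch releases and needs an fp8-capable GPU (Ada/Hopper):

```python
import torch

x = torch.randn(16, 64, device="cuda")  # activations
w = torch.randn(32, 64, device="cuda")  # linear weight (out_features, in_features)

# Per-tensor scales so values fit fp8's narrow dynamic range.
f8_max = torch.finfo(torch.float8_e4m3fn).max
scale_x = x.abs().max() / f8_max
scale_w = w.abs().max() / f8_max

x_fp8 = (x / scale_x).to(torch.float8_e4m3fn)
w_fp8 = (w / scale_w).to(torch.float8_e4m3fn)

# fp8 x fp8 matmul; fast accumulation trades a little precision for speed.
out = torch._scaled_mm(
    x_fp8,
    w_fp8.t(),  # second operand must be column-major
    scale_a=scale_x,
    scale_b=scale_w,
    out_dtype=torch.bfloat16,
    use_fast_accum=True,
)
```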
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
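Recent releases expose that Python API as a high-level LLM class; a hedged sketch (model name and prompt are illustrative, and the engine is built or downloaded on first use):

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

for output in llm.generate(["What does TensorRT-LLM optimize?"], params):
    print(output.outputs[0].text)
```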
test_70b
truss-examples
Examples of models deployable with Truss
unmanic-documentation
All documentation for Unmanic
unsloth
Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
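The memory saving comes from 4-bit base weights plus LoRA adapters; a hedged sketch of a typical Unsloth QLoRA setup (model name, rank, and sequence length are illustrative):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit base weights: the main memory saving
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# `model` now trains only the small LoRA adapters, e.g. with trl's SFTTrainer.
```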
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
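A hedged sketch of vLLM's offline batch-generation API (model and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```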
dsingal0's Repositories
dsingal0/flux_fp8
Flux diffusion model implementation that runs its matmuls in quantized fp8 and the remaining layers in half precision with fast accumulation, making it ~2x faster on consumer GPUs.
dsingal0/landing_page
Landing page for qdrant.tech
dsingal0/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
dsingal0/test_70b
dsingal0/truss-examples
Examples of models deployable with Truss
dsingal0/unmanic-documentation
All documentation for Unmanic
dsingal0/unsloth
Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
dsingal0/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs