Pinned Repositories
deepsparse-continuous-batching
DeepSparse Continuous Batching
gpu-profiling
GPU Profiling
langchain-gpt
Code generation for the LangChain framework
llm-compressor-example
Example using llm-compressor
marlin-example
Example of quantizing and saving a model with Marlin
mistral-self-rag
Training Mistral on the Self-RAG task
vllm-benchmarking
Benchmarking vLLM
vllm-k8s
Example deploying vLLM on GKE
llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
robertgshaw2-neuralmagic's Repositories
robertgshaw2-neuralmagic/vllm-k8s
Example deploying vLLM on GKE
robertgshaw2-neuralmagic/deepsparse-continuous-batching
DeepSparse Continuous Batching
robertgshaw2-neuralmagic/llm-compressor-example
Example using llm-compressor
robertgshaw2-neuralmagic/marlin-example
Example of quantizing and saving a model with Marlin
robertgshaw2-neuralmagic/mistral-self-rag
Training Mistral on the Self-RAG task
robertgshaw2-neuralmagic/vllm-benchmarking
Benchmarking vLLM
robertgshaw2-neuralmagic/auto-fp8
Making FP8 Checkpoints
robertgshaw2-neuralmagic/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
robertgshaw2-neuralmagic/bert-benchmarking
Repo for benchmarking BERT performance under various scenarios
robertgshaw2-neuralmagic/bert-server-example
DeepSparse Server Running BERT
robertgshaw2-neuralmagic/zephyr-training
Recreating and experimenting with Zephyr
robertgshaw2-neuralmagic/accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
robertgshaw2-neuralmagic/buildkite-ci
robertgshaw2-neuralmagic/chat-example
Example of calling the chat API
robertgshaw2-neuralmagic/deepsparse-llm-server-example
Example of running a DeepSparse LLM in a basic server
robertgshaw2-neuralmagic/FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
robertgshaw2-neuralmagic/gptq-benchmarking
Benchmarking GPTQ performance and exploring how the kernels work
robertgshaw2-neuralmagic/gptq-experiments
Experiments running GPTQ
robertgshaw2-neuralmagic/gptq-serialization-example
Example of GPTQ serialization
robertgshaw2-neuralmagic/lm-evaluation-harness
A framework for few-shot evaluation of language models.
robertgshaw2-neuralmagic/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
robertgshaw2-neuralmagic/nm-vllm-example
Example running nm-vllm
robertgshaw2-neuralmagic/one-shot-mpt-gsm-8k
Experiments applying one-shot compression
robertgshaw2-neuralmagic/sparse-finetuning
Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry
robertgshaw2-neuralmagic/tgi-benchmarking
Benchmarking LLMs on GPUs
robertgshaw2-neuralmagic/transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
robertgshaw2-neuralmagic/viggo-finetuning
Example of fine-tuning an LLM on the ViGGO dataset
robertgshaw2-neuralmagic/vllm-client
Client for benchmarking vLLM
robertgshaw2-neuralmagic/vllm-examples
Example benchmarking vLLM
robertgshaw2-neuralmagic/vllm-qa-basic-correctness
Repo for basic correctness testing of vLLM