mgoin

LLM inference optimization and HPC Engineering Lead @neuralmagic, Committer @vllm-project

@neuralmagic, Boston

Pinned Repositories

advos
RISC-V OS in Rust with hardware support for SiFive's HiFive1 board
Language: Rust 0 2 00
bfc
Language: Brainfuck 3 2 00
cnpy
Single header-only library to read and write Numpy files in C/C++
Language: C++ 1 1 00
learned_indexes
Experiments on ideas proposed in Tim Kraska's "The Case for Learned Index Structures"
Language: Python 3 2 00
MPT-Medical-Chatbot
This is a medical bot built using MPT and Sentence Transformers. The bot is powered by DeepSparse, Langchain, and Chainlit. The bot runs on a decent CPU machine with a minimum of 16GB of RAM.
Language: Python 3 0 01
torch-fp8
Language: Python 1 1 00
torch_bitmask
Implementations for fast bitmask compression for weight sparsity in PyTorch
Language: Python 3 1 00
deepsparse
Sparsity-aware deep learning inference runtime for CPUs
Language: Python 3k 57 138173
sparseml
Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Language: Python 2.1k 48 206148
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Language: Python 30.7k 249 5.3k 4.7k

mgoin's Repositories

mgoin/bfc
Language: Brainfuck 3 2 00
mgoin/learned_indexes
Experiments on ideas proposed in Tim Kraska's "The Case for Learned Index Structures"
Language: Python 3 2 00
mgoin/MPT-Medical-Chatbot
This is a medical bot built using MPT and Sentence Transformers. The bot is powered by DeepSparse, Langchain, and Chainlit. The bot runs on a decent CPU machine with a minimum of 16GB of RAM.
Language: Python 3 0 01
mgoin/torch_bitmask
Implementations for fast bitmask compression for weight sparsity in PyTorch
Language: Python 3 1 00
mgoin/torch-fp8
Language: Python 1 1 00
mgoin/advos
RISC-V OS in Rust with hardware support for SiFive's HiFive1 board
Language: Rust 0 2 00
mgoin/amsterdam-demo
Language: Python
mgoin/AutoGPTQ
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
Language: Python 0 0
mgoin/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
mgoin/clip-retrieval
Easily compute clip embeddings and build a clip retrieval system with them
Language: Jupyter Notebook 0 0
mgoin/dev_env
Holds dotfiles, scripts, and notes to quickly construct my preferred development environment.
Language: Shell 2 0
mgoin/flash-attention
Fast and memory-efficient exact attention
Language: Python 0 0
mgoin/hf_model_stats
Language: Python 1 0
mgoin/huggingface.js
Utilities to use the Hugging Face Hub API
mgoin/inference
Reference implementations of MLPerf™ inference benchmarks
Language: Python 0 0
mgoin/langchain
⚡ Building applications with LLMs through composability ⚡
Language: Python
mgoin/llama-cpp-python
Python bindings for llama.cpp
Language: Python 0 0
mgoin/llmgoin
Language: Python 2 0
mgoin/llmperf
LLMPerf is a library for validating and benchmarking LLMs
Language: Python 0 0
mgoin/lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
Language: Python 0 0
mgoin/mgoin.github.io
Language: HTML 2 0
mgoin/mistral-evals
mgoin/mteb
MTEB: Massive Text Embedding Benchmark
Language: Python 0 0
mgoin/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
mgoin/rol
Game of Life implemented in Rust
Language: Rust 2 0
mgoin/sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
Language: Python
mgoin/tinystories-sparsify
1 0
mgoin/transformers
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
Language: Python 1 0
mgoin/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Language: Python 0 0
mgoin/webgl_signed_distance_fields
Language: JavaScript 1 0