mgoin
LLM inference optimization and HPC Engineering Lead @neuralmagic, Committer @vllm-project
@neuralmagic, Boston
Pinned Repositories
advos
RISC-V OS in Rust with hardware support for SiFive's HiFive1 board
bfc
cnpy
Single header-only library to read and write Numpy files in C/C++
learned_indexes
Experiments on ideas proposed in Tim Kraska's "The Case for Learned Index Structures"
MPT-Medical-Chatbot
This is a medical bot built using MPT and Sentence Transformers. The bot is powered by DeepSparse, LangChain, and Chainlit. It runs on a reasonably capable CPU-only machine with at least 16 GB of RAM.
torch-fp8
torch_bitmask
Implementations for fast bitmask compression for weight sparsity in PyTorch
deepsparse
Sparsity-aware deep learning inference runtime for CPUs
sparseml
Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
mgoin's Repositories
mgoin/bfc
mgoin/learned_indexes
Experiments on ideas proposed in Tim Kraska's "The Case for Learned Index Structures"
mgoin/MPT-Medical-Chatbot
This is a medical bot built using MPT and Sentence Transformers. The bot is powered by DeepSparse, LangChain, and Chainlit. It runs on a reasonably capable CPU-only machine with at least 16 GB of RAM.
mgoin/torch_bitmask
Implementations for fast bitmask compression for weight sparsity in PyTorch
mgoin/torch-fp8
mgoin/advos
RISC-V OS in Rust with hardware support for SiFive's HiFive1 board
mgoin/amsterdam-demo
mgoin/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
mgoin/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
mgoin/clip-retrieval
Easily compute CLIP embeddings and build a CLIP retrieval system with them
mgoin/dev_env
Holds dotfiles, scripts, and notes to quickly construct my preferred development environment.
mgoin/flash-attention
Fast and memory-efficient exact attention
mgoin/hf_model_stats
mgoin/huggingface.js
Utilities to use the Hugging Face Hub API
mgoin/inference
Reference implementations of MLPerf™ inference benchmarks
mgoin/langchain
⚡ Building applications with LLMs through composability ⚡
mgoin/llama-cpp-python
Python bindings for llama.cpp
mgoin/llmgoin
mgoin/llmperf
LLMPerf is a library for validating and benchmarking LLMs
mgoin/lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
mgoin/mgoin.github.io
mgoin/mistral-evals
mgoin/mteb
MTEB: Massive Text Embedding Benchmark
mgoin/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
mgoin/rol
Game of Life implemented in Rust
mgoin/sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
mgoin/tinystories-sparsify
mgoin/transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.
mgoin/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
mgoin/webgl_signed_distance_fields