Qubitium
Golang, Python, Kotlin. GPTQModel maintainer and OSS contributor to SGLang, vLLM, and others. @ModelCloudAi founder
ModelCloud.ai · Earth/Epoch 2.0
Pinned Repositories
Device-SMI
Self-contained Python lib with zero dependencies that gives you unified device properties for GPU, CPU, and NPU. No more calling separate tools such as nvidia-smi or reading /proc/cpuinfo and parsing the output yourself.
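A minimal usage sketch of the idea: one device handle per hardware target, with properties already parsed instead of raw tool output. The exact attribute names (model, memory_total) are assumptions; verify against the repo README.

    # Sketch of Device-SMI-style usage; attribute names are assumptions.
    from device_smi import Device

    gpu = Device("cuda:0")   # also e.g. Device("cpu") on supported hosts
    print(gpu.model)         # device model string, no nvidia-smi parsing needed
    print(gpu.memory_total)  # total device memory, returned as a number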
GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for NVIDIA CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
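The typical flow is load, quantize against a small calibration set, then save the packed model for accelerated inference. A sketch under the assumption that the toolkit exposes GPTQModel.load, QuantizeConfig, quantize, and save as in its README; treat the names as assumptions to check.

    # Sketch of a GPTQ 4-bit quantization pass; API names assumed from the README.
    from gptqmodel import GPTQModel, QuantizeConfig

    # A few raw text samples stand in for a real calibration set.
    calibration_dataset = ["gptqmodel is an llm quantization toolkit."]
    config = QuantizeConfig(bits=4, group_size=128)

    model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", config)
    model.quantize(calibration_dataset)          # runs the GPTQ calibration pass
    model.save("Llama-3.2-1B-Instruct-gptq-4bit")  # packed weights for inference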
LogBar
A unified Logger and ProgressBar utility with zero dependencies.
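The point of unifying the two is that log lines and the progress bar share one renderer, so logging mid-loop does not tear the bar. A hypothetical sketch; the shared() constructor and pb() iterable wrapper are assumptions, check the LogBar README.

    # Hypothetical LogBar usage; API names are assumptions.
    from logbar import LogBar

    log = LogBar.shared()
    for item in log.pb(range(100)):     # progress bar over any iterable
        log.info(f"processing {item}")  # log output renders cleanly above the bar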
Tokenicer
A (nicer) tokenizer you want to use for model inference and training, with all known preventable gotchas normalized or auto-fixed.
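The wrapper loads the underlying HF tokenizer and normalizes the usual foot-guns (such as a missing pad_token) at load time. A sketch; Tokenicer.load and the pass-through attribute access are assumptions to verify against the README.

    # Sketch of Tokenicer usage; method names are assumptions.
    from tokenicer import Tokenicer

    tokenizer = Tokenicer.load("Qwen/Qwen2.5-0.5B-Instruct")
    print(tokenizer.pad_token)  # auto-fixed if the base tokenizer left it unset
    ids = tokenizer("Hello world")["input_ids"]  # delegates to the HF tokenizer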
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
femtozip
list
Do you want a 9 KB cross-browser native JavaScript library that makes your plain HTML lists super flexible, searchable, sortable and filterable? Yeah! Do you also want the possibility to add, edit and remove items by dead simple templating? Hell yeah!
php-cityhash
PHP extension for Google's ultra-fast CityHash library.
sglang
SGLang is a fast serving framework for large language models and vision language models.
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Qubitium's Repositories
Qubitium/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Qubitium/alpaca-lora
Instruct-tune LLaMA on consumer hardware
Qubitium/flash-attention
Fast and memory-efficient exact attention
Qubitium/flashinfer
FlashInfer: Kernel Library for LLM Serving
Qubitium/gemma_pytorch
The official PyTorch implementation of Google's Gemma models
Qubitium/lm-format-enforcer
Enforce the output format (JSON Schema, Regex etc) of a language model
Qubitium/sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Qubitium/accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
Qubitium/auto-round
SOTA Weight-only Quantization Algorithm for LLMs
Qubitium/AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
Qubitium/BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Qubitium/clod-code
rot13 version of claw code
Qubitium/datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Qubitium/duskpilot-c3-clone
Qubitium/ethos-paper
Qubitium/evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Qubitium/GPTQ-for-LLaMa
4 bits quantization of LLaMa using GPTQ
Qubitium/GPTQ-triton
GPTQ inference Triton kernel
Qubitium/GPTQModel
Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
Qubitium/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
Qubitium/hyperDB
A hyper-fast local vector database for use with LLM Agents. Now accepting SAFEs at $35M cap.
Qubitium/llama.cpp
Port of Facebook's LLaMA model in C/C++
Qubitium/mav
model activation visualiser
Qubitium/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Qubitium/qlora
QLoRA: Efficient Finetuning of Quantized LLMs
Qubitium/QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
Qubitium/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Qubitium/transformers
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Qubitium/unsloth
5X faster, 60% less memory QLoRA finetuning
Qubitium/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs