Pinned Repositories
gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
OBC
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
PanzaMail
qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
QUIK
Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference.
Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
SparseFinetuning
Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llm-foundry.
sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
WoodFisher
Code accompanying the NeurIPS 2020 paper "WoodFisher: Efficient Second-Order Approximation for Neural Network Compression" (Singh & Alistarh, 2020).
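Several of the pinned projects (GPTQ, Marlin, QUIK) center on 4-bit weight quantization for LLM inference. As a rough illustration of the underlying idea only, not any of these repositories' actual code, a minimal symmetric round-to-nearest INT4 quantize/dequantize round trip could be sketched as follows (GPTQ itself goes further, using second-order information to compensate rounding error):

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Symmetric round-to-nearest 4-bit quantization, per group.

    Generic sketch of grouped INT4 weight quantization; the group_size
    and scaling scheme here are illustrative assumptions, not the
    GPTQ/Marlin implementation.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the largest magnitude to the INT4 range [-8, 7]
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # Recover an FP32 approximation of the original weights
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Round-to-nearest bounds the per-weight error by half a quantization step
err = np.abs(w - w_hat).max()
```

In a real kernel such as Marlin, the INT4 values would additionally be packed eight-per-int32 and dequantized on the fly inside the GEMM; this sketch only shows the numerics.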
IST Austria Distributed Algorithms and Systems Lab's Repositories
IST-DASLab/gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
IST-DASLab/sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
IST-DASLab/qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
IST-DASLab/PanzaMail
IST-DASLab/QUIK
Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference.
IST-DASLab/OBC
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
IST-DASLab/Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
IST-DASLab/SparseFinetuning
Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llm-foundry.
IST-DASLab/RoSA
IST-DASLab/QIGen
Repository for CPU Kernel Generation for LLM Inference
IST-DASLab/spdy
Code for ICML 2022 paper "SPDY: Accurate Pruning with Speedup Guarantees"
IST-DASLab/peft-rosa
A fork of the PEFT library, supporting Robust Adaptation (RoSA)
IST-DASLab/sparseprop
IST-DASLab/MicroAdam
This repository contains code for the MicroAdam paper.
IST-DASLab/CrAM
Code for reproducing the results from "CrAM: A Compression-Aware Minimizer" accepted at ICLR 2023
IST-DASLab/spops
IST-DASLab/ISTA-DASLab-Optimizers
IST-DASLab/Mathador-LM
Code for the paper "Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on LLMs".
IST-DASLab/CAP
Repository for the Correlation-Aware Pruning (CAP, NeurIPS 2023) source and experimental code
IST-DASLab/EFCP
The repository contains code to reproduce the experiments from our paper "Error Feedback Can Accurately Compress Preconditioners".
IST-DASLab/pruned-vision-model-bias
Code for reproducing the paper "Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures"
IST-DASLab/TACO4NLP
Task-aware compression for various NLP tasks
IST-DASLab/KDVR
Code for the experiments in "Knowledge Distillation Performs Partial Variance Reduction" (NeurIPS 2023)
IST-DASLab/ZipLM
Code for the NeurIPS 2023 paper: "ZipLM: Inference-Aware Structured Pruning of Language Models".
IST-DASLab/AutoGPTQRoSA
IST-DASLab/FastOBQ-
GPTQ with fine-tuning
IST-DASLab/GridSearcher
GridSearcher simplifies running grid searches for machine learning projects in Python, emphasizing parallel execution and GPU scheduling without dependencies on SLURM or other workload managers.
IST-DASLab/llm-foundry
LLM training code for Databricks foundation models
IST-DASLab/SPADE
Code for "SPADE: Sparsity-Guided Debugging for Deep Neural Networks"
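Sparse-Marlin and SparseGPT both rely on the 2:4 structured sparsity pattern accelerated by NVIDIA tensor cores: in every contiguous group of four weights, at most two are nonzero. A hedged, illustrative sketch of enforcing this pattern by simple magnitude pruning (real pipelines like SparseGPT select the mask far more carefully, and the kernels operate on packed representations):

```python
import numpy as np

def prune_2_4(w):
    """Zero the two smallest-magnitude weights in every group of four.

    Illustrative magnitude-based 2:4 pruning sketch; not taken from the
    Sparse-Marlin or SparseGPT repositories.
    """
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries per group of four
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.arange(1, 9, dtype=np.float32)   # [1, 2, 3, 4, 5, 6, 7, 8]
w_sp = prune_2_4(w)                     # keeps the 2 largest per group of 4
```

For the example above, each group of four retains its two largest entries, giving [0, 0, 3, 4, 0, 0, 7, 8]; the resulting 50% sparsity is what the 2:4-aware kernels exploit for additional speedup on top of 4-bit quantization.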