Pinned Repositories
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
accelerated-pytorch-transformers-generation
Anki-Furigana-Creator
An add-on for Anki to generate furigana on demand during Japanese vocabulary card creation
bettertransformer_demo
An end-to-end Gradio demo of the BetterTransformer integration with 🤗 Transformers, using TorchServe or HF's Inference Endpoints
directvoxgo-mareva
Easy custom datasets and visualization of newly synthesized views from DirectVoxGO
efficient-attention-benchmark
Benchmarking PyTorch eager attention against torch.nn.functional.scaled_dot_product_attention and the HazyResearch FlashAttention implementation
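For reference, a minimal sketch of what such a timing comparison might look like (the shapes, dtype, and harness below are assumptions, not taken from the repository):

```python
import math
import torch
import torch.nn.functional as F

def eager_attention(q, k, v):
    # Naive attention: materializes the full (seq, seq) score matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

# Assumed shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for name, fn in [("eager", eager_attention), ("sdpa", F.scaled_dot_product_attention)]:
    for _ in range(3):  # warmup
        fn(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        fn(q, k, v)
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end) / 10:.3f} ms/iter")
```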
gpu-gemm-hierarchy
A description of a simple GEMM hierarchy on NVIDIA GPUs, as used in CUTLASS
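As a rough illustration of the idea, a conceptual NumPy sketch with made-up tile sizes (the real CUTLASS hierarchy maps tiles onto threadblocks, warps, and tensor-core fragments rather than Python loops):

```python
import numpy as np

def tiled_gemm(A, B, block_m=128, block_n=128, block_k=32):
    # Outer (i, j) tiles play the role of CUDA threadblock tiles;
    # the k-slices mimic staging operand blocks through shared memory.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block_m):
        for j in range(0, N, block_n):
            for kk in range(0, K, block_k):
                C[i:i+block_m, j:j+block_n] += (
                    A[i:i+block_m, kk:kk+block_k] @ B[kk:kk+block_k, j:j+block_n]
                )
    return C

A = np.random.rand(256, 512).astype(np.float32)
B = np.random.rand(512, 384).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```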
q4f16-gemm-gemv-benchmark
rikai-mpv
A port of the Rikaichamp Japanese dictionary and parser to the mpv video player
optimum
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
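As a quick taste of Optimum's ONNX Runtime integration (the model checkpoint here is illustrative):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Optimum makes hardware acceleration easy.", return_tensors="pt")
print(model(**inputs).logits)
```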
fxmarty's Repositories
fxmarty/accelerated-pytorch-transformers-generation
fxmarty/flash-attention-rocm
Fast and memory-efficient exact attention
fxmarty/hgemm_vs_gemmex
fxmarty/pyrsmi
A Python package for rocm-smi-lib
fxmarty/transformers-regression-test
fxmarty/vllm-public
A high-throughput and memory-efficient inference and serving engine for LLMs
fxmarty/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
fxmarty/autogptq-test
fxmarty/bench-flash
fxmarty/dummy-repo
fxmarty/exllama-kernels
q4f16 kernel extracted from exllama
fxmarty/exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
fxmarty/marlin
An FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
fxmarty/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
fxmarty/neural-compressor
Provides unified APIs for SOTA model compression techniques, such as low-precision (INT8/INT4/FP4/NF4) quantization, sparsity, pruning, and knowledge distillation, on mainstream AI frameworks such as TensorFlow, PyTorch, and ONNX Runtime.
fxmarty/onnxruntime
ONNX Runtime: a cross-platform, high-performance ML inferencing and training accelerator
fxmarty/optimum
🏎️ Accelerate training and inference of 🤗 Transformers with easy-to-use hardware optimization tools
fxmarty/optimum-benchmark
A repository for benchmarking HF Optimum's optimizations for inference and training.
fxmarty/optimum-nvidia
fxmarty/optimum-quanto
A PyTorch quantization backend for Optimum
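A minimal sketch of weight-only int8 quantization with quanto (the model choice is illustrative):

```python
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint8

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Swap eligible Linear layers for quantization-aware equivalents,
# then freeze to materialize the int8 weights in place.
quantize(model, weights=qint8)
freeze(model)
```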
fxmarty/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
fxmarty/rocm-vllm
fxmarty/safetensors
Simple, safe way to store and distribute tensors
fxmarty/sentence-transformers
Multilingual Sentence & Image Embeddings with BERT
fxmarty/test-github-actions-environments
fxmarty/text-embeddings-inference
A blazing-fast inference solution for text embedding models
fxmarty/text-generation-inference
Large Language Model Text Generation Inference
fxmarty/torch_library_playground
fxmarty/transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
fxmarty/transformers-hard-fork