Pinned Repositories
GenAIComps
GenAI components at the micro-service level; a GenAI service composer to create mega-services.
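A minimal sketch of the micro-to-mega composition idea. The class names follow the `MicroService`/`ServiceOrchestrator` pattern used in the project's examples, but treat the exact API as an assumption to verify; hosts, ports, and endpoints are illustrative placeholders:

```python
# Sketch only: MicroService, ServiceOrchestrator, and ServiceType follow
# GenAIComps' documented example pattern; check against the current API.
# All hosts, ports, and endpoints below are made-up placeholders.
from comps import MicroService, ServiceOrchestrator, ServiceType

embedding = MicroService(
    name="embedding", host="0.0.0.0", port=6000,
    endpoint="/v1/embeddings", use_remote_service=True,
    service_type=ServiceType.EMBEDDING,
)
llm = MicroService(
    name="llm", host="0.0.0.0", port=9000,
    endpoint="/v1/chat/completions", use_remote_service=True,
    service_type=ServiceType.LLM,
)

# Compose the two micro-services into one mega-service pipeline:
# requests flow from the embedding service into the LLM service.
megaservice = ServiceOrchestrator()
megaservice.add(embedding).add(llm)
megaservice.flow_to(embedding, llm)
```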
ipex-llm
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
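A minimal sketch of 4-bit local inference with ipex-llm's transformers-style API (the `load_in_4bit` path is documented upstream); the model id, prompt, and generation settings are illustrative:

```python
# ipex-llm swaps in a drop-in AutoModelForCausalLM that quantizes the
# weights to 4-bit on load, for fast local inference on Intel CPU/GPU.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint, any HF causal LM works
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is attention in transformers?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```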
neural-compressor
Intel® Neural Compressor (formerly Intel® Low Precision Optimization Tool) provides unified APIs for model compression techniques, including SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4), sparsity, pruning, and knowledge distillation, across deep learning frameworks such as TensorFlow, PyTorch, and ONNX Runtime, targeting optimal inference performance.
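A hedged sketch of post-training static quantization, assuming the 2.x `fit` API (the 3.x API differs); the ResNet model and random calibration data are toy placeholders:

```python
# Sketch of INT8 post-training static quantization with the 2.x API.
# A real run would use a representative calibration set, not random data.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

model = resnet18(weights=None).eval()          # toy FP32 model
calib = DataLoader(                            # toy calibration data
    TensorDataset(torch.randn(8, 3, 224, 224),
                  torch.zeros(8, dtype=torch.long)),
    batch_size=4,
)

conf = PostTrainingQuantConfig(approach="static")
q_model = fit(model=model, conf=conf, calib_dataloader=calib)
q_model.save("./int8-resnet18")
```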
onnxruntime
ONNX Runtime: a cross-platform, high-performance ML inference and training accelerator.
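A minimal inference pass with the Python bindings; `model.onnx` and the input shape are placeholders for your exported model:

```python
import numpy as np
import onnxruntime as ort

# Load an exported ONNX model and run a single inference pass on CPU.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative shape
outputs = sess.run(None, {input_name: dummy})  # None = return all outputs
print(outputs[0].shape)
```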
streaming-llm
Efficient Streaming Language Models with Attention Sinks
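The core idea is that keeping the first few "attention sink" tokens plus a sliding window of recent tokens keeps generation stable at unbounded lengths. A plain-Python sketch of that KV-cache eviction policy (a conceptual illustration, not the repository's implementation; the sink and window sizes are illustrative):

```python
# Attention-sink eviction: retain the first num_sinks entries and the
# most recent `window` entries, dropping everything in between.
def evict_kv(cache, num_sinks=4, window=1020):
    """cache: list of per-token KV entries, oldest first."""
    if len(cache) <= num_sinks + window:
        return cache
    return cache[:num_sinks] + cache[-window:]

cache = list(range(2000))                  # stand-in for 2000 cached tokens
cache = evict_kv(cache)
print(len(cache), cache[:4], cache[-3:])   # 1024 [0, 1, 2, 3] [1997, 1998, 1999]
```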
transformers
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
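A one-liner with the `pipeline` API; the task and checkpoint are just a common example:

```python
from transformers import pipeline

# pipeline() bundles tokenizer + model + postprocessing for a given task.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Attention sinks make streaming inference stable."))
```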
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
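A minimal offline-batching sketch with the `LLM` entry point; the model id, prompt, and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# Offline batched generation: vLLM schedules requests over a paged KV
# cache for high throughput; a small model keeps the example cheap.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)
```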
fmt
A modern formatting library
spdlog
Fast C++ logging library.
a32543254's Repositories
a32543254/GenAIComps
a32543254/ipex-llm
a32543254/neural-compressor
a32543254/onnxruntime
a32543254/streaming-llm
a32543254/transformers
a32543254/vllm