vision-transformer
There are 1,276 repositories under the vision-transformer topic.
open-mmlab/mmdetection
OpenMMLab Detection Toolbox and Benchmark
lukas-blecher/LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
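A minimal usage sketch based on the pix2tex README; assumes the package is installed via `pip install pix2tex`, and the image filename is hypothetical:

```python
from PIL import Image
from pix2tex.cli import LatexOCR

# Downloads the ViT checkpoint on first use
model = LatexOCR()

# "equation.png" is a hypothetical screenshot of a rendered formula
img = Image.open("equation.png")

# Returns the predicted LaTeX source as a string
print(model(img))
```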
NielsRogge/Transformers-Tutorials
This repository contains demos I made with the Transformers library by HuggingFace.
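Many of the notebooks build on the pattern below; a minimal sketch of ViT image classification with the HuggingFace Transformers library, using the public google/vit-base-patch16-224 checkpoint (the sample image URL is illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# ViT-Base/16 fine-tuned on ImageNet-1k
ckpt = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(ckpt)
model = ViTForImageClassification.from_pretrained(ckpt)

# Illustrative sample image (two cats, from COCO)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resize + normalize; the model patchifies internally
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the top logit back to an ImageNet label
print(model.config.id2label[logits.argmax(-1).item()])
```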
FoundationVision/VAR
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
JingyunLiang/SwinIR
SwinIR: Image Restoration Using Swin Transformer (official repository)
cmhungsteve/Awesome-Transformer-Attention
A comprehensive paper list of Vision Transformer/attention work, including papers, code, and related websites
huawei-noah/Efficient-AI-Backbones
Efficient AI backbones including GhostNet, TNT, and MLP models, developed by Huawei Noah's Ark Lab.
open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
google-research/scenic
Scenic: A Jax Library for Computer Vision Research and Beyond
towhee-io/towhee
Towhee is a framework dedicated to making neural data processing pipelines simple and fast.
mit-han-lab/efficientvit
Efficient vision foundation models for high-resolution generation and perception.
InternLM/InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
baaivision/EVA
EVA Series: Visual Representation Fantasies from BAAI
OpenGVLab/InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
hila-chefer/Transformer-Explainability
[CVPR 2021] Official PyTorch implementation for Transformer Interpretability Beyond Attention Visualization, a novel method to visualize classifications by Transformer based networks.
alibaba/EasyCV
An all-in-one toolkit for computer vision
microsoft/Cream
This is a collection of our NAS and Vision Transformer work.
ViTAE-Transformer/ViTPose
The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
NVlabs/MambaVision
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Blaizzy/mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
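A sketch following the MLX-VLM README; the `load`/`generate`/`apply_chat_template` calls and the quantized model id reflect that documentation and may differ across versions:

```python
# Based on the MLX-VLM README; exact API may differ between versions.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # assumed model id
model, processor = load(model_path)
config = load_config(model_path)

images = ["cat.png"]  # hypothetical local image path
prompt = "Describe this image."

# Wrap the prompt in the model's chat template
formatted = apply_chat_template(processor, config, prompt, num_images=len(images))

output = generate(model, processor, formatted, images, verbose=False)
print(output)
```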
MCG-NJU/VideoMAE
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
JingyunLiang/VRT
VRT: A Video Restoration Transformer (official repository)
czczup/ViT-Adapter
[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡
pprp/awesome-attention-mechanism-in-cv
Awesome List of Attention Modules and Plug&Play Modules in Computer Vision
yitu-opensource/T2T-ViT
[ICCV 2021] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
NVlabs/VoxFormer
Official PyTorch implementation of VoxFormer [CVPR 2023 Highlight]
uncbiag/Awesome-Foundation-Models
A curated list of foundation models for vision and language tasks
OFA-Sys/ONE-PEACE
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
jacobgil/vit-explain
Explainability for Vision Transformers
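One of the methods implemented here is attention rollout; the sketch below shows the idea itself in plain PyTorch (not the repo's API), with shapes chosen for illustration rather than taken from any particular model:

```python
import torch

def attention_rollout(attentions):
    """Attention rollout (Abnar & Zuidema, 2020): recursively multiply
    per-layer attention maps, adding identity for residual connections.

    attentions: list of tensors, each (num_heads, tokens, tokens)
    returns: (tokens, tokens) rollout matrix; row 0 gives the CLS token's
    effective attention over all input tokens.
    """
    result = None
    for attn in attentions:
        # Average over heads, then account for the residual branch
        attn = attn.mean(dim=0)
        attn = attn + torch.eye(attn.size(-1))
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        result = attn if result is None else attn @ result
    return result

# Demo with random "attention" maps: 12 layers, 3 heads, 197 tokens
layers = [torch.softmax(torch.randn(3, 197, 197), dim=-1) for _ in range(12)]
rollout = attention_rollout(layers)
cls_to_patches = rollout[0, 1:]  # CLS attention over the 196 patch tokens
print(cls_to_patches.shape)  # torch.Size([196])
```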
WangLibo1995/GeoSeg
UNetFormer: a UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery (ISPRS). Also includes other vision transformers and CNNs for satellite, aerial, and UAV image segmentation.
sithu31296/semantic-segmentation
SOTA Semantic Segmentation Models in PyTorch
LeapLabTHU/DAT
Repository of Vision Transformer with Deformable Attention (CVPR 2022) and DAT++: Spatially Dynamic Vision Transformer with Deformable Attention
hustvl/YOLOS
[NeurIPS 2021] You Only Look at One Sequence
NVlabs/FasterViT
[ICLR 2024] Official PyTorch implementation of FasterViT: Fast Vision Transformers with Hierarchical Attention