This repository is a paper digest of Transformer-related approaches in visual tracking tasks. Currently, tasks in this repository include Unified Tracking (UT), Single Object Tracking (SOT) and 3D Single Object Tracking (3DSOT). Note that some trackers involving a Non-Local attention mechanism are also collected. Papers are listed in alphabetical order of the first character.
Note: I find it hard to keep track of all tasks related to tracking, including Video Object Segmentation (VOS), Multiple Object Tracking (MOT), Video Instance Segmentation (VIS), Video Object Detection (VOD) and Object Re-Identification (ReID). Hence, I discarded all other tracking tasks in a previous update. If you are interested, you can find plenty of collections in this archived version. Besides, the most recent trend shows that different tracking tasks are converging toward unified approaches.
- GRM (Generalized Relation Modeling for Transformer Tracking) [paper] [code]
- AiATrack (AiATrack: Attention in Attention for Transformer Visual Tracking) [paper] [code]
- (Talk) Discriminative Appearance-Based Tracking and Segmentation [video], Deep Visual Reasoning with Optimization-Based Network Modules [video]
- (Survey) Visual Object Tracking with Discriminative Filters and Siamese Networks: A Survey and Outlook [paper], Transformers in Single Object Tracking: An Experimental Survey [paper]
- (Library) PyTracking: Visual Tracking Library Based on PyTorch [code]
- (People) Martin Danelljan@ETH [web], Bin Yan@DLUT [web]
- Benefit from pre-trained vision Transformer models.
- Free from randomly initialized correlation modules.
- More discriminative target-specific feature extraction.
- Much faster inference and training convergence speed.
- Simple and generic one-branch tracking framework.
- 1st step 🐾 feature interaction inside the backbone.
- 2nd step 🐾 concatenation-based feature interaction.
  - STARK [ICCV'21], SwinTrack [NeurIPS'22]
- 3rd step 🐾 joint feature extraction and interaction.
- 4th step 🐾 generalized feature interaction and relation modeling.
  - GRM [CVPR'23]
- OmniTracker (OmniTracker: Unifying Object Tracking by Tracking-with-Detection) [paper] [code]
- UNINEXT (Universal Instance Perception as Object Discovery and Retrieval) [paper] [code]
- SAM-Track (Segment and Track Anything) [paper] [code]
- TAM (Track Anything: Segment Anything Meets Videos) [paper] [code]
- ARKitTrack (ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data) [paper] [code]
- DropTrack (DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks) [paper] [code]
- GRM (Generalized Relation Modeling for Transformer Tracking) [paper] [code]
- JointNLT (Joint Visual Grounding and Tracking with Natural Language Specification) [paper] [code]
- SeqTrack (SeqTrack: Sequence to Sequence Learning for Visual Object Tracking) [paper] [code]
- SwinV2 (Revealing the Dark Secrets of Masked Image Modeling) [paper] [code]
- VideoTrack (VideoTrack: Learning to Track Objects via Video Transformer) [paper] [code]
- ViPT (Visual Prompt Multi-Modal Tracking) [paper] [code]
- CTTrack (Compact Transformer Tracker with Correlative Masked Modeling) [paper] [code]
- GdaTFT (Global Dilated Attention and Target Focusing Network for Robust Tracking) [paper] [code]
- TATrack (Target-Aware Tracking with Long-term Context Attention) [paper] [code]
- ClimRT (Continuity-Aware Latent Interframe Information Mining for Reliable UAV Tracking) [paper] [code]
- SGDViT (SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking) [paper] [code]
- CDT (Cascaded Denoising Transformer for UAV Nighttime Tracking) [paper] [code]
- FDNT (End-to-End Feature Decontaminated Network for UAV Tracking) [paper] [code]
- ScaleAwareDA (Scale-Aware Domain Adaptation for Robust UAV Tracking) [paper] [code]
- TRTrack (Boosting UAV Tracking With Voxel-Based Trajectory-Aware Pre-Training) [paper] [code]
- AMST2 (AMST2: Aggregated Multi-Level Spatial and Temporal Context-Based Transformer for Robust Aerial Tracking) [paper] [code]
- FAT (Transformer Tracker Based on Multi-level Residual Perception Structure) [paper] [code]
- MACFT (RGB-T Tracking Based on Mixed Attention) [paper] [code]
- MixViT (MixFormer: End-to-End Tracking with Iterative Mixed Attention) [paper] [code]
- ProFormer (RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning) [paper] [code]
- CSWinTT (Transformer Tracking with Cyclic Shifting Window Attention) [paper] [code]
- GTELT (Global Tracking via Ensemble of Local Trackers) [paper] [code]
- MixFormer (MixFormer: End-to-End Tracking with Iterative Mixed Attention) [paper] [code]
- RBO (Ranking-Based Siamese Visual Tracking) [paper] [code]
- SBT (Correlation-Aware Deep Tracking) [paper] [code]
- STNet (Spiking Transformers for Event-Based Single Object Tracking) [paper] [code]
- TCTrack (TCTrack: Temporal Contexts for Aerial Tracking) [paper] [code]
- ToMP (Transforming Model Prediction for Tracking) [paper] [code]
- UDAT (Unsupervised Domain Adaptation for Nighttime Aerial Tracking) [paper] [code]
- SwinTrack (SwinTrack: A Simple and Strong Baseline for Transformer Tracking) [paper&review] [code]
- AiATrack (AiATrack: Attention in Attention for Transformer Visual Tracking) [paper] [code]
- CIA (Hierarchical Feature Embedding for Visual Tracking) [paper] [code]
- DMTracker (Learning Dual-Fused Modality-Aware Representations for RGBD Tracking) [paper] [code]
- HCAT (Efficient Visual Tracking via Hierarchical Cross-Attention Transformer) [paper] [code]
- OSTrack (Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework) [paper] [code]
- SimTrack (Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking) [paper] [code]
- VOT2022 (The Tenth Visual Object Tracking VOT2022 Challenge Results) [paper] [code]
- InMo (Learning Target-Aware Representation for Visual Tracking via Informative Interactions) [paper] [code]
- SparseTT (SparseTT: Visual Tracking with Sparse Transformers) [paper] [code]
- TAT (Temporal-Aware Siamese Tracker: Integrate Temporal Context for 3D Object Tracking) [paper] [code]
- HighlightNet (HighlightNet: Highlighting Low-Light Potential Features for Real-Time UAV Tracking) [paper] [code]
- LPAT (Local Perception-Aware Transformer for Aerial Tracking) [paper] [code]
- SiamSA (Siamese Object Tracking for Vision-Based UAM Approaching with Pairwise Scale-Channel Attention) [paper] [code]
- CEUTrack (Revisiting Color-Event based Tracking: A Unified Network, Dataset, and Metric) [paper] [code]
- FDT (Feature-Distilled Transformer for UAV Tracking) [paper] [code]
- ProContEXT (ProContEXT: Exploring Progressive Context Transformer for Tracking) [paper] [code]
- RAMAVT (On Deep Recurrent Reinforcement Learning for Active Visual Tracking of Space Noncooperative Objects) [paper] [code]
- SFTransT (Learning Spatial-Frequency Transformer for Visual Object Tracking) [paper] [code]
- SiamLA (Learning Localization-Aware Target Confidence for Siamese Visual Tracking) [paper] [code]
- SPT (RGBD1K: A Large-scale Dataset and Benchmark for RGB-D Object Tracking) [paper] [code]
- TaMOs (Beyond SOT: It's Time to Track Multiple Generic Objects at Once) [paper] [code]
- SiamGAT (Graph Attention Tracking) [paper] [code]
- STMTrack (STMTrack: Template-Free Visual Tracking with Space-Time Memory Networks) [paper] [code]
- TMT (Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking) [paper] [code]
- TransT (Transformer Tracking) [paper] [code]
- AutoMatch (Learn to Match: Automatic Matching Network Design for Visual Tracking) [paper] [code]
- DTT (High-Performance Discriminative Tracking with Transformers) [paper] [code]
- DualTFR (Learning Tracking Representations via Dual-Branch Fully Transformer Networks) [paper] [code]
- HiFT (HiFT: Hierarchical Feature Transformer for Aerial Tracking) [paper] [code]
- SAMN (Learning Spatio-Appearance Memory Network for High-Performance Visual Tracking) [paper] [code]
- STARK (Learning Spatio-Temporal Transformer for Visual Tracking) [paper] [code]
- TransT-M (High-Performance Transformer Tracking) [paper] [code]
- VOT2021 (The Ninth Visual Object Tracking VOT2021 Challenge Results) [paper] [code]
- TAPL (TAPL: Dynamic Part-Based Visual Tracking via Attention-Guided Part Localization) [paper] [code]
- TREG (Target Transformed Regression for Accurate Tracking) [paper] [code]
- TrTr (TrTr: Visual Tracking with Transformer) [paper] [code]
- VisEvent (VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows) [paper] [code]
- VTT (VTT: Long-Term Visual Tracking with Transformers) [paper] [code]
- GLT-T (GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds) [paper] [code]
- OSP2B (OSP2B: One-Stage Point-to-Box Network for 3D Siamese Tracking) [paper] [code]
- GLT-T++ (GLT-T++: Global-Local Transformer for 3D Siamese Tracking with Ranking Loss) [paper] [code]
- MBPTrack (MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors) [paper] [code]
- MMF-Track (Multi-Modal Multi-Level Fusion for 3D Single Object Tracking) [paper] [code]
- StreamTrack (Modeling Continuous Motion for 3D Point Cloud Object Tracking) [paper] [code]
- CMT (CMT: Context-Matching-Guided Transformer for 3D Tracking in Point Clouds) [paper] [code]
- SpOT (SpOT: Spatiotemporal Modeling for 3D Object Tracking) [paper] [code]
- STNet (3D Siamese Transformer Network for Single Object Tracking on Point Clouds) [paper] [code]
- OST (OST: Efficient One-stream Network for 3D Single Object Tracking in Point Clouds) [paper] [code]
- PCET (Implicit and Efficient Point Cloud Completion for 3D Single Object Tracking) [paper] [code]
- PTTR++ (Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer) [paper] [code]
- RDT (Point Cloud Registration-Driven Robust Feature Matching for 3D Siamese Object Tracking) [paper] [code]