LinMu7177's Stars
songweige/rich-text-to-image
Rich-Text-to-Image Generation
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
YuchenLiu98/COMM
PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models"
cmhungsteve/Awesome-Transformer-Attention
A comprehensive paper list on Vision Transformers and attention, including papers, code, and related websites
microsoft/SoM
Set-of-Mark Prompting for LMMs
mlfoundations/datacomp
DataComp: In search of the next generation of multimodal datasets
mlpc-ucsd/BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
baaivision/Emu
Emu Series: Generative Multimodal Models from BAAI
jshilong/GPT4RoI
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
facebookresearch/paco
This repo contains the documentation and code needed to use the PACO dataset: data loaders, training and evaluation scripts for object, part, and attribute prediction models, query evaluation scripts, and visualization notebooks.
linjieli222/VQA_ReGAT
Research Code for ICCV 2019 paper "Relation-aware Graph Attention Network for Visual Question Answering"
RachanaJayaram/Cross-Attention-VizWiz-VQA
A self-evident application of the VQA task is designing systems that aid blind people with sight-reliant queries. The VizWiz VQA dataset originates from images and questions compiled by members of the visually impaired community and, as such, highlights some of the challenges of this use case.
Cloud-CV/EvalAI
Evaluating state of the art in AI
OpenGVLab/Multi-Modality-Arena
Chatbot Arena meets multi-modality! Multi-Modality Arena lets you benchmark vision-language models side-by-side with images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
amazon-science/mm-cot
Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned; more updates to come)
henghuiding/ReLA
[CVPR2023 Highlight] GRES: Generalized Referring Expression Segmentation
k1rezaei/Text-to-concept
UX-Decoder/Semantic-SAM
[ECCV 2024] Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity"
google-research/magvit
Official JAX implementation of MAGVIT: Masked Generative Video Transformer
hoya012/semantic-segmentation-tutorial-pytorch
A simple PyTorch codebase for semantic segmentation using Cityscapes.
daohu527/awesome-self-driving-car
An awesome list of self-driving cars
IDEA-Research/GroundingDINO
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
nish03/FFS
Code for CVPR 2023 Highlight paper "Normalizing Flow based Feature Synthesis for Outlier-Aware Object Detection"
rentainhe/TRAR-VQA
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
OptimalScale/DetGPT
luca-medeiros/lang-segment-anything
SAM with text prompts
MenghaoGuo/Awesome-Vision-Attentions
A summary of papers on visual attention. Related Jittor code will be released gradually.
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
IDEA-Research/Grounded-Segment-Anything
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything