hj611's Stars
vikpe/vscode-theme-screenshots
Automate screenshots of Visual Studio Code themes.
abi/screenshot-to-code
Drop in a screenshot and convert it to clean code (HTML/Tailwind/React/Vue)
IMNearth/CoAT
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
njucckevin/SeeClick
The model, data, and code for the visual GUI agent SeeClick
OSU-NLP-Group/Mind2Web
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web"
THUDM/CogVLM
A state-of-the-art open visual language model | multimodal pretrained model
google-research-datasets/screen_annotation
The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format and describe the UI elements present on the screen: their type, location, OCR text, and a short description. It was introduced in the paper "ScreenAI: A Vision-Language Model for UI and Infographics Understanding".
google-research-datasets/screen_qa
The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico, and is intended for training and evaluating models on screen content understanding via question answering.
openai/human-eval
Code for the paper "Evaluating Large Language Models Trained on Code"
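For orientation, the harness usage documented in the human-eval README fits in a few lines; this is a minimal sketch, with generate_one_completion as a placeholder for whatever code model is being benchmarked:

    # Minimal human-eval flow, following the repo README.
    from human_eval.data import read_problems, write_jsonl

    def generate_one_completion(prompt: str) -> str:
        # Placeholder: query the code model under test and return its completion.
        raise NotImplementedError

    problems = read_problems()  # task_id -> {"prompt", "test", "entry_point", ...}
    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)

Scoring then runs the bundled CLI, evaluate_functional_correctness samples.jsonl, which executes untrusted model output, so the README's sandboxing notes apply.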
meta-llama/codellama
Inference code for CodeLlama models
jadecxliu/CodeQA
Dataset and code for Findings of EMNLP'21 paper "CodeQA: A Question Answering Dataset for Source Code Comprehension".
likaixin2000/MMCode
[EMNLP 2024] Multi-modal code generation problems.
QwenLM/Qwen2.5
Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.
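The models are served through standard Hugging Face transformers; here is a minimal chat sketch following the Qwen2.5 model cards (the checkpoint name is just one of the published sizes):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-7B-Instruct"  # one of several published variants
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    messages = [{"role": "user", "content": "Briefly explain multimodal LLMs."}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Drop the prompt tokens before decoding the reply.
    reply = tokenizer.decode(
        output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(reply)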
InternLM/InternLM-XComposer
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
wangxiang1230/SSTAP
Code for our CVPR 2021 paper "Self-Supervised Learning for Semi-Supervised Temporal Action Proposal".
lllyasviel/ControlNet
Let us control diffusion models!
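The repo ships its own training and Gradio demo scripts; for a quick sense of the technique, this sketch instead uses the diffusers integration, which wraps the released weights (checkpoint names assumed from the Hub):

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from PIL import Image

    # Canny-edge ControlNet conditioning a Stable Diffusion 1.5 base model.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

    # Assumed input: a precomputed Canny edge map as a PIL image.
    edges = Image.open("edges.png")
    out = pipe("a futuristic city at dusk", image=edges,
               num_inference_steps=30).images[0]
    out.save("controlled.png")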
Computer-Vision-in-the-Wild/CVinW_Readings
A collection of papers on the topic of "Computer Vision in the Wild (CVinW)"
Yujun-Shi/DragDiffusion
[CVPR 2024, Highlight] Official code for DragDiffusion
Jingkang50/OpenPSG
Benchmarking Panoptic Scene Graph Generation (PSG), ECCV'22
xfhelen/MMBench
An end-to-end benchmark suite of multi-modal DNN applications for system-architecture co-design
BradyFU/Awesome-Multimodal-Large-Language-Models
Latest Advances on Multimodal Large Language Models
sail-sg/EditAnything
Edit anything in images powered by segment-anything, ControlNet, StableDiffusion, etc. (ACM MM)
showlab/Image2Paragraph
[A toolbox for fun.] Transform an image into a unique paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, and ControlNet.
ranjaykrishna/visual_genome_python_driver
A Python wrapper for the Visual Genome API
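A rough sketch of the driver's documented usage, assuming the package's api module and that the Visual Genome web endpoint is still reachable:

    from visual_genome import api as vg

    ids = vg.get_all_image_ids()            # every image id in the dataset
    image = vg.get_image_data(id=61512)     # metadata: url, width, height, ...
    regions = vg.get_region_descriptions_of_image(id=61512)
    print(image)
    print(regions[0])  # a region description: phrase plus bounding box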
jshilong/GPT4RoI
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
cvdfoundation/open-images-dataset
Open Images is a dataset of ~9 million images that have been annotated with image-level labels and bounding boxes spanning thousands of classes.
om-ai-lab/RS5M
RS5M: a large-scale vision language dataset for remote sensing [TGRS]
OpenGVLab/InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
phellonchen/X-LLM
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Vision-CAIR/MiniGPT-4
Open-sourced code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)