rohun-tripathi's Stars
openai/whisper
Robust Speech Recognition via Large-Scale Weak Supervision
suno-ai/bark
🔊 Text-Prompted Generative Audio Model
s3prl/s3prl
Self-Supervised Speech Pre-training and Representation Learning Toolkit
EvolvingLMMs-Lab/lmms-eval
Accelerating the development of large multimodal models (LMMs) with lmms-eval
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of language models.
mayubo2333/VLMEvalKit
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 50+ HF models, and 20+ benchmarks
open-compass/VLMEvalKit
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
FuxiaoLiu/LRV-Instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
NVlabs/VILA
VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
ayaka14732/jax-smi
JAX Synergistic Memory Inspector
google-deepmind/geckonum_benchmark_t2i
GeckoNum Benchmark for T2I Model Eval.
OpenGVLab/Ask-Anything
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
InternLM/InternLM-XComposer
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
mutonix/Vript
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Alpha-VLLM/Lumina-T2X
Lumina-T2X is a unified framework for Text to Any Modality Generation
hpcaitech/Open-Sora
Open-Sora: Democratizing Efficient Video Production for All
prometheus-eval/prometheus-eval
Evaluate your LLM's responses with Prometheus and GPT-4 💯
prometheus-eval/prometheus-vision
[ACL 2024 Findings & ICLR 2024 WS] An evaluator VLM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically designed for fine-grained evaluation against customized score rubrics, Prometheus-Vision is a good alternative to human evaluation and GPT-4V evaluation.
GAP-LAB-CUHK-SZ/MVImgNet
CVPR2023 | MVImgNet: A Large-scale Dataset of Multi-view Images
NVlabs/RADIO
Official repository for "AM-RADIO: Reduce All Domains Into One"
cvdfoundation/google-landmark
Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.
cambridgeltl/visual-spatial-reasoning
[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
mosaicml/diffusion
LLaVA-VL/LLaVA-NeXT
PKU-YuanGroup/Open-Sora-Plan
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.
microsoft/SoM
Set-of-Mark Prompting for GPT-4V and LMMs
SivanDoveh/TSVLC
Repository for the paper: Teaching Structured Vision & Language Concepts to Vision & Language Models
google-research-datasets/wit
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.
clip-italian/clip-italian
CLIP (Contrastive Language–Image Pre-training) for Italian