ZCMax's Stars
RVC-Boss/GPT-SoVITS
1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
openai/CLIP
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
meta-llama/llama3
The official Meta Llama 3 GitHub site
hpcaitech/Open-Sora
Open-Sora: Democratizing Efficient Video Production for All
dair-ai/ml-visuals
🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.
naklecha/llama3-from-scratch
llama3 implementation one matrix multiplication at a time
apple/ml-ferret
rerun-io/rerun
Visualize streams of multimodal data. Fast, easy to use, and simple to integrate. Built in Rust using egui.
dvlab-research/MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
yunlong10/Awesome-LLMs-for-Video-Understanding
🔥🔥🔥 Latest papers, code, and datasets on Vid-LLMs.
ActiveVisionLab/Awesome-LLM-3D
Awesome-LLM-3D: a curated list of resources on Multi-modal Large Language Models in the 3D world
mbzuai-oryx/LLaVA-pp
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
PKU-YuanGroup/Chat-UniVi
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
LLaVA-VL/LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
LLaVA-VL/LLaVA-NeXT
open-compass/VLMEvalKit
Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 50+ HF models, and 20+ benchmarks
magic-research/PLLaVA
Official repository for the paper PLLaVA
melon/qingwu-zimu
Qingwu Zimu (青梧字幕) is a Whisper-based AI subtitle extraction tool
EPFL-VILAB/omnidata
A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]
dvlab-research/Stratified-Transformer
Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)
mbanani/probe3d
[CVPR 2024] Probing the 3D Awareness of Visual Foundation Models
UMass-Foundation-Model/3D-VLA
Source codes for "3D-VLA: A 3D Vision-Language-Action Generative World Model"
facebookresearch/open-eqa
OpenEQA: Embodied Question Answering in the Era of Foundation Models
sshh12/multi_token
Embed arbitrary modalities (images, audio, documents, etc) into large language models.
scene-verse/SceneVerse
behavior-vision-suite/behavior-vision-suite.github.io
remyxai/VQASynth
Compose multimodal datasets 🎹
zhouxian/act3d-chained-diffuser
A unified architecture for multimodal multi-task robotic policy learning.
xuxw98/Online3D
[CVPR 2024] Memory-based Adapters for Online 3D Scene Perception
joyhsu0504/NS3D