bpiyush
1st year DPhil, VGG, Oxford. Past: MSc in AI from UvA | Research @ Wadhwani AI | B.S. in Mathematics @ IIT Kanpur
University of Oxford · Oxford
bpiyush's Stars
meta-llama/llama-recipes
Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, plus a number of candidate inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama for WhatsApp & Messenger.
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
cambrian-mllm/cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
google-gemini/generative-ai-python
The official Python library for the Google Gemini API
facebookresearch/TimeSformer
The official PyTorch implementation of the paper "Is Space-Time Attention All You Need for Video Understanding?"
FoundationVision/LlamaGen
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
DAMO-NLP-SG/VideoLLaMA2
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
rhymes-ai/Aria
Codebase for Aria - an Open Multimodal Native MoE
PKU-YuanGroup/LanguageBind
[ICLR 2024 🔥] Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
ylsung/Ladder-Side-Tuning
PyTorch code for "LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning"
ziqipang/LM4VisualEncoding
[ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers"
zhangfaen/finetune-Qwen2-VL
reka-ai/reka-vibe-eval
Multimodal language model benchmark, featuring challenging examples
joaanna/something_else
Code repository for the paper: 'Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks'
epic-kitchens/epic-kitchens-100-annotations
Annotations for the public release of the EPIC-KITCHENS-100 dataset
see2sound/see2sound
Official code for SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
yukw777/VideoBLIP
Supercharged BLIP-2 that can handle videos
dylran/crowddiff
NeeluMadan/ViFM_Survey
Foundation Models for Video Understanding: A Survey
songrise/CLIP-Count
[ACM MM23] CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
wenhuchen/Time-Sensitive-QA
Code and data for the NeurIPS 2021 paper "A Dataset for Answering Time-Sensitive Questions"
park-jungin/DualPath
ninatu/howtocaption
Official implementation of "HowToCaption: Prompting LLMs to Transform Video Annotations at Scale" (ECCV 2024)
Yiming-M/CLIP-EBC
The official implementation of the crowd counting model CLIP-EBC.
alibaba-mmai-research/DiST
ICCV2023: Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
wlin-at/MAXI
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge (ICCV 2023)
leexinhao/ZeroI2V
Official implementation of "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video" (ECCV2024)
zhaochen0110/Timo
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
eisneim/clip-vip_video_search
Shows how to use CLIP-ViP for video search
Pshubham1012/Classification-approach