dvlab-research/LLaMA-VID
Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Python · Apache-2.0
Issues
About ZERO3
#75 opened by xxtars - 0
training loss in stage-1
#88 opened by Nastu-Ho - 0
code details
#87 opened by Nastu-Ho - 0
Extract context relevancy
#86 opened by IgnacioSan22 - 0
KeyError: 'LlavaConfig'
#85 opened by skyol99 - 0
About the WebVid dataset
#83 opened by szbcasia - 1
why not use LoRA for tuning Vicuna?
#72 opened by dragen1860 - 0
Confusion in pre-process images for long video
#77 opened by zhuqiangLu - 0
Zero-3 offload support
#60 opened by XenonLamb - 1
About the json in stage2 and stage3
#79 opened by liziming5353 - 0
Questions about Text Decoder and Text Query
#80 opened by SeuXiao - 0
about the context length for long video
#78 opened by zhuqiangLu - 1
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)
#76 opened by daocodedao - 1
Multi-image inference
#71 opened by g-h-chen - 1
error: llava key
#64 opened by menahem-borges-rodrigues - 2
Sharing training loss
#59 opened by Deaddawn - 1
Computation costs for each stage?
#70 opened by Becomebright - 2
How to change default path for model_zoo
#67 opened by sykuann - 1
Questions about the subtitles.
#66 opened by Yxxxb - 2
Long Video dataset
#61 opened by eslambakr - 2
flash-attn
#65 opened by ismailukman - 1
About evaluation on vqav2 dataset
#63 opened by liziming5353 - 5
Incomplete evaluation on MSVD-QA dataset.
#52 opened by XenonLamb - 3
About text encoder
#51 opened by liziming5353 - 3
MSVD ACC decrease after stage3
#58 opened by Deaddawn - 1
Custom long videos fail to run at all
#54 opened by TotoroDHL - 3
is eva_vit_g.pth trained by yourself?
#56 opened by Deaddawn - 1
why do stage 1 and 2 use different parameters: `--version plain_guided` vs. `--version imgsp_v1`?
#55 opened by dragen1860 - 2
Enquiry on Download Permission
#53 opened by HenryHZY - 2
A question in stage3
#45 opened by liziming5353 - 3
two types of tokenizer?
#43 opened by dragen1860 - 1
multiple json for training?
#39 opened by dragen1860 - 4
Long Video CLI wrong
#48 opened by QiSu77 - 1
is the LLM weight trainable during stage1-2-3?
#49 opened by dragen1860 - 1
stage 2: freezing the visual encoder?
#44 opened by dragen1860 - 1
why is `build_vision_tower` called twice?
#42 opened by dragen1860 - 1
what does `lazy_preprocess` mean?
#41 opened by dragen1860