magic-research/PLLaVA

Dataset Request

Opened this issue · 2 comments

Great work! I was hoping to quickly get the dataset you used and try training your model, but the list of datasets is a bit of a mess. Could you please provide a consolidated training dataset? Thank you very much!

I am only sharing a subset of the data used; you can fetch the other datasets directly from Hugging Face in a similar way. Disclaimer: this is not the only way to download the data, just one of them.

For the image subset, use:
curl -X GET \
  -H "Authorization: Bearer $HF_TOKEN" \
  "https://datasets-server.huggingface.co/rows?dataset=OpenGVLab%2FVideoChat2-IT&config=image_caption&split=textcaps&offset=0&length=100"

For the video subset, use:
curl -X GET \
  -H "Authorization: Bearer $HF_TOKEN" \
  "https://datasets-server.huggingface.co/rows?dataset=OpenGVLab%2FVideoChat2-IT&config=video_conversation&split=videochat2&offset=0&length=100"
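Each request above returns only the 100 rows starting at offset=0, so getting more of a split means repeating the call with a larger offset. Below is a minimal sketch of that loop, not an official download script. The function names rows_url and fetch_pages and the page_<offset>.json output files are hypothetical, and it assumes HF_TOKEN is exported in the environment.

```shell
#!/bin/sh
# Sketch: page through the /rows endpoint used above, 100 rows per request.
# rows_url and fetch_pages are hypothetical helper names, not part of any tool.

BASE_URL="https://datasets-server.huggingface.co/rows"

# Print the request URL for one page.
# Arguments: config, split, offset, and an optional length (default 100).
rows_url() {
  printf '%s?dataset=OpenGVLab%%2FVideoChat2-IT&config=%s&split=%s&offset=%s&length=%s\n' \
    "$BASE_URL" "$1" "$2" "$3" "${4:-100}"
}

# Download the first N pages of a split into page_<offset>.json files.
# Arguments: config, split, number of pages. Stops at the first failed request.
fetch_pages() {
  fp_page=0
  while [ "$fp_page" -lt "$3" ]; do
    fp_offset=$((fp_page * 100))
    curl -sf -H "Authorization: Bearer $HF_TOKEN" \
      "$(rows_url "$1" "$2" "$fp_offset")" \
      -o "page_${fp_offset}.json" || return 1
    fp_page=$((fp_page + 1))
  done
}
```

For example, `fetch_pages video_conversation videochat2 3` would request offsets 0, 100, and 200. Note that this only retrieves the annotation rows, not the raw video files they refer to.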

Thanks for your attention. However, I could only find the JSON files at the provided link. How can I get a filtered raw video dataset, rather than downloading all of the video datasets mentioned in data.md? It's too memory-consuming.