airsplay/vimpac

Visual Token of HowTo100M

zhengsipeng opened this issue · 3 comments

Hi, do you convert the raw videos of the HTM (HowTo100M) dataset into visual tokens during pre-training? And how large are the visual tokens in total? Since HTM takes 12T of space, I'm curious about the size of its visual tokens.

We pre-extracted the tokens and used them during pre-training. The pre-extraction script is provided here: video2token.

I do not have the exact disk usage right now. It should take roughly 100~200G to save all the tokens, since the original videos are heavily compressed when converted to tokens.
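For a rough sanity check of that range, here is a back-of-the-envelope sketch. Every number in it (total hours, sampling rate, tokens per frame, bytes per token) is a hypothetical placeholder for illustration, not the actual setting used by video2token, and `token_storage_bytes` is a made-up helper name.

```python
def token_storage_bytes(total_hours, frames_per_second, tokens_per_frame, bytes_per_token=2):
    """Estimate raw storage for pre-extracted visual tokens.

    All arguments are assumptions supplied by the caller; 2 bytes per token
    corresponds to storing each token id as an unsigned 16-bit integer.
    """
    frames = total_hours * 3600 * frames_per_second
    return frames * tokens_per_frame * bytes_per_token


# Hypothetical values only: 100,000 hours of video, 1 sampled frame per
# second, a 16x16 token grid per frame, 2 bytes per token id.
size = token_storage_bytes(100_000, 1, 16 * 16, 2)
print(f"{size / 1e9:.2f} GB")  # 184.32 GB
```

With these placeholder values the estimate lands inside the 100~200G range, but a different frame rate or token grid would change the result proportionally.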

Hi, can you provide the data processing code for HowTo100M pre-training? It seems a bit different from the downstream datasets.

Hi, is there code for processing the HowTo100M videos?
It seems that video2token only provides the processing code for the downstream datasets.