Easy Downloading of Youtube-StoryBoard-1B

Installation

bash scripts/install.sh

Preparing Youtube-Temporal-1B dataset

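Download the YouTube ID CSV files of Youtube-Temporal-1B (saved under dataset/yt-1b, where the later commands expect them):

wget -P dataset/yt-1b https://storage.googleapis.com/merlot/yttemporal1b/yttemporal1b_ids_train.csv
wget -P dataset/yt-1b https://storage.googleapis.com/merlot/yttemporal1b/yttemporal1b_ids_val.csv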

Download

Google Drive Bucket usage reference:

Convert the ID CSV files to URL lists:

# Split the train URLs into 10 parts; the URL lists are stored as "yt1b_urls_train_0.txt", "yt1b_urls_train_1.txt", ...
python url_extraction.py -i dataset/yt-1b/yttemporal1b_ids_train.csv -o dataset/yt-1b/yt1b_urls_train.txt -p 10

# Extract the val URLs; the URL list is stored as "yt1b_urls_val.txt"
python url_extraction.py -i dataset/yt-1b/yttemporal1b_ids_val.csv -o dataset/yt-1b/yt1b_urls_val.txt
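
The conversion step above can be pictured roughly as follows. This is a minimal sketch, not the repository's url_extraction.py: it assumes the CSV holds one video ID in the first column (with an optional header row), and the function names ids_to_urls and write_url_lists are hypothetical. Parts are filled in a strided fashion here, which may differ from how the actual script splits the list.

```python
import csv
from pathlib import Path


def ids_to_urls(csv_path):
    """Read video IDs from the first CSV column and build watch URLs (assumed layout)."""
    urls = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row or not row[0].strip():
                continue  # skip blank lines
            video_id = row[0].strip()
            if video_id.lower() in {"video_id", "id"}:
                continue  # skip a header row if one is present (assumption)
            urls.append(f"https://www.youtube.com/watch?v={video_id}")
    return urls


def write_url_lists(urls, out_path, parts=1):
    """Write urls to out_path, or to <stem>_<i><suffix> files when parts > 1."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if parts <= 1:
        out_path.write_text("\n".join(urls) + "\n")
        return [out_path]
    written = []
    for i in range(parts):
        # e.g. yt1b_urls_train.txt -> yt1b_urls_train_0.txt, yt1b_urls_train_1.txt, ...
        part_path = out_path.with_name(f"{out_path.stem}_{i}{out_path.suffix}")
        part_path.write_text("\n".join(urls[i::parts]) + "\n")
        written.append(part_path)
    return written
```

Calling write_url_lists(ids_to_urls("dataset/yt-1b/yttemporal1b_ids_train.csv"), "dataset/yt-1b/yt1b_urls_train.txt", parts=10) would produce file names matching the ones listed above.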

Download videos (.webp format) from the URLs:

bash scripts/download.sh dataset/yt-1b/yt1b_urls_train_{i}.txt # replace {i} with the part number, 0 to 9; run once per part

bash scripts/download.sh dataset/yt-1b/yt1b_urls_val.txt

Preparing subtitle files

Run the commands below:

# Using too many workers is not recommended, because YouTube restricts the request rate.
python get_transcript.py -i dataset/yt-1b/yt1b_urls_train_{i}.txt -o dataset/yt-1b/subtitles/train_{i} -w 8 # replace {i} with the part number, 0 to 9; run once per part

python get_transcript.py -i dataset/yt-1b/yt1b_urls_val.txt -o dataset/yt-1b/subtitles/val -w 8
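
To illustrate what a bounded worker count means in practice, here is a minimal sketch of a capped thread pool. It is not the code of get_transcript.py: fetch_all is a hypothetical helper, and fetch_one stands for whatever per-URL transcript download function is in use. The workers argument plays the role of the -w option above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, workers=8):
    """Run fetch_one(url) over urls with at most `workers` concurrent calls.

    fetch_one is a hypothetical callable supplied by the caller.
    Returns (results, failures), two dicts keyed by URL.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going and record the failure so it can be retried later.
                failures[url] = exc
    return results, failures
```

Collecting failures separately makes it possible to rerun only the URLs that were rejected, for example after lowering the worker count.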

Generating subtitle file list

Processing data to video-text interleaved samples

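Downloading Youtube-Temporal-1B label files:

# Train annotations are split into 1024 folds, i from 0000 to 1023 (zero-padded to 4 digits).
# These are the objects under gs://merlot/yttemporal1b/train_annotations/, fetched over HTTPS.
for i in $(seq -f "%04g" 0 1023); do
  wget "https://storage.googleapis.com/merlot/yttemporal1b/train_annotations/yttemporal1b_train_${i}of1024.jsonl.gz"
done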
wget https://storage.googleapis.com/merlot/yttemporal1b/yttemporal1b_val_0000of0001.jsonl.gz
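
The label files are gzip-compressed JSON Lines, so they can be streamed without unpacking them first. The following is a minimal sketch; iter_jsonl_gz is a hypothetical helper name, and no assumption is made here about which fields each record contains.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, next(iter_jsonl_gz("yttemporal1b_val_0000of0001.jsonl.gz")) returns the first record, which is a quick way to inspect the available fields.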
