CC3M-Helper: how to prepare the CC3M dataset for gill
Please download the Training split and the Validation split via Conceptual Captions.
Note: there is another tool, CC3M auto download, but it does not seem to work here because its data pair structure is different.
ref: img2dataset
pip install img2dataset
apt install sed
sed -i '1s/^/caption\turl\n/' Train_GCC-training.tsv
sed -i '1s/^/caption\turl\n/' Validation_GCC-1.1.0-Validation.tsv
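If sed is not available (for example on Windows), a minimal Python sketch can prepend the same header; the file names below assume the TSVs from the previous step sit in the current directory.

```python
# Python alternative to the two sed commands above (a minimal sketch;
# assumes the TSV files sit in the current directory).
for tsv in ('Train_GCC-training.tsv', 'Validation_GCC-1.1.0-Validation.tsv'):
    with open(tsv, 'r', encoding='utf-8') as f:
        body = f.read()
    if not body.startswith('caption\turl'):  # skip files already patched
        with open(tsv, 'w', encoding='utf-8') as f:
            f.write('caption\turl\n' + body)
```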
ref: NExT-GPT
# Make a dir
mkdir cc3m
# Download training images
img2dataset --url_list Train_GCC-training.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder cc3m/training --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True
# Download validation images
img2dataset --url_list Validation_GCC-1.1.0-Validation.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder cc3m/validation --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True
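If you would rather drive this from Python, img2dataset also exposes a download() entry point whose keyword arguments mirror the CLI flags. A sketch of the training run, with the flags copied from the command above:

```python
from img2dataset import download

# Python equivalent of the training command above (a sketch; the keyword
# arguments mirror the CLI flags one-to-one)
download(
    url_list="Train_GCC-training.tsv",
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",
    output_folder="cc3m/training",
    processes_count=16,
    thread_count=64,
    image_size=256,
    enable_wandb=True,
)
```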
**Note that:**
- `url_list`: A file with the list of URLs of images to download. It can be a folder of such files. (required)
- `image_size`: The size to resize images to. (default 256)
- `output_folder`: The path to the output folder. (default "images")
- `processes_count`: The number of processes used for downloading the pictures. This should be high for performance. (default 1)
- `thread_count`: The number of threads used for downloading the pictures. This should be high for performance. (default 256)
- `output_format`: Decides how to save pictures. (default "files")
  - `files`: saves as a set of subfolders containing pictures
  - `webdataset`: saves as tars containing pictures
  - ...
- `url_col`: The name of the URL column for parquet and csv. (default "url")
- `caption_col`: The name of the caption column for parquet and csv. (default None)
- `enable_wandb`: Whether to enable wandb logging. (default False)
After these two commands, you will have 331 .tar files for the training split and 2 .tar files for the validation split. A download benchmark can be found here: cc3m download benchmark.
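To verify the shard counts, a quick check (a sketch; the paths assume the output folders used above):

```python
import glob

# Count the webdataset shards written by img2dataset
print(len(glob.glob('cc3m/training/*.tar')))    # expect 331
print(len(glob.glob('cc3m/validation/*.tar')))  # expect 2
```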
Define a shell script
vi untar.sh
and type in:
for file in *.tar; do
tar -xvf "$file"
done
Give it execute permission:
chmod 777 untar.sh
Decompress the training and validation image datasets:
cp untar.sh cc3m/training
cd cc3m/training
./untar.sh
cd ../..
cp untar.sh cc3m/validation
cd cc3m/validation
./untar.sh
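If you prefer to skip the shell script, a minimal Python sketch with the standard tarfile module performs the same extraction; the directory names assume the img2dataset output folders above.

```python
import glob
import os
import tarfile

# Extract every shard in place, mirroring what untar.sh does per split
for split_dir in ('cc3m/training', 'cc3m/validation'):
    for tar_path in sorted(glob.glob(os.path.join(split_dir, '*.tar'))):
        with tarfile.open(tar_path) as tar:
            tar.extractall(path=split_dir)
```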
vi gen_train_val_tsv.py
and type in:
import json
import os

from tqdm import tqdm


def process_json_files(directory, output_file):
    # Check if the output file exists, and delete it if it does
    if os.path.exists(output_file):
        os.remove(output_file)

    with open(output_file, 'a') as out_file:
        # Create and write the column headers
        out_file.write('caption\timage\n')
        # Iterate over all files in the specified directory
        for filename in tqdm(os.listdir(directory), desc="Parsing dataset..."):
            if not filename.endswith('.json'):
                continue
            # Construct the full file path
            filepath = os.path.join(directory, filename)
            # Open and read the JSON file
            with open(filepath, 'r') as file:
                try:
                    data = json.load(file)
                except json.JSONDecodeError:
                    continue
            # Keep only samples whose status field is 'success'
            if data.get('status') == 'success':
                # Extract the caption and the image file name (key + '.jpg')
                caption = data.get('caption', '')
                key = data.get('key', '') + '.jpg'
                # Write the extracted pair to the output file
                out_file.write(f"{caption}\t{key}\n")


# Call the function, specifying the directory and output file paths
process_json_files('path/to/your/train/json/files', 'cc3m_train.tsv')
process_json_files('path/to/your/val/json/files', 'cc3m_val.tsv')
Then run the Python file:
python gen_train_val_tsv.py
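To sanity-check the result, you can print the first few rows; the second column should hold key-based .jpg file names (a minimal sketch, assuming the script above ran in the current directory):

```python
# Peek at the generated TSV
with open('cc3m_train.tsv') as f:
    for _ in range(3):
        print(f.readline().rstrip('\n'))
# The first printed line should be the header: caption\timage
```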
After this, move the two generated files (cc3m_train.tsv and cc3m_val.tsv) into gill/datasets under your gill home path, replacing the existing examples.
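The move can also be scripted; a sketch, assuming the gill repository is cloned at ./gill (adjust the path to your checkout):

```python
import shutil

# Replace the example TSVs shipped with gill (the destination path is an
# assumption; point it at your own clone of the repository)
for tsv in ('cc3m_train.tsv', 'cc3m_val.tsv'):
    shutil.move(tsv, f'gill/datasets/{tsv}')
```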
Then go back to gill - Precomputing Text Embeddings.