CC3M-Helper: how to prepare the CC3M dataset for gill

Download the raw data as caption-url pairs

Please download the Training split and the Validation split from Conceptual Captions.

Note: there is another tool, CC3M auto download, but it does not seem to work here because its data pair structure is different.

Download raw train and val data

ref: img2dataset

Install img2dataset

pip install img2dataset

Add a header row to the .tsv files

apt install sed

sed -i '1s/^/caption\turl\n/' Train_GCC-training.tsv
sed -i '1s/^/caption\turl\n/' Validation_GCC-1.1.0-Validation.tsv
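
The `sed -i` form above relies on GNU sed. As a hedged alternative, here is a minimal Python sketch that prepends the same `caption\turl` header and skips files that already have it, so it is safe to run twice. The function name `add_header` is hypothetical and not part of any tool mentioned here.

```python
from pathlib import Path

HEADER = "caption\turl"  # the same header line the sed commands insert


def add_header(tsv_path, header=HEADER):
    """Prepend the header line to a TSV file unless it is already present.

    Returns True if the file was modified, False if the header was already there.
    """
    path = Path(tsv_path)
    tmp_path = path.with_name(path.name + ".tmp")
    with path.open("r", encoding="utf-8") as src:
        first_line = src.readline()
        if first_line.rstrip("\n") == header:
            return False  # nothing to do
        # Stream the original content into a temporary file behind the header
        with tmp_path.open("w", encoding="utf-8") as dst:
            dst.write(header + "\n")
            dst.write(first_line)
            for line in src:
                dst.write(line)
    # Replace the original file only after the copy has finished
    tmp_path.replace(path)
    return True
```

For example, `add_header("Train_GCC-training.tsv")` and `add_header("Validation_GCC-1.1.0-Validation.tsv")` would cover the two files used in this guide.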

Download image data

ref: NExT-GPT

# Make a dir
mkdir cc3m

# Download training image
img2dataset --url_list Train_GCC-training.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder cc3m/training --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True

# Download validation image
img2dataset --url_list Validation_GCC-1.1.0-Validation.tsv --input_format "tsv" --url_col "url" --caption_col "caption" --output_format webdataset --output_folder cc3m/validation --processes_count 16 --thread_count 64 --image_size 256 --enable_wandb True

Note that:

  • url_list A file with the list of url of images to download. It can be a folder of such files. (required)
  • image_size The size to resize image to (default 256)
  • output_folder The path to the output folder. (default "images")
  • processes_count The number of processes used for downloading the pictures. This is important to be high for performance. (default 1)
  • thread_count The number of threads used for downloading the pictures. This is important to be high for performance. (default 256)
  • output_format decides how to save pictures (default files)
    • files saves as a set of subfolder containing pictures
    • webdataset saves as tars containing pictures
    • ...
  • url_col the name of the url column for parquet and csv (default url)
  • caption_col the name of the caption column for parquet and csv (default None)
  • enable_wandb whether to enable wandb logging (default False)
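
The two download commands differ only in the input file and the output folder. The following sketch builds both argument lists from one place, using only the flags and values shown in this guide. The helper names `build_img2dataset_cmd` and `commands_for_splits` are hypothetical, and the sketch only constructs the commands, it does not run them.

```python
import shlex

# Input file and output folder per split, as used in this guide
SPLITS = {
    "training": ("Train_GCC-training.tsv", "cc3m/training"),
    "validation": ("Validation_GCC-1.1.0-Validation.tsv", "cc3m/validation"),
}


def build_img2dataset_cmd(url_list, output_folder, image_size=256,
                          processes_count=16, thread_count=64,
                          enable_wandb=True):
    """Return the img2dataset command line as a list of arguments."""
    return [
        "img2dataset",
        "--url_list", url_list,
        "--input_format", "tsv",
        "--url_col", "url",
        "--caption_col", "caption",
        "--output_format", "webdataset",
        "--output_folder", output_folder,
        "--processes_count", str(processes_count),
        "--thread_count", str(thread_count),
        "--image_size", str(image_size),
        "--enable_wandb", str(enable_wandb),
    ]


def commands_for_splits(**overrides):
    """Build one command per split; keyword overrides apply to every split."""
    return {
        split: build_img2dataset_cmd(url_list, folder, **overrides)
        for split, (url_list, folder) in SPLITS.items()
    }


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

Each list can be passed to `subprocess.run(cmd, check=True)`, or printed with `as_shell(cmd)` to compare against the commands above. Passing `enable_wandb=False` to `commands_for_splits` drops the wandb logging if you do not use it.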

After these two commands, you will have 331 .tar files for the training split and 2 .tar files for the validation split. For download benchmarks, see the cc3m download benchmark.
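
To confirm the download produced the shards you expect, a small sketch like the one below counts the .tar files in each output folder. The function name `count_shards` is hypothetical, and the folder names assume the `--output_folder` values used above.

```python
from pathlib import Path


def count_shards(folder):
    """Count the .tar shards directly inside a folder (non-recursive)."""
    return sum(1 for _ in Path(folder).glob("*.tar"))


def report(folders=("cc3m/training", "cc3m/validation")):
    """Return a dict mapping each existing folder to its shard count."""
    return {f: count_shards(f) for f in folders if Path(f).is_dir()}
```

Calling `report()` from the directory that contains `cc3m` returns the shard count per split, which you can compare with the numbers quoted above.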

Decompress all data

Define a shell script

vi untar.sh

type in,

for file in *.tar; do
	tar -xvf "$file" 
done

give it execute permission,

chmod +x untar.sh

decompress the training and validation image datasets,

cp untar.sh cc3m/training
cp untar.sh cc3m/validation

cd cc3m/training
./untar.sh

cd ../validation
./untar.sh
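
If you prefer not to copy a shell script into each folder, the same extraction can be sketched in Python with the standard `tarfile` module. The function name `extract_all_tars` is hypothetical. Like `untar.sh`, it unpacks every .tar file into the folder that contains it.

```python
import tarfile
from pathlib import Path


def extract_all_tars(folder):
    """Extract every .tar file in `folder` into that same folder.

    Returns the number of archives extracted.
    """
    folder = Path(folder)
    # Use the safer "data" extraction filter where this Python version has it
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    extracted = 0
    for tar_path in sorted(folder.glob("*.tar")):
        with tarfile.open(tar_path) as archive:
            archive.extractall(path=folder, **kwargs)
        extracted += 1
    return extracted
```

For example, `extract_all_tars("cc3m/training")` followed by `extract_all_tars("cc3m/validation")` covers both splits without changing directories.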

Generate new .tsv files

Create a Python script

vi gen_train_val_tsv.py

and type in:

import json
import os
from tqdm import tqdm

def process_json_files(directory, output_file):
    # Open the output file in write mode, which replaces any existing file
    with open(output_file, 'w', encoding='utf-8') as out_file:
        # Write the column headers
        out_file.write('caption\timage\n')

        # Iterate over all files in the specified directory
        for filename in tqdm(os.listdir(directory), desc="Parsing dataset..."):
            if not filename.endswith('.json'):
                continue

            # Construct the full file path
            filepath = os.path.join(directory, filename)

            # Open and read the JSON file, skipping files that fail to parse
            with open(filepath, 'r', encoding='utf-8') as file:
                try:
                    data = json.load(file)
                except json.JSONDecodeError:
                    continue

            # Keep only entries whose status field is 'success'
            if data.get('status') == 'success':
                # Extract the values of the caption and key fields
                caption = data.get('caption', '')
                key = data.get('key', '') + '.jpg'

                # Write the extracted data to the output file
                out_file.write(f"{caption}\t{key}\n")


# Call the function, specifying the directory and output file paths
process_json_files('path/to/your/train/json/files', 'cc3m_train.tsv')
process_json_files('path/to/your/val/json/files', 'cc3m_val.tsv')

Then run the Python file:

python gen_train_val_tsv.py
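
Before handing the generated files to gill, it can help to check them. The sketch below assumes the `caption\timage` header and two-column layout written by the script above, and that the extracted .jpg files sit in a single folder. The function name `check_tsv` is hypothetical.

```python
import csv
import os


def check_tsv(tsv_path, image_dir):
    """Validate a generated TSV file.

    Returns (number of data rows, list of image names missing from image_dir).
    Raises ValueError if the header or a row does not have the expected shape.
    """
    rows = 0
    missing = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: captions may contain quote characters that are plain text
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header != ["caption", "image"]:
            raise ValueError(f"unexpected header: {header}")
        for line_number, row in enumerate(reader, start=2):
            if len(row) != 2:
                raise ValueError(f"line {line_number}: expected 2 columns, got {len(row)}")
            rows += 1
            if not os.path.exists(os.path.join(image_dir, row[1])):
                missing.append(row[1])
    return rows, missing
```

For example, `check_tsv("cc3m_train.tsv", "cc3m/training")` returns the row count and any listed images that are not on disk. A row that fails the column check usually means a caption contained a tab character.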

After this, move these two files (cc3m_train.tsv and cc3m_val.tsv) into gill/datasets under your gill home path, replacing the existing example files.

Then go back to gill - Precomputing Text Embeddings