YFCC15M_downloader

YFCC15M is a subset of YFCC100M. This repository provides tools, checking scripts, and web drive links for downloading the dataset.

We followed the dataset preparation process of DeCLIP here.

First, download DeCLIP's YFCC15M label file 'yfcc15m_clean_open_data.json' from Google Drive.
Extract the URLs from the JSON file and split them into several URL list files for download using split_download_task.py.
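The extract-and-split step can be sketched roughly as follows. This is a minimal illustration, not the actual contents of split_download_task.py. It assumes the JSON file is either a dict or a list of records and that each record stores its link under a "url" key; that key name, the chunk size, and the output file naming (`urls_0000.txt`, ...) are all assumptions made for the example.

```python
import json
from pathlib import Path


def extract_urls(records, url_key="url"):
    """Collect URL strings from a dict-of-records or a list-of-records.

    The "url" key is an assumption about the label file layout.
    Records without that key are skipped.
    """
    items = records.values() if isinstance(records, dict) else records
    return [rec[url_key] for rec in items
            if isinstance(rec, dict) and rec.get(url_key)]


def split_into_chunks(urls, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]


def write_url_lists(chunks, out_dir, prefix="urls_"):
    """Write each chunk to its own text file, one URL per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunks):
        path = out_dir / f"{prefix}{index:04d}.txt"
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def split_label_file(json_path, out_dir, chunk_size=100_000):
    """Load the label file and write the URL list files (hypothetical helper)."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return write_url_lists(split_into_chunks(extract_urls(data), chunk_size), out_dir)
```

For example, `split_label_file("yfcc15m_clean_open_data.json", "url_lists")` would write one text file per 100,000 URLs. Each file can then be handed to a separate download task.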
Crawl the images directly from the URLs using auto_download.bat (we use Wget here, so you may need to install it). The .bat file is for Windows; on Linux you will need to write an equivalent shell script. Or simply download from the links below!
- You can stop the process and restart it later if something goes wrong. Wget will skip files that are already downloaded and clear the log files.
- Errors are recorded in the log files. Before restarting the download, it is recommended to run clean_err_file_from_logs.py to filter out and delete the broken files.
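For Linux users, a cross-platform alternative to rewriting the .bat file is a small Python wrapper around Wget. The sketch below is not a translation of auto_download.bat, whose exact options are not shown here. The directory layout, the retry and timeout values, and the function names are assumptions. The Wget options used (`--input-file`, `--no-clobber`, `--directory-prefix`, `--output-file`, `--tries`, `--timeout`) are standard GNU Wget options.

```python
import subprocess
from pathlib import Path


def build_wget_command(url_list, image_dir, log_file, tries=2, timeout=30):
    """Build a Wget command line for one URL list file.

    --no-clobber makes Wget skip files that already exist, so an
    interrupted run can be restarted; --output-file writes a fresh
    log on every run.
    """
    return [
        "wget",
        f"--input-file={url_list}",
        "--no-clobber",
        f"--directory-prefix={image_dir}",
        f"--output-file={log_file}",
        f"--tries={tries}",
        f"--timeout={timeout}",
    ]


def download_all(list_dir, image_dir, log_dir):
    """Run Wget once per URL list file found in list_dir (assumed *.txt)."""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    for url_list in sorted(Path(list_dir).glob("*.txt")):
        log_file = Path(log_dir) / (url_list.stem + ".log")
        # check=False: Wget exits non-zero when any single URL fails,
        # which is expected in a crawl of this size.
        subprocess.run(
            build_wget_command(url_list, image_dir, log_file),
            check=False,
        )
```

Calling `download_all("url_lists", "images", "logs")` would process the list files one after another. Running several list files in parallel is possible but left out to keep the sketch short.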
Check the downloaded images using check_images.py.
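As an illustration of what such a check can look like, the sketch below flags files that are empty or truncated by inspecting the JPEG start and end markers. It is only a quick heuristic under the assumption that the images are JPEG files with a `.jpg` extension. check_images.py may work differently, for example by fully decoding every image, and a file that passes this test can still be corrupt inside.

```python
import os
from pathlib import Path

JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def looks_like_complete_jpeg(path):
    """Return True if the file starts and ends with the JPEG markers."""
    path = Path(path)
    if path.stat().st_size < 4:
        return False  # empty or too small to hold both markers
    with path.open("rb") as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)
        tail = f.read(2)
    return head == JPEG_START and tail == JPEG_END


def find_suspect_images(image_dir, pattern="*.jpg"):
    """List files under image_dir that fail the marker check."""
    return sorted(
        p for p in Path(image_dir).rglob(pattern)
        if not looks_like_complete_jpeg(p)
    )
```

The paths returned by `find_suspect_images("images")` are candidates for deletion and re-download.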

Dataset info:

Web Drive links:

If a link fails, please leave a message in the issues.

2024-11-13 update: You can use the bypy tool to download the files from the Baidu Yun Web Drive.

AdamRain/YFCC15M_downloader