A subset of YFCC100M. Includes download tools, checking scripts, and web drive links for the dataset.
We followed the dataset preparation process of DeCLIP here.
- First, download DeCLIP's YFCC15M label file 'yfcc15m_clean_open_data.json' from Google Drive.
- Extract the URLs from the JSON file and split them into several URL list files for download using split_download_task.py.
- Crawl the images directly from the URLs using auto_download.bat (we use Wget here, so you may need to install it). The .bat file is for Windows; on Linux you will need to write an equivalent shell script. Or simply download from the links below!
  - You can stop the process and restart it later if something goes wrong. Wget will skip already-downloaded files and clean the log files.
  - Errors are recorded in the log files. Before restarting the download, we recommend running clean_err_file_from_logs.py to filter out and delete the faulty files.
- Check the downloaded images using check_images.py.
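The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in split_download_task.py, auto_download.bat, or check_images.py: it assumes each record in the label file has a `"url"` key, that the images are JPEGs, and all function names are made up for this example. The Wget options used (`--no-clobber`, `--input-file`, `--directory-prefix`, `--append-output`) are standard Wget flags.

```python
import json


def load_urls(label_file):
    """Collect image URLs from the label file.

    ASSUMPTION: the JSON is a list or a dict of records, and each record
    has a "url" key. Adjust this to the real layout of the file.
    """
    with open(label_file, encoding="utf-8") as f:
        data = json.load(f)
    records = data.values() if isinstance(data, dict) else data
    return [rec["url"] for rec in records]


def split_urls(urls, num_splits):
    """Split a URL list into num_splits contiguous, nearly equal chunks."""
    size, extra = divmod(len(urls), num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        # The first `extra` chunks get one additional URL each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(urls[start:end])
        start = end
    return chunks


def wget_command(url_list, out_dir, log_file):
    """Build a Wget call for one URL list file (hypothetical helper)."""
    return [
        "wget",
        "--no-clobber",                     # skip files already on disk
        f"--input-file={url_list}",         # one URL per line
        f"--directory-prefix={out_dir}",    # where images are saved
        f"--append-output={log_file}",      # keep appending to the same log
    ]


def looks_like_jpeg(path):
    """Cheap heuristic: file starts with the JPEG SOI marker and ends with EOI.

    ASSUMPTION: images are JPEGs. A full decode is a stricter check.
    """
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, 2)            # jump to the end to get the file size
        if f.tell() < 4:
            return False
        f.seek(-2, 2)           # last two bytes
        tail = f.read(2)
    return head == b"\xff\xd8" and tail == b"\xff\xd9"
```

On Linux, the list returned by `wget_command` can be passed to `subprocess.run` for each URL list file, which plays the role of the Windows .bat file. Because of `--no-clobber`, re-running the same command after an interruption does not download existing files again.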
Dataset info:
- The dataset should contain 15,388,848 images.
- We managed to crawl 15,061,747 of them (about 97.9%).
- Total space occupied: 867.73G.
Web Drive links:
- 📂split_1✅
- 📂split_2✅
- 📂split_3✅
- 📂split_4✅
- 📂split_5✅
- 📂split_6✅
- 📂split_7✅
- 📂split_8✅
- 📂split_9✅
- 📂split_10✅
- 📂split_11✅
- 📂split_12✅
- 📂split_13✅
- 📂split_14✅
- 📂split_15✅
- 📂split_16✅
If a link fails, please leave a message in the issues.
2024-11-13 update: You may use the bypy tool to download the files from the Baidu Yun web drive.
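A minimal sketch of driving bypy from Python is shown below. It assumes bypy is installed and already authorized (bypy normally asks for a one-time authorization on first use, for example when running `bypy info`), and that the split folders have been saved to the app folder that bypy can access in your own Baidu Yun account. The helper names and the folder names passed in are hypothetical; `bypy downdir <remotedir> <localdir>` is the bypy command for downloading a remote directory recursively.

```python
import subprocess


def bypy_downdir_command(remote_dir, local_dir):
    """Build the bypy call that downloads one remote directory."""
    return ["bypy", "downdir", remote_dir, local_dir]


def download_splits(split_names, local_root, run=subprocess.run):
    """Download each split folder in turn.

    `run` is injectable so the function can be dry-run without bypy installed.
    """
    for name in split_names:
        # check=True raises if bypy exits with a non-zero status.
        run(bypy_downdir_command(name, f"{local_root}/{name}"), check=True)
```

For example, `download_splits(["split_1", "split_2"], "yfcc15m")` would fetch two folders into `yfcc15m/`, provided the remote folder names match what is stored in your account.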