A subset of Google's Conceptual Captions (3M) dataset that includes 940k image-text samples.
The data comes from Google's Conceptual Captions (CC) dataset. 940k image-text pairs were selected from the original CC dataset; the caption-image data is saved in clean_train.tsv and the image IDs are saved in clean_trainImages.txt. We then used Hugging Face's pipelines to generate additional captions for the images, e.g.
See [ModelName].tsv and [ModelName].txt for the caption-image data and image IDs of each model. The model-generated captions were checked and cleaned by removing special symbols and tokens that have no training value, such as dates and names. The result is the "ConceptualCaptions-940k" dataset, intended for training image captioning and related multimodal models.
The dataset is now available for download on Kaggle: conceptualcaptions-940ksubset. At least 68GB of free disk space is required before downloading.
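As a quick pre-download check, the sketch below tests whether the target drive has enough free space. The 68 figure is the one quoted above; treating it as GiB is an assumption made to stay on the safe side, and the function name `has_enough_space` is hypothetical.

```python
import shutil

# Figure quoted above; treated as GiB (an assumption) to stay on the safe side.
REQUIRED_GIB = 68


def has_enough_space(path=".", required_gib=REQUIRED_GIB):
    """Return True if the filesystem holding `path` has at least `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * 1024**3


if __name__ == "__main__":
    print("enough space" if has_enough_space() else "not enough space")
```

Pass the directory you plan to download into as `path`, since free space is reported per filesystem.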
- Download imgs.7z.001 through imgs.7z.092 and extract imgs.7z.001 (Bandizip is recommended) to get the image files.
- Download an image list TXT file and its captions TSV file, for example blip.txt and blip.tsv.
- Load the image list to get all the image file names, then look up each caption in the TSV file and each image in imgs/.
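The last step above can be sketched in Python as follows. This is a minimal sketch under assumptions the description does not confirm: the TXT file holds one image file name per line, the TSV file has the caption in its first column and the matching image file name in its second, and the names match the files in imgs/ exactly. The helper names and the `caption_col` / `image_col` parameters are hypothetical; adjust them after inspecting the actual files.

```python
import csv
from pathlib import Path


def load_image_list(txt_path):
    """Return the non-empty lines of the image list TXT file."""
    with open(txt_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_captions(tsv_path, caption_col=0, image_col=1):
    """Map image name -> caption. The column order is an assumption."""
    captions = {}
    with open(tsv_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= max(caption_col, image_col):
                continue  # skip malformed or short rows
            captions[row[image_col]] = row[caption_col]
    return captions


def iter_samples(txt_path, tsv_path, image_dir="imgs"):
    """Yield (image_path, caption) for listed images that have both a caption and a file."""
    captions = load_captions(tsv_path)
    image_dir = Path(image_dir)
    for name in load_image_list(txt_path):
        image_path = image_dir / name
        if name in captions and image_path.is_file():
            yield image_path, captions[name]


if __name__ == "__main__":
    # File names taken from the example above; print the first few pairs.
    for i, (image_path, caption) in enumerate(iter_samples("blip.txt", "blip.tsv")):
        print(image_path, "|", caption)
        if i >= 4:
            break
```

Entries without a caption or without an image file on disk are skipped silently, which keeps the loop usable while only some of the archives have been extracted.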
Coming soon.