google-research-datasets/conceptual-12m

The overlap between CC3M and CC12M

weiyx16 opened this issue · 2 comments

Thank you very much for your excellent work!
I have a small question about the overlap between the CC3M and CC12M datasets. My understanding was that CC12M is an expansion of CC3M, so most of the images in CC3M should also be included in CC12M. But after I downloaded both TSV files and compared their URLs, I found only about 63k CC12M URLs that also appear in CC3M. Is this expected, or did I do something wrong?
Any help would be greatly appreciated. I believe this dataset will contribute something really interesting to this area.
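
For reference, the URL comparison described above can be sketched roughly as follows. This is a minimal sketch, not the exact script used: the file names in the usage comment and the URL column indices are assumptions (check which column holds the URL in each TSV), and URLs are matched by exact string equality, so the same image served under a different URL would not be counted.

```python
def load_urls(path, url_column):
    """Collect the set of URLs found in one column of a tab-separated file."""
    urls = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Skip malformed rows that do not have the expected column.
            if len(fields) > url_column:
                urls.add(fields[url_column].strip())
    return urls


def url_overlap(path_a, column_a, path_b, column_b):
    """Return the URLs that appear in both files (exact string match)."""
    return load_urls(path_a, column_a) & load_urls(path_b, column_b)


# Hypothetical usage; file names and column indices are assumptions:
# shared = url_overlap("cc3m.tsv", 1, "cc12m.tsv", 0)
# print(len(shared))
```

Using sets keeps the comparison linear in the number of rows and ignores duplicate URLs within a single file.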

Thank you for your interest and comments. An intersection of around 63k URLs is expected; CC12M is not an expanded CC3M. We have added our full response to this question in our FAQs (#3). Hope this helps.

Thank you! It really helps