A dataset of +500,000 mineral images with labels taken from [mindat.org]. The dataset is in two CVS files. The first file img_url_list.csv
contains a list of the image URLs and the raw labels corresponding to the title of the image as seen on mindat.org. The second file img_url_list_converted.csv
contains the same image URLs with cleaned up labels and images with no labels removed. All scripts to generated these CVS files are provided alone with a script to download all the images. Generating the CVS file of images take around 10 hours and downloading all images requires around 24 hours with a good internet connection (>10mbps). All scripts are multi threaded.
To generate the URL list run make_url_list.py
and this will go through [mindat.org] and scrap all the images. This will save several hundred files in the img_urls
directory. After this run the concat_url_files
script to make one unified CVS file called img_url_list.csv
. Now you can run convert_img_url_list.py
to generate an image list with cleaned up labels. Right now this scrip just removes variation text. For example, many images are Quartz with variation of Capped Quartz, Chalcedony Quartz, etc. These extra labels get removed to just read Quartz. The script can be easily modified to clean the data in other ways though. The last script is download_images.py
that downloads all the images in the img_url_list_converted.csv
to a directory specified in the file. The total dataset is around 400G. Some images are extremely high resolution so the total space can be greatly reduce by resizing before saving however for my purposes I wanted the high resolution images and kept that in.