
A guide for SpaceML’s machine learning pipeline


Curator, the guide 🌎

This guide covers SpaceML’s machine learning pipeline, which has seven components, summarized below. Each program plays a different role in the pipeline: downloading satellite images, labeling images, training a machine learning model, improving an existing model, and running image similarity search. The programs can be used all together, but you can also use just one or a few of them, depending on your needs. Throughout this guide, we showcase a few ways to combine them.

 

Program description & guide

GIBS Downloader is a tool for downloading Earth images. You can download NASA satellite imagery for the areas and time periods you specify, which makes it useful for building an Earth image dataset.

Self-Supervised Learner is a self-supervised learning program for training a machine learning model with less labeled data. You can train an encoder on unlabeled data, then train a classifier with less labeled data than supervised learning would need.

Image Similarity Search is a reverse image search app. Once you have a dataset and a model trained on it, Image Similarity Search computes similarities between the images in the dataset and shows you the images most similar to one you pick. This can serve as a sanity check that your model is trained well.
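
To make the idea concrete, here is a minimal sketch of similarity ranking over image embeddings. It assumes each image has already been turned into an embedding vector by a trained encoder, and it uses cosine similarity as the metric; the metric, the function names, and the tiny 3-dimensional vectors are illustrative assumptions, not the app's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_name, embeddings, k=3):
    """Return the k images most similar to query_name, best match first."""
    query = embeddings[query_name]
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in embeddings.items()
        if name != query_name  # do not return the query image itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real encoder produces far longer vectors.
embeddings = {
    "forest_01.png": [0.9, 0.1, 0.0],
    "forest_02.png": [0.8, 0.2, 0.1],
    "desert_01.png": [0.1, 0.9, 0.2],
    "ocean_01.png": [0.0, 0.2, 0.9],
}
results = most_similar("forest_01.png", embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

If the images ranked highest for a query look unrelated to it, that is a hint the encoder has not learned useful features yet.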

The ‘Image Similarity Search’ app works well with up to 3 million images. For scalable image similarity search on larger datasets, we used Index & Search (GCP), which runs on Google Cloud Platform. First, we saved the dataset and model obtained from GIBS Downloader and Self-Supervised Learner to a Google Cloud Storage bucket. We then had two APIs: ① the Index API and ② the Search API. With the Index API, we generated embeddings, an indexer file, and a metadata file on a Google Compute Engine VM. NVIDIA DALI and FAISS were used to make the process more efficient. We then deployed the Search API, which was built with FastAPI for minimal latency, to Google App Engine for live image similarity search. Google Cloud Functions helped make GCP easy to use throughout the process. To get a glimpse of how the Index API works, check out this sample notebook.
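
The split between an indexing step and a search step can be sketched locally as follows. This is a simplified stand-in under stated assumptions: it uses a brute-force squared L2 distance in plain Python rather than FAISS, writes to a temporary local directory rather than a Cloud Storage bucket, and the file names `index.json` and `metadata.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_index(embeddings, out_dir):
    """Index step: write an index file (vectors) and a metadata file (image paths)."""
    out_dir = Path(out_dir)
    paths = sorted(embeddings)
    vectors = [embeddings[path] for path in paths]
    # Row i of the index corresponds to entry i of the metadata.
    (out_dir / "index.json").write_text(json.dumps(vectors))
    (out_dir / "metadata.json").write_text(json.dumps(paths))


def search(query, index_dir, k=2):
    """Search step: load both files and return the k nearest image paths."""
    index_dir = Path(index_dir)
    vectors = json.loads((index_dir / "index.json").read_text())
    paths = json.loads((index_dir / "metadata.json").read_text())
    distances = [
        (sum((q - v) ** 2 for q, v in zip(query, vector)), row)
        for row, vector in enumerate(vectors)
    ]
    distances.sort()  # smallest squared L2 distance first
    return [paths[row] for _, row in distances[:k]]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical 2-dimensional embeddings keyed by image path.
    build_index({"a.png": [0.0, 0.0], "b.png": [1.0, 1.0], "c.png": [0.1, 0.0]}, tmp)
    hits = search([0.0, 0.1], tmp, k=2)
    print(hits)
```

Keeping the metadata in a separate file means the search step only needs row numbers from the index and can map them back to image paths afterwards.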

Swipe Labeler is a GUI-based image labeling program. You can label images by swiping right or left, clicking accept or reject, or pressing the right or left arrow key. Multiple people can use Swipe Labeler at the same time without overwriting each other's labels, so you can label quickly with your teammates.

Active Labeler is a program designed to improve your model efficiently. Once you have a trained model, Active Labeler picks out the images the model has the most difficulty with. You then label those images through Swipe Labeler and retrain the model with the newly labeled images so that it can overcome its weaknesses.
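
One common way to decide which images a model "has the most difficulty with" is uncertainty sampling. The sketch below ranks images by the entropy of the model's predicted class probabilities; this selection rule, the function names, and the example probabilities are assumptions for illustration and may differ from the strategy Active Labeler actually uses.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_hardest(predictions, n):
    """Return the n image names whose predictions are most uncertain."""
    ranked = sorted(
        predictions,
        key=lambda name: entropy(predictions[name]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical class probabilities from a two-class model.
predictions = {
    "img_a.png": [0.98, 0.02],
    "img_b.png": [0.55, 0.45],
    "img_c.png": [0.70, 0.30],
    "img_d.png": [0.50, 0.50],
}
to_label = pick_hardest(predictions, n=2)
print(to_label)
```

The selected images would then go to a human labeler, and the model would be retrained with the new labels.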

A Chrome extension for finding similar images on the NASA Worldview website. Take a snapshot of a particular scene in a satellite image on the website, and the extension will show you satellite images similar to it.

 

Combination guide

 

Required dataset format

Self-Supervised Learner, Image Similarity Search, Index & Search (GCP), and Active Labeler require the dataset to be organized in the PyTorch ImageFolder format, like this:

/Dataset
    /Class 1
        Image1.png
        Image2.png
    /Class 2
        Image3.png
        Image4.png
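
Before running any of these programs, it can help to check that a folder really follows this layout. The sketch below uses only the Python standard library; the function name `scan_image_folder` and the list of accepted extensions are our own assumptions, and the example builds a throwaway copy of the layout above in a temporary directory.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def scan_image_folder(root):
    """Map each class folder name to the sorted image file names inside it."""
    root = Path(root)
    classes = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images = sorted(
            f.name
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        classes[class_dir.name] = images
    return classes


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the example layout with empty placeholder files.
    example = [
        ("Class 1", "Image1.png"),
        ("Class 1", "Image2.png"),
        ("Class 2", "Image3.png"),
        ("Class 2", "Image4.png"),
    ]
    for class_name, file_name in example:
        class_dir = Path(tmp) / class_name
        class_dir.mkdir(exist_ok=True)
        (class_dir / file_name).touch()
    layout = scan_image_folder(tmp)
    print(layout)
```

A class that maps to an empty list, or a dataset with no class folders at all, is a sign that the images are not where the programs expect them.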

The UC Merced Land Use dataset, which is used in some of our guide notebooks, is a good example:

/UCMerced_LandUse
    /Images
        /agricultural
            agricultural00.tif
            agricultural01.tif
            ...
        /airplane
            airplane00.tif
            airplane01.tif
            ...
        /...

If there are no labels, you can organize the images like this:

/Dataset
    /Unlabelled
        Image1.png
        Image2.png
        Image3.png
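
If your unlabeled images currently sit loose in the dataset folder, a short script can move them into a single subfolder. This is a minimal sketch using only the standard library; the helper name `wrap_unlabelled` is hypothetical, and it moves every regular file it finds, so run it only on a folder that contains nothing but images.

```python
import shutil
import tempfile
from pathlib import Path


def wrap_unlabelled(dataset_dir, folder_name="Unlabelled"):
    """Move loose files in dataset_dir into dataset_dir/folder_name."""
    dataset_dir = Path(dataset_dir)
    target = dataset_dir / folder_name
    target.mkdir(exist_ok=True)
    moved = []
    # sorted() snapshots the listing before any file is moved.
    for item in sorted(dataset_dir.iterdir()):
        if item.is_file():
            shutil.move(str(item), str(target / item.name))
            moved.append(item.name)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files standing in for real images.
    for name in ["Image1.png", "Image2.png", "Image3.png"]:
        (Path(tmp) / name).touch()
    moved = wrap_unlabelled(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    wrapped = sorted(p.name for p in (Path(tmp) / "Unlabelled").iterdir())
    print(remaining, wrapped)
```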

Citation

If you find Curator useful in your research, please consider citing the GitHub repository for this tool:

@code{
  title={Curator: A No-Code, Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery},
  url={https://github.com/spaceml-org/Curator-Unlabeled-Image-Search-Guide},
  year={2021}
}