semi-super: A Jupyter Notebook repository from sbhimire

Semi/Self-Supervised Learning on a Pediatric Pneumonia Dataset

Live Deployment! The deployed SemiSuperCV app predicts if a patient has pneumonia based on a chest X-ray. You can try the app at https://webapp-mcdo4dmd2a-uc.a.run.app. The first couple of runs might have high latency due to cold start times on Google Cloud Run.

About

Fully supervised approaches need large, densely annotated datasets. Only hospitals that can afford to collect large annotated datasets can utilize these approaches to aid their physicians. The project goal is to utilize self-supervised and semi-supervised learning approaches to significantly reduce the need for fully labeled data. In this repo, you will find the project source code, training notebooks, and the final TensorFlow 2 saved model used to develop the web application for detecting pediatric pneumonia from chest X-rays.

The semi-/self-supervised learning framework used in the project comprises three stages:

Self-supervised pretraining
Supervised fine-tuning
Knowledge distillation using unlabeled data

Refer to Google research team's paper “SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners” for more details regarding the framework used.

The training notebooks for Stage 1, 2, and 3 can be found in the notebooks folder. The project report and the final presentation slides can be found in the docs folder.

Resources

ML workflow

App demo

Follow the instructions below to run the app locally with Docker.

Once you have Docker installed, clone this repo

git clone https://github.com/TeamSemiSuperCV/semi-super

Navigate to the webapp directory of the repo.

cd semi-super/webapp

Now build the container image using the docker build command. It will take few minutes.

docker build -t semi-super .

Start your container using the docker run command and specify the name of the image we just created:

docker run -dp 8080:8080 semi-super

After a few seconds, open your web browser to http://localhost:8080. You should see our app.

Findings

Stage 1 - Contrastive Accuracy

Labels	Stage 1 (self-supervised)
No labels used	99.99%

Stage 2 and 3 - Test Accuracy Comparison

Labels	FSL¹	Stage 2 (finetuning)	Stage 3 (distillation)
1%	85.2%	94.5%	96.3%
2%	87.2%	96.8%	97.6%
5%	86.0%	97.1%	98.1%
100%	98.9%

Despite using only a fraction of labels, our Stage 2 and Stage 3 models achieved test accuracies comparable to a 100 percent fully supervised (FSL) model. Refer to the project report and the final presentation slides in the docs folder for a more detailed discussion and findings.

Acknowledgements

We have taken the SimCLR framework code from Google Research and heavily modified it for the purpose of this project. We have added a knowledge distillation feature along with several other changes. Following these changes and improvements, knowledge distillation can be performed on Google Cloud TPU cluster, which reduces training time significantly.

Fully-supervised model performance for benchmarking purposes. ↩

sbhimire/semi-super