Fully supervised approaches need large, densely annotated datasets, so only hospitals that can afford to collect such datasets can use these approaches to aid their physicians. The goal of this project is to use self-supervised and semi-supervised learning to significantly reduce the need for fully labelled data. In this repo, you will find the project source code, the training notebooks, and the final TensorFlow 2 SavedModel used to build the web application for detecting Pediatric Pneumonia from chest X-rays.
The semi/self-supervised learning framework used in the project comprises three stages:
- Self-supervised pretraining
- Supervised fine-tuning with active learning
- Knowledge distillation using unlabeled data
Refer to the Google Research team's paper (SimCLRv2 - Big Self-Supervised Models are Strong Semi-Supervised Learners) for more details on the framework.
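For intuition, Stage 1 pretraining optimizes SimCLR's NT-Xent (normalized temperature-scaled cross-entropy) contrastive objective. Below is a minimal TensorFlow sketch of that loss; the function name and temperature value are illustrative assumptions, not the repo's exact code.

```python
import tensorflow as tf

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent sketch. z1, z2: [batch, dim] projections of two augmented views."""
    z1 = tf.math.l2_normalize(z1, axis=1)
    z2 = tf.math.l2_normalize(z2, axis=1)
    batch = tf.shape(z1)[0]
    z = tf.concat([z1, z2], axis=0)                        # [2B, dim]
    sim = tf.matmul(z, z, transpose_b=True) / temperature  # pairwise similarity logits
    sim = sim - tf.eye(2 * batch) * 1e9                    # mask out self-similarity
    # The positive for view i is the other view of the same image: i <-> i + B.
    labels = tf.concat([tf.range(batch) + batch, tf.range(batch)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, sim, from_logits=True)
    return tf.reduce_mean(loss)
```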
The training notebooks for Stages 1, 2, and 3 can be found in the notebooks folder. Notebooks for Selective Labeling (active learning) using the Entropy or Augmentations policies can be found in the Active_Learn folder. We also evaluated another semi-supervised learning approach, FixMatch. Benchmarks for fully supervised learning can be found in the FSL_Benchmarks folder, and the data preprocessing code in the Data_Preparation folder.
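As a rough illustration of the Entropy policy, the sketch below scores unlabeled images by predictive entropy and surfaces the most uncertain ones for labeling. Here `model` (assumed to output softmax probabilities) and `unlabeled_ds` (an ordered `tf.data` dataset of image batches) are hypothetical stand-ins, not names from the notebooks.

```python
import numpy as np
import tensorflow as tf

def select_by_entropy(model, unlabeled_ds, k=100):
    """Return dataset indices of the k most uncertain examples."""
    entropies = []
    for batch in unlabeled_ds:                    # image batches, no labels
        probs = model(batch, training=False)      # [B, num_classes] probabilities
        ent = -tf.reduce_sum(probs * tf.math.log(probs + 1e-12), axis=1)
        entropies.append(ent.numpy())
    entropies = np.concatenate(entropies)
    return np.argsort(entropies)[::-1][:k]        # highest-entropy examples first
```

The selected examples would then be sent for annotation and added to the labelled pool for the next round of fine-tuning.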
| Labels | Stage 1 (Self-Supervised) Contrastive Accuracy |
| --- | --- |
| No labels used | 99.99% |
Contrastive accuracy measures how invariant the model's learned representations are to image augmentations.
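One way to compute it (an illustrative sketch, not necessarily the notebooks' exact metric) is to embed two augmented views of each image in a batch and count how often a view's nearest neighbor is its own counterpart:

```python
import tensorflow as tf

def contrastive_accuracy(z1, z2):
    """z1, z2: [batch, dim] embeddings of two augmented views of the same images."""
    z1 = tf.math.l2_normalize(z1, axis=1)
    z2 = tf.math.l2_normalize(z2, axis=1)
    sim = tf.matmul(z1, z2, transpose_b=True)   # [B, B] cosine similarities
    # Row i is correct when its maximum falls on the diagonal (its own other view).
    preds = tf.argmax(sim, axis=1, output_type=tf.int32)
    targets = tf.range(tf.shape(z1)[0])
    return tf.reduce_mean(tf.cast(preds == targets, tf.float32))
```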
| Labels | FSL (Benchmark) | Stage 2 (Fine-tuning) | Stage 3 (Distillation) |
| --- | --- | --- | --- |
| 1% | 85.2% | 94.5% | 96.3% |
| 2% | 85.1% | 96.8% | 97.6% |
| 5% | 86.0% | 97.1% | 98.1% |
| 100% | 98.9% | N/A | N/A |
Despite needing only a small fraction of the labels, our Stage 2 and Stage 3 models achieve test accuracies comparable to the fully supervised (FSL) model trained on 100% of the labels. Refer to the Project Report and the Final Presentation for a more detailed discussion of the findings.
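For reference, Stage 3 follows SimCLRv2's recipe of training a student to match the teacher's softened predictions on unlabeled images. Below is a minimal sketch of such a distillation step; the temperature and the step structure are assumptions, not the repo's exact implementation.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between softened teacher and student distributions."""
    t = tf.nn.softmax(teacher_logits / temperature)
    log_s = tf.nn.log_softmax(student_logits / temperature)
    return -tf.reduce_mean(tf.reduce_sum(t * log_s, axis=1))

@tf.function
def distill_step(teacher, student, optimizer, images):
    teacher_logits = teacher(images, training=False)  # frozen teacher
    with tf.GradientTape() as tape:
        student_logits = student(images, training=True)
        loss = distillation_loss(teacher_logits, student_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```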
You can run the app locally if you have Docker installed. First, clone this repo:
```bash
git clone https://github.com/TeamSemiSuperCV/semi-super
```
Navigate to the webapp directory of the repo:
```bash
cd semi-super/webapp
```
Build the container image using the `docker build` command (this will take a few minutes):
```bash
docker build -t semi-super .
```
Start the container using the `docker run` command, specifying the name of the image we just created:
```bash
docker run -dp 8080:8080 semi-super
```
After a few seconds, open your web browser to http://localhost:8080. You should see the app.
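You can also load the exported model directly in TensorFlow. Below is a minimal inference sketch; the model path, image file name, input size, and Keras-style export are assumptions, so check the repo's export code for the actual details.

```python
import tensorflow as tf

model = tf.keras.models.load_model("saved_model/")      # hypothetical path to the SavedModel
img = tf.io.decode_image(tf.io.read_file("xray.jpeg"),  # hypothetical chest X-ray file
                         channels=3, expand_animations=False)
img = tf.image.resize(img, (224, 224)) / 255.0          # assumed input size and scaling
probs = model(tf.expand_dims(img, 0), training=False)
print(probs.numpy())
```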
We took the SimCLR framework code from Google Research and heavily modified it for this project. We enhanced the knowledge distillation feature and made several other changes so the framework performs better on our dataset. With these improvements, knowledge distillation can run on Google Cloud TPU infrastructure, which significantly reduces training time.
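A TF2 TPU setup along these lines (a sketch; `tpu="local"` assumes a Cloud TPU VM, while other setups pass the TPU name or gRPC address instead):

```python
import tensorflow as tf

# Connect to and initialize the TPU, then build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Models and optimizers must be created inside the strategy scope so their
    # variables are replicated across TPU cores. ResNet-50 here is illustrative.
    student = tf.keras.applications.ResNet50(weights=None, classes=2)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
```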