Defense System against Backdoor Attacks on DNNs

Backdoor detector for BadNets trained on the YouTube Face dataset

How to run the project

https://drive.google.com/drive/folders/1mf9UHHPq6tg8kZGlFTrFpCJfkg4zhWiB?usp=sharing

Add this Google Drive folder to your Drive and follow the notebook snippets. In the Google Drive folder, you can check the results in:

results folder
repaired-networks folder

Two similar folders (results_we_got and repaired-networks_we_got) contain the results we obtained.

Introduction

This project is about detecting backdoor attacks via input filters, neuron pruning, and unlearning. Given a trained DNN model, we have to find out whether there is any input trigger that produces misleading classifications when it is added to an input (i.e., adversarial images).

What is a backdoor attack?

To understand this, we first have to know what does not fall into this category:

It is not an image-specific modification (i.e., not an adversarial attack).
It isn't adversarial poisoning (where an incorrect label association is made at training time, or a trained model is modified).

A backdoor attack is one where unexpected results occur when a trigger is added to the input. If there is no trigger, the model behaves perfectly normally.

BadNets: generated by training the model on both adversarial images and actual images, which gives a 99% success rate. Another approach, the Trojan Attack (a more recent one), is far more efficient and requires less data.
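
As an illustration of how a BadNet-style training set could be built, here is a minimal NumPy sketch. It is not the code used in this project: the helper names (stamp_trigger, poison_dataset), the blending formula, the 2x2 corner trigger, and the poison fraction are all assumptions chosen for the example.

```python
import numpy as np

def stamp_trigger(images, mask, pattern):
    """Blend a trigger into images; mask is 1 where the trigger overwrites pixels."""
    return (1.0 - mask) * images + mask * pattern

def poison_dataset(images, labels, mask, pattern, target_label,
                   poison_fraction, seed=0):
    """Stamp the trigger on a random subset and relabel it to target_label.

    Hypothetical helper: returns copies, so the clean data stays untouched.
    """
    rng = np.random.default_rng(seed)
    n_poison = int(round(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)

    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    poisoned_images[idx] = stamp_trigger(images[idx], mask, pattern)
    poisoned_labels[idx] = target_label
    return poisoned_images, poisoned_labels, idx

# Toy data: ten black 8x8 RGB images with labels 0..4 (illustrative only).
images = np.zeros((10, 8, 8, 3), dtype=np.float32)
labels = np.arange(10) % 5

# Assumed trigger: a white 2x2 square in the bottom-right corner.
mask = np.zeros((8, 8, 1), dtype=np.float32)
mask[-2:, -2:, :] = 1.0
pattern = np.ones((8, 8, 3), dtype=np.float32)

poisoned_images, poisoned_labels, idx = poison_dataset(
    images, labels, mask, pattern, target_label=0, poison_fraction=0.2
)
```

Training on the mix of clean and stamped images is what teaches the model to associate the trigger with the target label while keeping normal accuracy on clean inputs.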

Now, about the defense system against backdoors

Part 1: The given attack model

The given model is a backdoored DNN, and it only reveals the trigger (a collection of pixels and their associated colors) when it is used to predict (stealth).

Part 2: What we are going to accomplish

Detecting the backdoor and labeling it as a separate class.
Identifying the trigger used.
Lastly, repairing the backdoored DNN.

How we're going to detect backdoors

First, for a given target label, we find the minimal trigger that causes inputs from all other labels to be misclassified as that target label.
We repeat this for every label and then use outlier detection to find the real trigger, since the real trigger is much smaller than the others.
Once we have found which neurons are activated by the trigger, we remove the neurons related to the backdoor (patching the DNN via neuron pruning), or we unlearn the backdoor by retraining the model on correctly labeled inputs with the reversed trigger added.
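
The outlier-detection and pruning steps above can be sketched as follows. This is a hedged illustration, not the project's implementation: it assumes each label's reversed trigger is summarized by the L1 norm of its mask, uses the median absolute deviation (MAD) with the usual 1.4826 consistency constant and a threshold of 2, and ranks neurons by the gap between their mean activation on triggered and clean inputs. The function names and toy numbers are hypothetical.

```python
import numpy as np

def anomaly_indices(l1_norms):
    """Signed MAD-based anomaly index; negative means smaller than typical."""
    norms = np.asarray(l1_norms, dtype=np.float64)
    median = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - median))
    if mad == 0:
        # All norms are (nearly) identical, so nothing stands out.
        return np.zeros_like(norms)
    return (norms - median) / mad

def flag_backdoor_labels(l1_norms, threshold=2.0):
    """Return labels whose trigger is abnormally small (assumed threshold)."""
    index = anomaly_indices(l1_norms)
    return [int(i) for i in np.where(index < -threshold)[0]]

def pruning_mask(clean_acts, triggered_acts, prune_fraction):
    """Zero out the neurons that respond most strongly to the trigger.

    Both activation arrays have shape (num_inputs, num_neurons).
    """
    gap = triggered_acts.mean(axis=0) - clean_acts.mean(axis=0)
    n_prune = int(round(prune_fraction * gap.size))
    mask = np.ones(gap.size)
    if n_prune > 0:
        mask[np.argsort(gap)[-n_prune:]] = 0.0
    return mask

# Toy L1 norms of the reversed trigger for 7 labels; label 3 is tiny.
norms = [95.0, 102.0, 99.0, 12.0, 105.0, 98.0, 101.0]
suspects = flag_backdoor_labels(norms)

# Toy activations for a 5-neuron layer: neuron 2 fires only on triggered inputs.
clean_acts = np.zeros((4, 5))
triggered_acts = np.zeros((4, 5))
triggered_acts[:, 2] = 5.0
mask = pruning_mask(clean_acts, triggered_acts, prune_fraction=0.2)
```

Multiplying the layer's output by the mask disables the selected neurons. In practice the prune fraction would be tuned so that the backdoor stops working while clean accuracy stays acceptable.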

shuklashwin/CSAW-HackML-using-Neural-Cleanse