Adversarial Sample Detection and Classification with Clustering

The class ClusteringAdvClassifier implements a scikit-learn-style classifier (with the fit, predict, and score methods implemented) that combines a trained model (provided by the user) with clustering algorithms to detect and classify adversarial samples. The basic principle behind the clustering approach is to detect high-magnitude changes in the provided model's output, indicated by an output falling in a different cluster than regular outputs of the same class. To determine which output cluster a sample's output is expected to fall into, the classifier also clusters the sample in the input space and predicts the expected output cluster from the cluster the input falls into. The training procedure works as follows (a code sketch follows the list):

  1. Cluster the classifier's inputs based on the provided labels (an SVM with an RBF kernel is applied to create a rounded boundary around the data).
  2. Run the provided training samples through the provided model, which is assumed to be already trained.
  3. Cluster the outputs of the provided model, using the same labels.
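
A minimal sketch of this training procedure, assuming the wrapped model exposes a predict method that returns per-class scores (the attribute and parameter names here are illustrative, not necessarily the repo's actual API):

```python
from sklearn.svm import SVC

class ClusteringAdvClassifier:
    def __init__(self, model):
        self.model = model                          # assumed to be already trained
        self.input_clusterer = SVC(kernel="rbf")    # "clusters" the input space
        self.output_clusterer = SVC(kernel="rbf")   # "clusters" the output space

    def fit(self, X, y):
        # 1. Cluster the inputs using the provided labels.
        self.input_clusterer.fit(X.reshape(len(X), -1), y)
        # 2. Run the training samples through the pre-trained model.
        outputs = self.model.predict(X)             # assumed per-class scores
        # 3. Cluster the model's outputs using the same labels.
        self.output_clusterer.fit(outputs, y)
        return self
```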

By training the clustering algorithm with the same labels for the input and output spaces, we can infer where we expect a sample's output to be clustered, based on which cluster the sample lands in within the input space. If a sample's output does not land in the output cluster predicted by its input cluster, we flag it as suspicious. For now, suspicious samples are classified simply by returning the classification given by the clustering algorithm rather than the one given by the model; any sample not flagged as suspicious is classified by the provided model. The intuition behind this method is that the objective of an adversarial attack is to change the image's pixels so as to cause the underlying model to misclassify it, while changing the pixels as little as possible. We therefore assume that adversarial images will differ relatively little from natural/clean images, whereas the difference in model output must be at least large enough to cause a misclassification.
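
Continuing the sketch above, the prediction step might look like the following, again with illustrative names and assuming integer class labels; how the repo actually reclassifies flagged samples may differ in detail:

```python
import numpy as np

# predict method for the ClusteringAdvClassifier sketched above
def predict(self, X):
    outputs = self.model.predict(X)                 # raw model output (per-class scores)
    model_labels = np.argmax(outputs, axis=1)       # the model's own classification
    expected = self.input_clusterer.predict(X.reshape(len(X), -1))
    observed = self.output_clusterer.predict(outputs)
    # A sample is suspicious when its output lands outside the cluster
    # implied by its input; those fall back to the clustering label.
    suspicious = observed != expected
    return np.where(suspicious, expected, model_labels)
```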

In this repo

The main testing and visualization of the Adversarial Sample Detection and Classification model is done in the Adversarial_Sample_Detection_with_Clustering Jupyter notebook. Several other notebooks were previously used to create visualizations that helped motivate our solution, along with some other ideas we've explored; they can largely be ignored for now and will likely be removed later. Other items of note:

  - LayerwiseClustering: a class that does not yet fully function as a classifier, but was used to create clusters at every major layer of the provided model. We used it to visualize the clusters at each layer, and it could in principle be extended so that our classifier looks at more than just the input and output layers. The per-layer visualizations can be seen in the View_Sample_At_Each_Layer Jupyter notebook.
  - model.py: defines a small VGG network that we use for testing.
  - train_model.py: a script for training and saving models. It trains both a base network, used as the underlying model for the clustering adversarial classifier, and an adversarially trained version of the same model, which serves as a baseline for comparison with existing adversarial sample classification methods.