- Avirup Das
- Ayush Thada
Clustering is usually used for unsupervised learning problems, but here we use it as a pre-processing tool for semi-supervised learning. If we only have a few labels, we can perform clustering and propagate those labels to all the instances in the same cluster (or only to the instances closest to the centroid, chosen by a percentile cut-off). This technique can greatly increase the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance.
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Fashion-MNIST was intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms: it shares the same image size and the same structure of training and testing splits.
We will use logistic regression and a 7-layer deep neural network to classify the Fashion-MNIST dataset. For each of these models, we first train on the whole training set (60,000 instances) and evaluate on the 10,000 test instances. This accuracy is our baseline, which we then try to improve upon.
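Below is a minimal sketch of how the two baselines could be set up. The exact 7-layer architecture and hyper-parameters used in the notebook are not reproduced here, so the small dense network shown is only illustrative.

```python
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

# Fashion-MNIST: 60,000 training and 10,000 test images of 28x28 pixels.
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Flatten each image to 784 features and scale pixel values to [0, 1].
X_train = X_train.reshape(-1, 28 * 28) / 255.0
X_test = X_test.reshape(-1, 28 * 28) / 255.0

# Baseline 1: Logistic Regression on the raw pixels.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic Regression baseline:", log_reg.score(X_test, y_test))

# Baseline 2: a small dense neural network (illustrative architecture only,
# not the notebook's 7-layer model).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.1)
print("Neural network baseline:", model.evaluate(X_test, y_test)[1])
```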
Model | Baseline Accuracy |
---|---|
Logistic Regression | 84.1% |
Neural Network | 89.68% |
The Fashion-MNIST dataset contains images of dimension 28x28, which are flattened into 784-dimensional feature vectors before being fed to the models.
First, we create a pipeline that clusters the training set into 100, 200, or 300 clusters, replaces each image with its distances to the cluster centroids, and then applies a Logistic Regression model on those distance features.
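A sketch of such a pipeline is shown below, assuming the flattened `X_train`/`y_train` arrays from the baseline sketch above; k = 200 is used here, and the notebook also runs k = 100 and k = 300.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # KMeans.transform() replaces each image with its distances to the k centroids.
    ("kmeans", KMeans(n_clusters=200, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print("Accuracy on distance features:", pipeline.score(X_test, y_test))
```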
Clusters | Accuracy |
---|---|
100 | 82.62% |
200 | 83.89% |
300 | 84.56% |
Next, we take n randomly chosen labelled instances (for n = 500, 1000, 2000) and check how our models perform in terms of accuracy.
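A sketch of this experiment for Logistic Regression is given below (the neural network is handled the same way); the random seed and exact sampling scheme are assumptions, as the notebook's choices are not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)  # arbitrary seed, for illustration only
for n in (500, 1000, 2000):
    idx = rng.choice(len(X_train), size=n, replace=False)  # n random labelled instances
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[idx], y_train[idx])
    print(n, "labelled instances ->", clf.score(X_test, y_test))
```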
Labelled instances | Logistic Regression | Neural Network |
---|---|---|
500 | 78.52% | 73.51% |
1000 | 79.24% | 76% |
2000 | 80.89% | 78.53% |
We can see that the neural network does not perform as well as Logistic Regression in this case, since we are training on very little data. On the other hand, the neural network takes much less time to train. We also see that both models perform better as the number of labelled instances grows, suggesting that further experiments should use a correspondingly large number of clusters.
Next, we cluster the training instances into k clusters (k = 500, 1000, 2000 for the neural network, and k = 2000 for Logistic Regression) and use the cluster centroids to train our models.
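Since centroid vectors have no labels of their own, the sketch below assumes that each centroid is given the label of the training image closest to it, which simulates manually labelling k representative images; whether the notebook uses exactly this labelling is not shown here. Logistic Regression stands in for both models to keep the sketch short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

k = 2000  # the notebook also uses 500 and 1000 clusters for the neural network
kmeans = KMeans(n_clusters=k, random_state=42)
X_dist = kmeans.fit_transform(X_train)   # distance of every image to every centroid
centroids = kmeans.cluster_centers_

# Assumption: label each centroid with the label of its nearest training image.
representative_idx = np.argmin(X_dist, axis=0)
y_centroids = y_train[representative_idx]

clf = LogisticRegression(max_iter=1000)
clf.fit(centroids, y_centroids)
print("Accuracy training on", k, "centroids:", clf.score(X_test, y_test))
```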
Clusters | Logistic Regression | Neural Network |
---|---|---|
500 | - | 75.5% |
1000 | - | 76.36% |
2000 | 81.46% | 80.24% |
Again we see that the accuracy increases as the number of clusters increases. Now we propagate the labels of these representative points (the centroids) to all the instances in the same cluster and run our models.
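Continuing from the k = 2000 clustering sketch above, full propagation can be sketched as follows: every training instance simply inherits the label assigned to its cluster's centroid.

```python
from sklearn.linear_model import LogisticRegression

# Every instance gets the label assigned to the centroid of its own cluster.
y_train_propagated = y_centroids[kmeans.labels_]

clf_full = LogisticRegression(max_iter=1000)
clf_full.fit(X_train, y_train_propagated)
print("Accuracy with fully propagated labels:", clf_full.score(X_test, y_test))
```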
Clusters | Logistic Regression | Neural Network |
---|---|---|
500 | - | 76.49% |
1000 | - | 77.80% |
2000 | 81.21% | 79.90% |
We do not see a significant difference in the results. This is because, by propagating the centroid labels to all instances, we have also included outliers and instances that are ambiguous about which cluster they belong to. So let us propagate the labels only to the instances that are close to their cluster centroid (within the 25th percentile of distances) and train our models again.
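A sketch of this partial propagation (continuing from the snippets above) computes, per cluster, the 25th percentile of distances to the centroid and keeps only the instances within that cut-off.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

percentile = 25
# Distance of every instance to the centroid of its own cluster.
dist_to_own_centroid = X_dist[np.arange(len(X_train)), kmeans.labels_]

# Per-cluster percentile cut-off; instances farther than it are dropped.
cutoff = np.full(len(X_train), np.inf)
for cluster_id in range(k):
    in_cluster = (kmeans.labels_ == cluster_id)
    cutoff[in_cluster] = np.percentile(dist_to_own_centroid[in_cluster], percentile)
keep = dist_to_own_centroid <= cutoff

clf_part = LogisticRegression(max_iter=1000)
clf_part.fit(X_train[keep], y_train_propagated[keep])
print("Accuracy with partially propagated labels:", clf_part.score(X_test, y_test))
```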
Clusters | Logistic Regression | Neural Network |
---|---|---|
500 | - | 75.98% |
1000 | - | 77.34% |
2000 | 80.44% | 79.42% |
Finally, for 2000 clusters, we try to find the optimum distance cut-off from the centroid (expressed as a percentile) that achieves the maximum accuracy.
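This sweep simply wraps the partial-propagation snippet above in a loop over percentiles; the notebook trains the neural network at each setting, while Logistic Regression is used below only to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

for percentile in (20, 25, 30, 50, 75):
    # Per-cluster cut-off at the current percentile (reusing dist_to_own_centroid,
    # y_train_propagated, kmeans and k from the snippets above).
    cutoff = np.full(len(X_train), np.inf)
    for cluster_id in range(k):
        in_cluster = (kmeans.labels_ == cluster_id)
        cutoff[in_cluster] = np.percentile(dist_to_own_centroid[in_cluster], percentile)
    keep = dist_to_own_centroid <= cutoff

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[keep], y_train_propagated[keep])
    print(f"{percentile}th percentile -> {clf.score(X_test, y_test):.4f}")
```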
Percentile Distance | NN Accuracy |
---|---|
20 | ~10% |
25 | 79.42% |
30 | 80.12% |
50 | 80.12% |
75 | 79.60% |
We can see that the optimum cut-off lies around the 30th to 50th percentile; accuracy drops on either side of this range. It should be noted that, although we could not improve upon our baseline accuracy using these semi-supervised learning techniques, in a situation where we receive completely unlabelled data they come in handy for boosting model accuracy after manually labelling only a small but reasonably significant portion of the data.
Note that the accuracy scores for the neural networks may vary due to random initialisation or GPU configuration. Also, we have not included the training output of the neural networks in the notebook to avoid making it unnecessarily long.
To load a saved joblib file:

```python
from joblib import load

clf = load('cluster_nn_200.joblib')
clf.transform(X_train)
```
To load a trained neural network:

```python
import tensorflow as tf

model_1 = tf.keras.models.load_model('nn_full_propagated_2000')
model_1.summary()
```
- log_reg_orig.joblib : Logistic Regression on original dataset
- log_reg_kmeans_0.joblib : Logistic Regression on distances after doing kmeans for 100 clusters
- log_reg_kmeans_1.joblib : Logistic Regression on distances after doing kmeans for 200 clusters
- log_reg_kmeans_2.joblib : Logistic Regression on distances after doing kmeans for 300 clusters
- log_reg_few_label_0.joblib : Logistic Regression trained on 500 random instances
- log_reg_few_label_1.joblib : Logistic Regression trained on 1000 random instances
- log_reg_few_label_2.joblib : Logistic Regression trained on 2000 random instances
- kmeans_2000.joblib : Kmeans on original training data with 2000 clusters
- log_reg_centroids.joblib : Logistic Regression trained using the centroids of kmeans_2000.joblib
- log_reg_propagated.joblib : Logistic Regression trained using the original data after propagating the labels of the centroids to the entire dataset
- log_reg_partially_propagated.joblib : Logistic Regression trained using the original data after propagating the labels of the centroids to those instances which fall within the 25th-percentile distance of the respective centroids
- nn_original : Neural Network trained using original dataset
- nn_labelled_500 : Neural Network trained on 500 random instances
- nn_labelled_1000 : Neural Network trained on 1000 random instances
- nn_labelled_2000 : Neural Network trained on 2000 random instances
- cluster_nn_500.joblib : Cluster with 500 centroids for Neural Network
- cluster_nn_1000.joblib : Cluster with 1000 centroids for Neural Network
- cluster_nn_2000.joblib : Cluster with 2000 centroids for Neural Network
- nn_centroid_cluster_500 : Neural Network trained using the centroids of cluster_nn_500.joblib
- nn_centroid_cluster_1000 : Neural Network trained using the centroids of cluster_nn_1000.joblib
- nn_centroid_cluster_2000 : Neural Network trained using the centroids of cluster_nn_2000.joblib
- nn_full_propagated_500 : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_500.joblib to the entire dataset
- nn_full_propagated_1000 : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_1000.joblib to the entire dataset
- nn_full_propagated_2000 : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to the entire dataset
- nn_partially_propagated_500 : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_500.joblib to those instances which fall within the 25th-percentile distance of the respective centroids
- nn_partially_propagated_1000 : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_1000.joblib to those instances which fall within the 25th-percentile distance of the respective centroids
- nn_partially_propagated_2000 : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to those instances which fall within the 25th-percentile distance of the respective centroids
- nn_partially_propagated_2000-clusters_20-percentile : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to those instances which fall within the 20th-percentile distance of the respective centroids
- nn_partially_propagated_2000-clusters_25-percentile : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to those instances which fall within the 25th-percentile distance of the respective centroids
- nn_partially_propagated_2000-clusters_30-percentile : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to those instances which fall within the 30th-percentile distance of the respective centroids
- nn_partially_propagated_2000-clusters_50-percentile : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to those instances which fall within the 50th-percentile distance of the respective centroids
- nn_partially_propagated_2000-clusters_75-percentile : Neural Network trained using the original data after propagating the labels of the centroids of cluster_nn_2000.joblib to those instances which fall within the 75th-percentile distance of the respective centroids
Link to output folder: https://mega.nz/folder/QHhyADyK#1rk56JFrTMZ-RXJpjzXKAg