/Labeled-Faces-in-the-Wild

Based on generalisation behaviour predicted by k-fold cross-validation, the best learning model is picked to label face images.

Primary LanguageJupyter Notebook

\documentclass{article}

\usepackage{hyperref}

\title{README \\ CS3244 Homework 3}
\author{Thenaesh Elango, Lee Yi Min, Lim Jia Yee}

\begin{document}

\maketitle
\tableofcontents
\newpage

% 1
\section{Introduction}
We can recognise faces after seeing them for a few times. We have essentially ``learnt" how to recognise these faces, especially from the more distinctive features of these faces. They include the shape of the face, the curvature of the eyes, the slope of the nose, the thickness of the lips, and so on. We aim to extend this ``learning" to computer programs.

% 2
\section{Description}
In the field of machine learning, this problem of face recognition, given previously labelled faces, is a classification problem. These previously labelled data will be referred to as the training data. \\

The learner, essentially a computer program, attempts to create a general formula based on the training data to ``recognise" other face images not in the training data.

% 2.1
\subsection{Data Sets}
There are 966 training samples and 322 test samples. Each sample has 1850 features, where each feature is simply a pixel. The faces belong to a total of 7 people, and thus there are 7 classes.

% 3
\section{Learning Models Used}
% 3.1
\subsection{Initial Approach: Raw Support Vector Machines and Neural Networks}
The initial models used were SVMs. All 966 training samples and their 1850 features were passed into the SVM. This means there was no feature dimensionality reduction and no cross-validation. \\

There were a total of more than 80 SVM models with various combinations of arguments, namely kernels, costs, gammas (applicable for radial basis function (RBF) and polynomial kernel only), degrees (applicable for polynomial kernel only). \\

The kernels used were the linear kernel, the polynomial kernel of degrees 2 to 4 inclusive, and the RBF kernel. \\

The following ranges are of step size power of 10: The range of costs used was 0.01 to 100 inclusive. The range of gammas used was 0.0001 to 1 inclusive.

For the SVM models using the RBF kernel, all of the test samples were classified under the same class, which seemed to be suspicious. On the other hand, for the SVM models using the linear or the polynomial kernels, the test samples had a diverse classification, and the SVM models generally agree on the classification for each of the test samples for most of the combinations of the arguments. \\

A possible explanation for the uniform classification under the RBF kernel could be that RBF is intolerant to slack \cite{bib-01} and thus could not separate the data points which were likely to be close together due to the sheer number of features, many of which could be similar as the human face does not deviate drastically. \\

As for the linear and polynomial kernels, the classifications were agreeable amongst these models due to:
\begin{enumerate}
	\item Rather low degrees compared to the number of features
	\item Arguments for costs, gamma (if applicable) are within a small range
\end{enumerate}

For any disagreements, the mode of the classification was chosen. \\

The simple SVM models were decent as the first step, but could have been improved in the following:
\begin{enumerate}
	\item Ensure the images were oriented in the same angle
	\item Gather the more significant features (or pixels) for classification to reduce noise, time, and space complexity
	\item Need to try out other models and also think about whether the possible natures of the data set could affect the learning of the different models
\end{enumerate}

We tried using neural networks in place of SVMs, and the results were similar.

As such, we need a new approach that takes into account the above-mentioned improvements.
\newpage

% 3.2
\subsection{Chosen Approach: Neural Networks with Image Alignment, Scaling and PCA}
\label{sec:3.2}
\paragraph{}
In our chosen approach, we use a method known as principal component analysis, which uses unsupervised learning to identify key features and discard those features that are not essential. We then scale the data to have a zero mean and unit variance.

\paragraph{}
Before scaling and applying PCA, however, we preprocess the images to remove some unnecessary artifacts that naturally manifest in photos. In particular, we want to make sure that the images are upright and that the face in each photo is not slanted. To do this, we perform an operation to align the eyes each image in a standard manner.

\paragraph{}
After performing the preprocessing on the training points, we perform a 10-fold split of the preprocessed training set $(X,y)$ into all possible combinations of $X_{train}, y_{train}, X_{test}, y_{test}$. For each $X_{train}, y_{train}, X_{test}, y_{test}$, we train a neural network on $X_{train}, y_{train}$, and subsequently use that neural network to predict $X_{test}$ and compute the $F_1$ score of the prediction against $y_{test}$. We then have one neural network and $F_1$ score for each possible split. We keep the neural network with the best $F_1$ score and discard the rest. This neural network, together with the associated PCA transformer and scaler, forms the trained model.

\paragraph{}
The training in our chosen approach may be outlined as follows:
\begin{enumerate}
	\item Process the images in the training set to make sure that they are oriented in the same angle and are aligned.
	\item Scale the images in the training set to make sure the classifiers used behave optimally.
	\item Reduce the number of features in the training set.
	\item Train a neural network on the training set.
\end{enumerate}

\paragraph{}
When using the trained model to predict the classifications of a dataset $X'$, we first perform the image alignment preprocessing on $X'$. We then scale and apply the same PCA transform we applied to train the model. Finally, we use the neural network to predict the model.

\paragraph{}
This approach gives generally much better $F_1$ scores, typically crossing $0.9$.
\newpage

% 3.3
\subsection{Other Approaches}
% 3.3.1
\subsubsection{Decision Trees, and Adaptive Boosting (Adaboost)}
Since the features are simply about the individual pixels in the face images, the training data used in this section's approach also underwent PCA transformation in order to attain the more significant features out of all the individual pixels. \\

A reduced number of features is especially important for decision trees (and Adaboost) because it is expected that each individual pixel will not be effective in maximising information gain. Each pixel is not only affected by the face in the image, but also external noise such as lighting, orientation, and shadow. Hence, an individual pixel by itself may not be effective ``decision makers". \\

However, despite feature dimensionality reduction with PCA, the F1 scores for both decision trees and Adaboost classifiers lose out to the other classifiers (SVM and neural networks). The F1 score for decision trees has an average of 0.39 while that for the Adaboost is 0.41. \\

Adaboost is an ensemble method which uses several weak learners (i.e. decision stumps in this context) to combine into a stronger learner. Since Adaboost tries to correct the misclassifications, anomalous or noisy data will steer Adaboost away from the true classification as the learning algorithm attempts to classify these data points with deceiving features \cite{bib-02}. \\

It is expected that the data is noisy, as several external factors affect the quality of the images as discussed earlier in this section.

% 3.3.2
\subsubsection{Bagging}
It was noticed that 966 training samples may not be enough for the most accurate classification. In this section, bagging (also known as bootstrap aggregating) was attempted on the SVM model in \hyperref[sec:3.3]{Section 3.3 Final Approach}. \\

However, there was no significant increase in the F1 score for bagging in SVM, compared to no bagging in SVM (still 0.89).

% insert something to explain why?
\newpage

% 4
\section{File Listing}
% Need halp from everybody here!
\newpage

% 5
\section{Statement of Teams' Independent Work}
Please initial (between the square brackets) one of the following statements. \\

[X] We, A0124772E, A0121299A and A0136070R, certify that we have followed the CS3244 Machine Learning class guidelines for homework assignments. In particular, we expressly vow that we have followed the Facebook rule in discussing with others (out of our team) in doing the assignment and did not take notes (digital or printed) from the discussions. \\

[] We, $<$substitute your matric numbers here$>$, did not follow the class rules regarding the homework assignment, because of the following reason: $<$Please fill in$>$
We suggest that we should be graded as follows: $<$Please fill in$>$
\newpage

\begin{thebibliography}{20}
% bib-01
\bibitem{bib-01}
	Martin C.,
	\emph{\href{https://charlesmartin14.wordpress.com/2012/02/06/kernels_part_1/}{What is an RBF Kernel?}},
	2012, Feb 6.
	
% bib-02
\bibitem{bib-02}
	Brownlee J.,
	\emph{\href{http://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/}{Boosting and AdaBoost for Machine Learning}},
	2016, Apr 25.
	
% bib-03
\bibitem{bib-03}
	Geitgey A.,
	\emph{\href{https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78\#.e531kjh1o}{Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning}},
	2016, Jul 24.
	
% bib-04
\bibitem{bib-04}
	Mallick S.,
	\emph{\href{http://www.learnopencv.com/average-face-opencv-c-python-tutorial/}{Average Face : OpenCV ( C++ / Python ) Tutorial}},
	2016, May 7.
	
% bib-05
\bibitem{bib-05}
	\emph{\href{http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html}{Faces recognition example using eigenfaces and SVMs}}
	
% bib-05
\bibitem{bib-05}
	\emph{\href{scikit-learn.org/stable/modules/neural_networks_supervised.html}{Neural network models (supervised)}}
\end{thebibliography}
\end{document}