Note: this is a failed project.
CZ4041 Machine Learning Assignment
Let's start by installing all required modules for this project. We install the following requirements from their official websites:
Note that sudo apt-get install cuda will install CUDA 7.5, but we need CUDA 7.0, so please install CUDA using sudo apt-get install cuda-7.0 instead.
If you have trouble waiting for NVIDIA approval so that you can download cuDNN, you can use this link to download the file.
Then, we install all other modules using the following commands:
pip install -r requirements.txt
Next, log in to Kaggle and download the dataset from the Kaggle competition. Then put all the downloaded files, i.e. sampleSubmission.csv, test.7z, train.7z, and trainLabels.csv, in the dataset folder. This folder will not be checked in to Git version control. To extract the .7z files, download 7-Zip from the official website, or use brew to install it if you're using Mac OS X:
brew install p7zip
# To check whether you finished your download
python src/check_dataset.py
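As a rough illustration of what such a check could look like, the sketch below reports which of the four expected files are missing from a folder. It is not the content of src/check_dataset.py; the function name missing_dataset_files is hypothetical, and the file list is simply the one given in the download step above.

```python
import os

# File names taken from the download step above.
EXPECTED_FILES = ["sampleSubmission.csv", "test.7z", "train.7z", "trainLabels.csv"]


def missing_dataset_files(folder="dataset"):
    """Return the expected file names that are not present in `folder`."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(folder, name))]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing files: " + ", ".join(missing))
    else:
        print("All dataset files are in place.")
```

An empty result means every file is in place; a partially finished download shows up as one or more names in the returned list.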
After installing 7-Zip, we extract the .7z files to the testdata and traindata folders. Similarly, these folders will not be checked in. You should be able to extract these archives easily on Windows. You can use the following commands if you're using Mac OS X or Linux:
# The extraction should take around 1 to 2 hours
7z e -y -otraindata dataset/train.7z
7z e -y -otestdata dataset/test.7z
After extracting the files, we might want to take a look at the data. The following command will open 10 random images from the traindata folder:
python src/show_random_picture.py
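A minimal sketch of the random selection step, assuming the extracted images are .png files sitting directly inside traindata (an assumption about the extracted layout). The function name pick_random_images is hypothetical and this is not the code in src/show_random_picture.py; opening the chosen files is left to whichever image viewer or library is available.

```python
import os
import random


def pick_random_images(folder="traindata", count=10, seed=None):
    """Return paths of up to `count` distinct .png files chosen at random from `folder`."""
    names = sorted(n for n in os.listdir(folder) if n.lower().endswith(".png"))
    rng = random.Random(seed)
    # sample() never repeats an item; cap the request at the number of files found.
    chosen = rng.sample(names, min(count, len(names)))
    return [os.path.join(folder, n) for n in chosen]
```

Calling pick_random_images() from the project root would return ten paths to display; passing a seed makes the selection repeatable.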
Now we want to convert all images to matrices, so that we can feed these matrices to our awesome classifier. We have written a script that converts each image to a three-dimensional 32 x 32 x 3 matrix, so the whole dataset is represented as a four-dimensional matrix (with an additional dimension for the number of images). The script also turns the training labels from the .csv file into a one-dimensional array. We use the following command to convert the images to matrices:
# The conversion will take around 5-10 minutes
# After the conversion, there will be two more .pickle files in the root directory
# cifar10_test.pickle should be around 3.69GB
# cifar10_train.pickle should be around 614.8MB
python src/images_to_matrices.py
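To make the shapes concrete, here is a hedged sketch of the label conversion and of the array sizes involved. It assumes that trainLabels.csv has a header row with id and label columns, that the Kaggle set holds 50,000 training and 300,000 test images, and that each pixel value is stored as a 4-byte float. The helper names encode_labels and dataset_nbytes are hypothetical, and the real script may work differently.

```python
import csv
import io


def encode_labels(csv_text):
    """Map the `label` column to integer class indices, in row order."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    classes = sorted({row["label"] for row in rows})
    index = {name: i for i, name in enumerate(classes)}
    return [index[row["label"]] for row in rows], classes


def dataset_nbytes(num_images, height=32, width=32, channels=3, bytes_per_value=4):
    """Size in bytes of a (num_images, height, width, channels) array."""
    return num_images * height * width * channels * bytes_per_value


sample = "id,label\n1,frog\n2,truck\n3,frog\n"
labels, classes = encode_labels(sample)
print(labels, classes)                  # [0, 1, 0] ['frog', 'truck']
print(dataset_nbytes(300000) / 1e9)     # 3.6864
print(dataset_nbytes(50000) / 1e6)      # 614.4
```

Under these assumptions the raw arrays come to roughly 3.69GB and 614.4MB, which is in line with the file sizes quoted in the comments above; the small remainder in the training file would plausibly be the label array and pickle overhead.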
At this point, we have the compressed data in the dataset folder, the images in the testdata and traindata folders, and two .pickle files in the root directory. Our folder structure should look like the following:
| cifar10
| -- dataset
| ---- sampleSubmission.csv
| ---- test.7z
| ---- train.7z
| ---- trainLabels.csv
| -- src
| ---- ....
| -- testdata
| ---- "a bunch of images"
| -- traindata
| ---- "a bunch of images"
| cifar10_test.pickle
| cifar10_train.pickle
| ...
We are ready to start our development now.
If you do not have a GPU with a compute capability of 3.5 or above, you probably want to use an Amazon EC2 instance. For this project, we can use a g2.2xlarge GPU instance with the image ami-cf5028a5 in the North Virginia region. This image comes with TensorFlow with GPU support installed. You need a total of at least 14 GB of EBS volume.
We cannot use wget or curl to download data from Kaggle. The only solution is to use Lynx (a text-based web browser) to download the files:
- Install Lynx
- Create a ~/.lynxrc configuration file
- Call the browser
lynx -cfg=~/.lynxrc www.kaggle.com
- Log in, browse to the competition data page and accept the terms and permissions (if you haven't yet)
- Select the link to the file you want to download, and press "d". The download will start.
- Once the download is finished, select "save file to disk" and provide the filename/destination where you want to store the data
- Repeat for other files
# ~/.lynxrc configuration file
SET_COOKIES:TRUE
ACCEPT_ALL_COOKIES:TRUE
PERSISTENT_COOKIES:TRUE
COOKIE_FILE:~/.lynx_cookies
COOKIE_SAVE_FILE:~/.lynx_cookies
You cannot use a linked account (Google, Yahoo, etc.) to log in through Lynx. In order to log in to Kaggle, you need to set up a username and password, as described on the Kaggle blog.
Source: Kaggle Forum
Source: Erikbern Github Gist
To train a logistic regression classifier using 1000 training examples:
# Accuracy is around 9.7%
python src/logistic_regression.py
To train a multinomial logistic regression classifier with stochastic gradient descent:
# Accuracy is around 10.0%
python src/multinomial_logistic_regression.py
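For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of multinomial (softmax) logistic regression trained with stochastic gradient descent on a toy two-dimensional dataset. It is not the code in src/multinomial_logistic_regression.py, which may well differ; every function name, the learning rate, the epoch count, and the toy data are assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score w_k . x + b_k for every class k."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train_sgd(samples, labels, num_classes, learning_rate=0.5, epochs=200, seed=0):
    """Fit softmax regression by updating on one shuffled example at a time."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            probs = softmax(class_scores(weights, biases, x))
            for k in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. score k: p_k - 1[k == y].
                error = probs[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    weights[k][j] -= learning_rate * error * x[j]
                biases[k] -= learning_rate * error
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: three well-separated clusters in two dimensions.
samples = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, -1.0), (-0.9, -0.8)]
labels = [0, 0, 1, 1, 2, 2]
weights, biases = train_sgd(samples, labels, num_classes=3)
accuracy = sum(predict(weights, biases, x) == y for x, y in zip(samples, labels)) / len(samples)
print(accuracy)
```

For context, CIFAR-10 has ten classes, so an accuracy near 10% like the figures quoted above is about what uniform random guessing would give.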
To train a multilayer perceptron classifier with stochastic gradient descent:
python src/multilayer_neural_network.py
Please follow the guide in CONTRIBUTING.md