Classification Tasks

  1. Import the necessary data manipulation and visualization libraries
    import pandas as pd
    from matplotlib import pyplot as plt
    import seaborn as sns
    import numpy as np
  1. Import the machine learning algorithms from sklearn
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC, LinearSVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron
    from sklearn.linear_model import SGDClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans
  1. Import some utility and metric classes
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import KFold, cross_val_score
  1. Load the data (in this case we have separate datasets for training and testing)
  2. Explore the data to get a feel for it
  3. Write a block of code or a function (for both training and testing data) to:
       1. Get all the columns from the dataset
       2. Determine whether the values in each column are numeric
       3. Clean all numeric columns
       4. Get a list of all categorical columns with NaN values
  4. Get a count of all the NaN values in the categorical columns
  5. Write a function or a block of code (for both training and testing data) to:
       1. Find the most frequently occurring category in each categorical column
       2. Create a dictionary mapping each column to its most frequent category
  6. Using the dictionary and the fillna function, fill all NaN values in the categorical columns
  7. Ensure that there are no more NaN values in the categorical columns
  8. Get all the numeric columns that contain any NaN values
  9. Choose an appropriate method (for example, the column mean or median) to fill the NaN values in the numeric columns
  10. Ensure that there are no more NaN values in the numeric columns
  11. Get a list of all categorical columns
  12. Encode all the categorical columns as numeric values
  13. Split the training data into features and labels (X_train, Y_train)
  14. Split the test data into features and labels (X_test, Y_test)
  15. Train and test each of the classifiers imported from sklearn above, and keep track of their accuracies
  16. Use StandardScaler to scale all the attributes
  17. Apply PCA to both the training and test sets to try to improve performance
  18. Retrain the Random Forest and KNN classifiers
  19. Use K-fold cross-validation to determine the best value of K for the KNN classifier
  20. Retrain the KNN classifier with the chosen K
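
The missing-value steps (3 to 10) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `split_columns` and `fill_missing`, the toy DataFrame, and the choice of the median for numeric columns are all hypothetical and not part of the original checklist.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_columns(df):
    """Return (numeric, categorical) lists of column names."""
    numeric = [c for c in df.columns if is_numeric_dtype(df[c])]
    categorical = [c for c in df.columns if c not in numeric]
    return numeric, categorical


def fill_missing(df):
    """Fill NaN values: most frequent category for categorical columns,
    median for numeric columns (the median is an assumed choice)."""
    df = df.copy()
    numeric, categorical = split_columns(df)

    # Categorical columns that contain NaN values, with their NaN counts.
    cat_nan_counts = df[categorical].isna().sum()
    cat_with_nan = cat_nan_counts[cat_nan_counts > 0].index.tolist()

    # Dictionary mapping each such column to its most frequent category.
    most_common = {c: df[c].mode().iloc[0] for c in cat_with_nan}
    df = df.fillna(most_common)

    # Numeric columns that contain NaN values, filled with the column median.
    for c in numeric:
        if df[c].isna().any():
            df[c] = df[c].fillna(df[c].median())
    return df


# Toy data standing in for the real training set.
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.3, np.nan, 8.05],
    "port": ["S", "C", np.nan, "S"],
})
train_clean = fill_missing(train)

# Confirm that no NaN values remain in any column.
print(train_clean.isna().sum())
```

The same function can be called on the test DataFrame. Whether the test set should be filled with its own statistics or with those computed from the training set is a design choice the checklist leaves open.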
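
Encoding, splitting, and the first round of training (steps 11 to 15) might look like the sketch below. The toy data, the `make_frame` helper, the `label` column name, and the use of integer category codes are assumptions made for illustration; the checklist does not prescribe a particular encoding, and only a subset of the imported classifiers is shown.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_frame(n):
    """Hypothetical toy data standing in for the cleaned datasets."""
    x = rng.normal(size=n)
    colour = rng.choice(["red", "green", "blue"], size=n)
    label = ((x > 0) | (colour == "red")).astype(int)
    return pd.DataFrame({"x": x, "colour": colour, "label": label})


train, test = make_frame(200), make_frame(80)

# Encode each categorical column as integer codes, using the categories
# seen in the training data so both sets share one mapping.
categorical = [c for c in train.columns if not is_numeric_dtype(train[c])]
for col in categorical:
    categories = sorted(train[col].unique())
    train[col] = pd.Categorical(train[col], categories=categories).codes
    test[col] = pd.Categorical(test[col], categories=categories).codes

# Separate features from labels in both datasets.
X_train, Y_train = train.drop(columns="label"), train["label"]
X_test, Y_test = test.drop(columns="label"), test["label"]

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Train each classifier and record its accuracy on the test set.
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, Y_train)
    accuracies[name] = accuracy_score(Y_test, clf.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda item: -item[1]):
    print(f"{name}: {acc:.3f}")
```

Integer codes impose an arbitrary order on the categories, which tree-based models tolerate but distance-based and linear models may not; one-hot encoding is a common alternative.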
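
Scaling, PCA, and the search for K (steps 16 to 20) can be sketched as follows. The synthetic data from `make_classification`, the number of PCA components, and the candidate range for K are all assumptions chosen for illustration. The sketch also uses `train_test_split` only to create stand-in training and test sets, whereas the checklist assumes they are already separate.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the cleaned, encoded datasets.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler and PCA on the training set only, then transform both sets.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=5).fit(scaler.transform(X_train))
X_train_p = pca.transform(scaler.transform(X_train))
X_test_p = pca.transform(scaler.transform(X_test))

# Retrain the Random Forest on the transformed features.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train_p, Y_train)
forest_acc = accuracy_score(Y_test, forest.predict(X_test_p))

# Score each candidate K with 5-fold cross-validation on the training set.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_k, X_train_p, Y_train, cv=folds).mean()
best_k = max(cv_scores, key=cv_scores.get)

# Retrain KNN with the selected K and evaluate on the held-out test set.
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_p, Y_train)
knn_acc = accuracy_score(Y_test, knn.predict(X_test_p))

print(f"Random Forest accuracy after PCA: {forest_acc:.3f}")
print(f"Best K: {best_k}, KNN accuracy: {knn_acc:.3f}")
```

Fitting the scaler and PCA on the training data alone keeps information from the test set out of the transformation. Selecting K by cross-validation on the training set likewise leaves the test set untouched until the final evaluation.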