Classification Tasks

Import the necessary data manipulation and visualization libraries

    import pandas as pd
    from matplotlib import pyplot as plt
    import seaborn as sns
    import numpy as np

Import the machine learning algorithms from sklearn

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC, LinearSVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron
    from sklearn.linear_model import SGDClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

Import some utility and metric classes

    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import KFold, cross_val_score

Load the data (in this case we have separate datasets for training and testing)
Explore the data to get a feel for it
Write a function or block of code (for both training and testing data) to:
   1. Get all the columns from the dataset
   2. Determine whether the values in each column are numeric
   3. Clean all numeric columns
   4. Get a list of all categorical columns with NaN values
Get a count of all the NaN values in the categorical columns
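
The column inspection steps above can be sketched as follows. This is a minimal illustration, not the repository's solution: the `train` frame, its column names, and the helper `split_columns` are all hypothetical stand-ins for the real dataset, and columns are classified by dtype only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real training data.
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "embarked": ["S", "C", np.nan, "S"],
    "cabin": [np.nan, "C85", np.nan, "C123"],
})


def split_columns(df):
    """Return (numeric_columns, categorical_columns), decided by dtype."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


numeric_cols, categorical_cols = split_columns(train)

# Count NaN values per categorical column and keep only columns that have any.
nan_counts = train[categorical_cols].isna().sum()
categorical_with_nan = nan_counts[nan_counts > 0]
print(categorical_with_nan)
```

The same function can be called on the test frame, so both datasets go through identical checks.
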
Write a function or block of code (for both training and testing data) to:
   1. Find the most frequently occurring category in each categorical column
   2. Create a dictionary mapping each column to its most frequent category
Using the dictionary and the fillna function, fill all NaN categorical values in the dataset.
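
One possible way to build the dictionary and pass it to `fillna` is shown below. The frame and column names are made up for illustration; `mode()` ignores NaN by default, and taking element `[0]` picks the first value when several categories tie.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real column names will differ.
train = pd.DataFrame({
    "embarked": ["S", "C", np.nan, "S"],
    "cabin": [np.nan, "C85", "C85", "C123"],
    "age": [22.0, np.nan, 35.0, 41.0],
})
categorical_cols = ["embarked", "cabin"]

# Map each categorical column to its most frequent (non-NaN) value.
most_common = {col: train[col].mode()[0] for col in categorical_cols}

# fillna accepts a {column: value} dict; columns not in the dict are left alone.
train = train.fillna(value=most_common)

print(most_common)
print(train[categorical_cols].isna().sum())
```

A design choice worth considering: computing the dictionary on the training data and reusing it for the test data keeps information from the test set out of the preprocessing.
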
Ensure that there are no more NaN values in the categorical columns
Get all the numeric columns that contain any NaN values
Choose an appropriate method to fill the NaN values in the numeric columns
Ensure that there are no more NaN values in the numeric columns
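
As a sketch of the numeric steps above, the example below fills each numeric column with its median. The median is only one reasonable choice (assumed here because it is robust to outliers); the mean or a model-based imputer may suit the real data better. The frame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real training data.
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "fare": [7.25, 71.28, np.nan, 53.10],
    "pclass": [3, 1, 3, 1],
})

# Numeric columns that contain at least one NaN.
numeric_cols = train.select_dtypes(include="number").columns
numeric_with_nan = [col for col in numeric_cols if train[col].isna().any()]

# Fill each affected column with its own median (computed ignoring NaN).
medians = train[numeric_with_nan].median()
train[numeric_with_nan] = train[numeric_with_nan].fillna(medians)

# Verify that no NaN values remain in the numeric columns.
remaining = int(train[numeric_cols].isna().sum().sum())
print(remaining)
```
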
Get a list of all categorical columns
Encode all the categories as numeric values
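
One common encoding is one-hot encoding with `pd.get_dummies`, sketched below on made-up data. This is an assumption about the intended technique; label or ordinal encoding are alternatives. Because the test set may lack categories seen in training, the sketch reindexes the test columns to match the training columns.

```python
import pandas as pd

# Hypothetical toy frames; "Q" appears only in the training data.
train = pd.DataFrame({"embarked": ["S", "C", "Q", "S"], "age": [22.0, 38.0, 26.0, 35.0]})
test = pd.DataFrame({"embarked": ["C", "S"], "age": [30.0, 40.0]})
categorical_cols = ["embarked"]

# One-hot encode the categorical columns in each frame.
train_encoded = pd.get_dummies(train, columns=categorical_cols)
test_encoded = pd.get_dummies(test, columns=categorical_cols)

# Align the test columns to the training columns; categories absent
# from the test set become all-zero columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
```
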
Split the training data into X_train (features) and Y_train (labels)
Split the test data into X_test (features) and Y_test (labels)
Using the machine learning algorithms imported in step 2, train and test each classifier and keep track of their accuracies.
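
The split and the train/test loop could look roughly like this. The data is synthetic (generated with `make_classification`), the target column name `label` is hypothetical, and only a subset of the imported classifiers is shown to keep the sketch short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned, encoded training and test sets.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
data = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
data["label"] = y  # hypothetical target column name
train, test = data.iloc[:225], data.iloc[225:]

# Separate features from labels in each dataset.
X_train, Y_train = train.drop(columns="label"), train["label"]
X_test, Y_test = test.drop(columns="label"), test["label"]

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Fit each classifier on the training set and record its test accuracy.
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, Y_train)
    accuracies[name] = accuracy_score(Y_test, clf.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Keeping the results in a dictionary makes it easy to compare against the accuracies obtained later, after scaling and PCA.
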
Use StandardScaler to scale all the attributes
Apply PCA to both the training and test sets to try to improve performance
Retrain the Random Forest and KNN classifiers
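
A sketch of the scaling, PCA, and retraining steps on synthetic data follows. The number of components (5) is an arbitrary assumption for illustration. Both the scaler and the PCA are fitted on the training set only and then applied to the test set, so the test data does not influence the transformation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrices and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = y[:150], y[150:]

# Fit the scaler on the training data, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit PCA on the scaled training data, then project both sets.
pca = PCA(n_components=5)  # assumed value; tune for the real data
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
explained = float(pca.explained_variance_ratio_.sum())
print(f"Variance retained: {explained:.3f}")

# Retrain Random Forest and KNN on the reduced features.
pca_accuracies = {}
for name, clf in {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}.items():
    clf.fit(X_train_pca, Y_train)
    pca_accuracies[name] = accuracy_score(Y_test, clf.predict(X_test_pca))

print(pca_accuracies)
```
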
Use KFold Cross Validation to determine the best value to use for K in the KNN classifier
Retrain the KNN classifier
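
The search for K can be sketched as below, again on synthetic data. The candidate range (odd values from 1 to 19) and the 5-fold split are assumptions; each candidate is scored with `cross_val_score` and the value with the highest mean accuracy is used to retrain the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared training features and labels.
X_train, Y_train = make_classification(n_samples=300, n_features=8, random_state=0)

# Reuse the same folds for every candidate so the scores are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated accuracy for each candidate number of neighbors.
mean_scores = {}
for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, Y_train, cv=kfold, scoring="accuracy")
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(f"Best K: {best_k}")

# Retrain KNN on the full training set with the selected K.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, Y_train)
```
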

nick-singh/classification-warmup
