Raw ML Classifiers

This repository contains implementations of six Machine Learning Classifiers using only standard Python libraries such as NumPy and Pandas.

Dependencies

  • Compatible with Python 3.6

Models

This repository includes implementations of the following six models:

  • Naïve Bayes Classifier at ./src/naive_bayes.py
  • Decision Tree at ./src/decision_tree.py
  • Logistic Regression at ./src/logistic_regression.py
  • Random Forest at ./src/random_forest.py
  • Adaboost at ./src/adaboost.py
  • Stacking at ./src/stacking.py

Documentation:

Naïve Bayes Classifier

...

class NaiveBayes(type = 'Gaussian', prior = None)

This class implements Naïve Bayes with Gaussian and Multinomial types.

Parameters type: str, ‘Gaussian’ or ‘Multinomial’, default: ‘Gaussian’
Specifies whether the ‘Gaussian’ or the ‘Multinomial’ variant of Naïve Bayes is to be used.

prior: array-like, shape (n_classes,)
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.

Attributes class_log_prior_: array, shape (n_classes,)
Log prior probability of each class.

Methods:

def fit(X, y)

Fit the model according to the given training data.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape = (n_samples,)
Target vector relative to X.

Returns self: Object of class NaiveBayes

def predict(X)

Get the class predictions.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns C: array-like, shape = (n_samples,)
Predicted class label per sample.
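
Example usage (a minimal sketch; the import path src.naive_bayes and the toy data are illustrative assumptions, not part of the documented API):

import numpy as np
from src.naive_bayes import NaiveBayes  # assumed import path for ./src/naive_bayes.py
X_train = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])  # toy data
y_train = np.array([0, 0, 1, 1])
nb = NaiveBayes(type='Gaussian')  # or type='Multinomial'
nb.fit(X_train, y_train)
print(nb.predict(np.array([[1.1, 2.0]])))  # one predicted class label per sample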

Decision Tree

...

class DecisionTree(max_depth = 1, split_val_metric = 'mean', min_info_gain = 0.0,
                    split_node_criterion = 'gini')

This class implements the Decision Tree Classifier.

Parameters max_depth: int, default: 1
The maximum depth up to which the tree is allowed to grow.

split_val_metric: str, ‘mean’ or ‘median’, default: ‘mean’
Determines the value at which the selected column is binarily split: the mean or the median of that column.

min_info_gain: float, default: 0.0
Minimum information gain required for a node to split.

split_node_criterion: str, ‘gini’ or ‘entropy’, default: ‘gini’
Criterion used for splitting nodes.

Methods:

def build_tree(X, y, sample_weights)

Build the Decision Tree according to the given data.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape = (n_samples,)
Target vector relative to X.

sample_weights: array-like, shape = (n_samples,) or None
Sample weights. If None, samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Returns self: Object of class DecisionTree

def predict_new(X)

Get the class predictions.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns C: array-like, shape = (n_samples,)
Predicted class label per sample.
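
Example usage (a minimal sketch; the import path src.decision_tree and the toy data are illustrative assumptions):

import numpy as np
from src.decision_tree import DecisionTree  # assumed import path for ./src/decision_tree.py
X_train = np.array([[2.7, 2.5], [1.3, 1.8], [3.6, 4.0], [3.1, 3.0]])  # toy data
y_train = np.array([0, 0, 1, 1])
dt = DecisionTree(max_depth=3, split_val_metric='median', split_node_criterion='entropy')
dt.build_tree(X_train, y_train, None)  # None: all samples equally weighted
print(dt.predict_new(np.array([[3.0, 3.5]])))  # one predicted class label per sample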

Logistic Regression

...

class LogisticRegression(regulariser = 'L2', lamda = 0.0, num_steps = 100000, learning_rate = 0.1, 
                          initial_wts = None, verbose=False)

This class implements the Logistic Regression Classifier with L1 and L2 regularisation.

Parameters regulariser: str, ‘L1’ or ‘L2’, default: ‘L2’
Specifies whether L1 or L2 norm is to be used for regularisation.

lamda: float, default: 0.0
Hyperparameter for the regularisation term; directly proportional to regularisation strength. The default value of 0.0 means no regularisation is applied.

num_steps: int, default: 100000
Number of iterations for which Gradient Descent will run.

learning_rate: float, default: 0.1
Indicates the size of step taken for Gradient Descent.

initial_wts: array, shape (n_features+1,), default: None
Initial values of the n+1 coefficients, where n is the number of features. If None, the values are drawn randomly from a normal distribution N(0, 1).

verbose: bool, default: False
Whether to display the loss after every 1000 iterations.

Attributes theta_: array, shape (n_features,)
A list of weights.

bias_: float
Bias term of the Logistic Regression equation.

para_theta_: dictionary, {(class, theta_)}
A dictionary with the weights corresponding to each class.

para_bias_: dictionary, {(class, bias_)}
A dictionary with the biases corresponding to each class.

hypothesis: dictionary, {(class, probs)}
A dictionary with the probabilities of each sample corresponding to each class.

Methods:

def fit(X, y)

Fit the model according to the given training data.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape = (n_samples,)
Target vector relative to X.

Returns self: Object of class LogisticRegression

def predict_prob(X)

Returns probability estimates for all classes, ordered by class label.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns T: array-like, shape = (n_samples, n_classes)
Probability estimates per sample, with the exact shape depending on whether n_classes > 2.

def predict(X)

Predicts the class labels for samples in X.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns C: array-like, shape = (n_samples,)
Predicted class label per sample.
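
Example usage (a minimal sketch; the import path src.logistic_regression and the toy data are illustrative assumptions):

import numpy as np
from src.logistic_regression import LogisticRegression  # assumed import path
X_train = np.array([[0.5, 1.0], [1.5, 1.8], [3.0, 3.5], [3.5, 4.0]])  # toy data
y_train = np.array([0, 0, 1, 1])
lr = LogisticRegression(regulariser='L1', lamda=0.01, num_steps=5000, learning_rate=0.1)
lr.fit(X_train, y_train)
print(lr.predict_prob(X_train))  # probability estimates, ordered by class label
print(lr.predict(X_train))       # hard class labels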

Random Forest

...

class RandomForest(n_trees=10, max_depth=5, split_val_metric='mean', min_info_gain=0.0, 
                    split_node_criterion='gini', max_features=5, bootstrap=True, n_cores=1)

This class implements the Random Forest Classifier.

Parameters n_trees: int, default: 10
Number of estimator trees to build the ensemble.

max_depth: int, default: 5
The maximum depth up to which each tree is allowed to grow.

split_val_metric: str, ‘mean’ or ‘median’, default: ‘mean’
Determines the value at which the selected column is binarily split: the mean or the median of that column.

min_info_gain: float, default: 0.0
Minimum information gain required for a node to split.

split_node_criterion: str, ‘gini’ or ‘entropy’, default: ‘gini’
Criterion used for splitting nodes.

max_features: int, default: 5
The number of features to consider when looking for the best split.

bootstrap: bool, default: True
Whether bootstrap samples are to be used. If False, the whole dataset is used to build each tree.

n_cores: int, default: 1
Number of cores on which to run this learning process.

Methods:

def fit_predict(X_train, y_train, X_test)

Fits the model according to the given training data and predicts the class labels for the samples in X_test.

Parameters X_train: {array-like, sparse matrix}, shape = (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y_train: array-like, shape = (n_samples,)
Target vector relative to X_train.

X_test: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns C: array-like, shape = (n_samples,)
Predicted class label per sample of X_test.
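
Example usage (a minimal sketch; the import path src.random_forest and the toy data are illustrative assumptions):

import numpy as np
from src.random_forest import RandomForest  # assumed import path
X_train = np.array([[2.7, 2.5], [1.3, 1.8], [3.6, 4.0], [3.1, 3.0]])  # toy data
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[3.2, 3.4], [1.0, 1.5]])
rf = RandomForest(n_trees=25, max_depth=4, max_features=2, bootstrap=True, n_cores=2)
print(rf.fit_predict(X_train, y_train, X_test))  # predicted labels for X_test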

Adaboost

...

class AdaboostClassifier(max_depth = 1, n_trees = 10, learning_rate = 0.1)

This class implements Adaboost for Multiclass Classification.

Parameters max_depth: int, default: 1
The maximum depth up to which each tree is allowed to grow.

n_trees: int, default: 10
Number of estimator trees to build the ensemble.

learning_rate: float, default: 0.1
Used when calculating the weights of the tree estimators.

Attributes trees_: list, shape: (n_trees,)
A list of tree estimators.

tree_weights_: array, shape: (n_trees,)
Weights of tree estimators.

tree_errors_: array, shape: (n_trees,)
Error of each tree estimator.

n_classes: int
Number of distinct classes.

classes_: array, shape: (n_classes,)
Array of class labels.

n_samples: int
Number of training samples under consideration.

Methods:

def fit(X, y)

Fit the model according to the given training data.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape = (n_samples,)
Target vector relative to X.

Returns self: Object of class AdaboostClassifier

def predict(X)

Predicts the class labels for samples in X.

Parameters X: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns C: array-like, shape = (n_samples,)
Predicted class label per sample.
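
Example usage (a minimal sketch; the import path src.adaboost and the toy data are illustrative assumptions):

import numpy as np
from src.adaboost import AdaboostClassifier  # assumed import path
X_train = np.array([[2.7, 2.5], [1.3, 1.8], [3.6, 4.0], [3.1, 3.0]])  # toy data
y_train = np.array([0, 0, 1, 1])
ada = AdaboostClassifier(max_depth=1, n_trees=50, learning_rate=0.5)
ada.fit(X_train, y_train)
print(ada.predict(X_train))  # predicted class label per sample
print(ada.tree_weights_)     # fitted weights of the tree estimators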

Stacking

...

class Stack([(clf, count)])

This class implements the Stacking Ensemble Classifier.

Parameters [(clf, count)]: list of tuples
The first element of each tuple is a classifier object; the second is the number of times that classifier is to be included in the ensemble.

Methods:

def fit_pred(X_train, X_test, y_train)

Fits the model according to the given training data and predicts the class labels for the samples in X_test.

Parameters X_train: {array-like, sparse matrix}, shape = (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.

y_train: array-like, shape = (n_samples,)
Target vector relative to X_train.

X_test: {array-like, sparse matrix}, shape = (n_samples, n_features)
Input vector, where n_samples is the number of samples and n_features is the number of features.

Returns C: array-like, shape = (n_samples,)
Predicted class label per sample of X_test.
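
Example usage (a minimal sketch; the import paths, the choice of base classifiers, and the toy data are illustrative assumptions):

import numpy as np
from src.naive_bayes import NaiveBayes      # assumed import paths
from src.decision_tree import DecisionTree
from src.stacking import Stack
X_train = np.array([[2.7, 2.5], [1.3, 1.8], [3.6, 4.0], [3.1, 3.0]])  # toy data
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[3.2, 3.4]])
# Each tuple pairs a base classifier with how many times it is included.
stack = Stack([(NaiveBayes(), 1), (DecisionTree(max_depth=2), 2)])
print(stack.fit_pred(X_train, X_test, y_train))  # note: X_test comes before y_train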

For any clarifications, comments, or suggestions, please contact Arkil.