A collection of algorithms used in machine learning. I'm going to add to this repo whenever I have time.

Machine learning algorithms

This repo contains my implementations of various machine learning algorithms using sklearn and Python.

Prerequisites

To run the scripts found in this repository, you'll need to install the following Python packages:

  • scikit-learn (imported as sklearn)
  • matplotlib
  • numpy
  • pandas

Additionally, if you want a graphical representation of the decision tree, you'll need to install GraphViz. Make sure the dot.exe from GraphViz's bin directory is in your PATH, or else the script will throw a FileNotFound error.
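
A quick way to check the PATH requirement before running the decision tree script is sketched below. It only uses Python's standard library; the function name graphviz_dot_path is made up for this example and is not part of the repo.

```python
import shutil


def graphviz_dot_path():
    """Return the full path of the GraphViz `dot` executable, or None if it is not on PATH."""
    # shutil.which searches PATH the same way the shell would
    # (on Windows it also tries extensions such as .exe).
    return shutil.which("dot")


if __name__ == "__main__":
    path = graphviz_dot_path()
    if path is None:
        print("dot was not found on PATH; the tree image cannot be rendered")
    else:
        print("Found GraphViz dot at", path)
```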

Algorithms currently implemented:

  1. Linear Regression
  2. K-Nearest Neighbour
  3. Support Vector Machine
  4. K-Means Clustering
  5. Decision Trees
  6. Neural Networks

Linear Regression

Random data points and their linear regression.
Image taken from wikimedia.

Definition:

Linear Regression is the process of finding a line that best fits the data points available on the plot, so that we can use it to predict output values for inputs that are not present in the data set we have, with the belief that those outputs would fall on the line. -- Anas Al-Masri

Problem solved using the algorithm: Estimating the grades of students in G3 based on their results in G1 and G2 as well as their absences during the academic year, their failures and the time studied per week.

Besides predicting the final grade of a student, the linear_regression.py can also plot the relationship between two sets of data.

Accuracy: R²-Score of ~0.75 - ~0.9

In the linear_regression directory you can also find linear_regression_no_lib.py, which is my implementation of linear regression without using sklearn.
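
To give an idea of what fitting a line without a library involves, here is a minimal sketch of ordinary least squares for a single input feature, plus the R² score used above. It is an illustration only and not the contents of linear_regression_no_lib.py; the function names and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r2_score(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data lying exactly on y = 2x + 1 (not the student grade data set).
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept, r2_score([0, 1, 2, 3], [1, 3, 5, 7], slope, intercept))
```

The grade prediction problem uses several input features, which needs the multi-feature version of the same idea; the single-feature case is shown here because it fits in a few lines.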

K-Nearest Neighbour

The 1NN classification map based on the CNN extracted prototypes.
Image taken from wikimedia and made by user Agor153.

Definition:

KNN works by finding the distances between a query and all the examples in the data, selecting the specified number examples (K) closest to the query, then votes for the most frequent label (in the case of classification) or averages the labels (in the case of regression). -- Onel Harrison

Problem solved using the algorithm: Classify the acceptability of a car based on buying and maintenance price, number of doors, capacity in terms of persons, size of luggage boot and the car's safety rating.

Accuracy: ~95% - ~98%
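
The voting step from the definition can be sketched in plain Python as follows. This is a simplified illustration, not the script in this repo: it assumes the features are already numeric (the car attributes would have to be encoded first), uses Euclidean distance, and the name knn_predict and the toy points are made up.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training label with its distance to the query, nearest first.
    by_distance = sorted(
        ((math.dist(point, query), label) for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most frequent label among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```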

Support Vector Machine

Kernel machines are used to compute non-linearly separable functions into a higher dimension linearly separable function.
Image taken from wikimedia and made by user Alisneaky.

Definition:

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. -- Savan Patel

Problem solved using the algorithm: Classify tumors as benign or malignant, based on about 30 criteria such as size, growth and many more.

Accuracy: ~92% - ~96%
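
To make the "separating hyperplane" idea more concrete, below is a rough sketch of a linear soft-margin SVM trained with sub-gradient descent on the hinge loss. This is a swapped-in teaching technique, not how sklearn trains its SVMs and not the code in this repo. The function names, the hyperparameters (lr, lam, epochs) and the four toy points are all assumptions.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=100):
    """Fit weights w and bias b so that sign(w . x + b) separates the classes.

    Labels must be +1 or -1.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: follow the hinge-loss sub-gradient.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only the regularization term shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


points = [(-2, -1), (-1, -2), (1, 2), (2, 1)]
labels = [-1, -1, 1, 1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, p) for p in points])
```

For data that is not linearly separable, a kernel (as in the picture above) is needed; this sketch only covers the linear case.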

K-Means Clustering

Convergence of k-means clustering from an unfavorable starting position (two initial cluster centers are fairly close).
Image taken from wikimedia and made by user Chire.

Definition:

The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters. Initially k number of so called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. -- Ola Söder

Problem solved using the algorithm: Classifying handwritten digits.
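
The assign-then-recompute loop behind k-means can be sketched in plain Python like this. It is a bare-bones illustration and not k_means_cluster.py: the initial centroids are passed in by hand (real implementations pick them more carefully), the function name k_means is an assumption, and the four toy points stand in for the digit data.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and moving the centroids."""
    centroids = [tuple(c) for c in initial_centroids]

    def nearest(point):
        # Index of the centroid closest to `point`.
        return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p)].append(p)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids

    return centroids, [nearest(p) for p in points]


# Two initial centroids that start close together, as in the picture above.
centroids, labels = k_means([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (0, 2)])
print(centroids, labels)
```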

Accuracy:

  • inertia: 69510
  • homogeneity: ~0.61
  • completeness: ~0.66
  • v-measure: ~0.63
  • adjusted-rand: ~0.48
  • adjusted-mutual-info: ~0.61
  • silhouette: ~0.14

Explanation:

(can also be found as a comment in the k_means_cluster.py)

  • inertia: within-cluster sum-of-squares
  • homogeneity: each cluster contains only members of a single class (range 0 - 1)
  • completeness: all members of a given class are assigned to the same cluster (range 0 - 1)
  • v-measure: harmonic mean of homogeneity and completeness
  • adjusted-rand: similarity of the actual values and their predictions, ignoring permutations and with chance normalization (at most 1 for a perfect match, close to 0 for random labelings, negative for worse-than-chance agreement)
  • adjusted-mutual-info: agreement of the actual values and predictions, ignoring permutations and adjusted for chance (at most 1 for perfect agreement, close to 0 for random agreement, and it can be slightly negative)
  • silhouette: compares the mean distance between a sample and all other points in the same cluster with the mean distance between that sample and all points in the nearest other cluster (range -1 to 1: values near -1 indicate samples assigned to the wrong cluster, values near 1 indicate dense, well-separated clusters, and 0 indicates overlapping clusters)
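
As a rough illustration of how three of these scores relate to each other, here is a from-scratch sketch of homogeneity, completeness and v-measure based on entropies. sklearn.metrics provides homogeneity_score, completeness_score and v_measure_score for real use; the helper names and toy labels below are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label sequence (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given): uncertainty left in `targets` once `given` is known."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(classes, clusters):
    h_classes, h_clusters = entropy(classes), entropy(clusters)
    # Homogeneity: how little class uncertainty remains inside each cluster.
    homogeneity = 1.0 if h_classes == 0 else 1 - conditional_entropy(classes, clusters) / h_classes
    # Completeness: how little cluster uncertainty remains inside each class.
    completeness = 1.0 if h_clusters == 0 else 1 - conditional_entropy(clusters, classes) / h_clusters
    if homogeneity + completeness == 0:
        v_measure = 0.0
    else:
        # Harmonic mean of the two scores.
        v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v_measure


# Every cluster is pure (homogeneous), but each class is split over two clusters.
print(homogeneity_completeness_v([0, 0, 1, 1], [0, 1, 2, 3]))
```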

Decision Trees

A sample tree
Image generated using decision_tree.py.

Definition:

In computer science, Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). -- Wikipedia

Problem solved using the algorithm: Predicting the onset of diabetes based on diagnostic measures.

Accuracy:

  • ~78% with a depth of 4 and information gain as attribute selection measure
  • ~76% with a depth of 3 and information gain as attribute selection measure
  • ~76% with a depth of 4 and gini impurity as attribute selection measure
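
The two attribute selection measures compared above can be sketched as follows. In sklearn they correspond to the criterion="entropy" and criterion="gini" options of DecisionTreeClassifier. The helper names and toy labels here are assumptions, and this is not code from decision_tree.py.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: probability of mislabelling a random sample from the node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child nodes.

    With `entropy` this is the information gain of the split.
    """
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [0, 0, 1, 1]
print(impurity_decrease(parent, [[0, 0], [1, 1]]))        # a split that separates the classes
print(impurity_decrease(parent, [[0, 1], [0, 1]]))        # a split that separates nothing
print(impurity_decrease(parent, [[0, 0], [1, 1]], gini))  # same first split, scored with gini
```

A tree builder tries many candidate splits at each node and keeps the one with the largest decrease.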

Neural Networks

A network with the topology 4-10-3
The first neural network declared in neural_network.py

A neural network with the topology 4-5-3-3
The second network declared in neural_network.py

Definition:

Artificial neural networks (ANN) or connectionist systems are computing systems that are inspired by, but not identical to, biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. -- Wikipedia

Problem solved using neural networks: Classification of iris flowers based on petal length and width as well as sepal length and width.

Accuracy: Both networks get a mean accuracy of around 0.9 to 1.0.
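
To show what a topology such as 4-10-3 means in practice, here is a small sketch of a forward pass through an untrained network in plain Python. It only illustrates the layer shapes: the weights are random, no training happens, and the ReLU and softmax activations, the function names and the sample input are assumptions rather than what neural_network.py does.

```python
import math
import random


def build_network(topology, seed=0):
    """Create one (weights, biases) pair per connection between consecutive layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(topology, topology[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(inputs, layers):
    """Propagate one sample through the network and return class probabilities."""
    activation = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [max(0.0, value) for value in z]  # ReLU on hidden layers
        else:
            # Softmax on the output layer (shifted by the max for numerical stability).
            peak = max(z)
            exps = [math.exp(value - peak) for value in z]
            total = sum(exps)
            activation = [e / total for e in exps]
    return activation


# 4 inputs (sepal/petal measurements), 10 hidden units, 3 output classes.
network = build_network([4, 10, 3])
print(forward([5.1, 3.5, 1.4, 0.2], network))
```

Calling build_network([4, 5, 3, 3]) gives the shape of the second topology, with two hidden layers.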