Objective - This repository captures my understanding and implementation of the major concepts in Machine Learning & Inferential Statistics. The motivation behind creating this open-source resource is to validate my theoretical knowledge of these models and to give back to the online community a structured and exhaustive resource for learning the fundamental concepts and industry best practices of Data Science.
Constituent Elements - This repository broadly includes the following elements:
- Tutorial/comparison/benchmarking notebooks (in the form of end-to-end ML mini-projects) for most of the popular ML and associated data science techniques, such as sampling, feature selection, model interpretability and hypothesis testing, organised as follows:
- Statistical Inference - Most of the major hypothesis tests and their prerequisites/use-cases clearly outlined through a hierarchy
- Pre-Processing of data - Handling extreme class imbalance through statistical sampling
- Feature Selection - Generating feature importances and performing feature selection for supervised learning tasks
- Explainable ML - Interpreting the predictions of complex black-box models
- Comparison of Dimensionality Reduction Techniques - Performance comparison of dimensionality reduction techniques such as PCA, t-SNE & UMAP within a real ML pipeline
- Comparison of Dimensionality Reduction Techniques v2 (Malware data) - Performance comparison of dimensionality reduction techniques such as PCA, t-SNE & UMAP within a real ML pipeline (malware data classification)
- Comparison of Clustering Techniques - Performance comparison of clustering techniques such as K-Means, Agglomerative/Hierarchical, DBSCAN & HDBSCAN within a real ML pipeline; a minimal sketch of such a comparison follows this list
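To give a flavour of these comparison notebooks, here is a minimal sketch of a clustering comparison using scikit-learn on a synthetic dataset. The dataset, parameter values and the omission of HDBSCAN are illustrative assumptions made for brevity, not the notebook code itself:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the notebook's real dataset.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=10),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # -1 marks noise points in DBSCAN
    # Silhouette is only defined when at least two clusters are found.
    score = silhouette_score(X, labels) if n_clusters > 1 else float("nan")
    print(f"{name}: {n_clusters} clusters, silhouette = {score:.3f}")
```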
- From-scratch implementations (using numpy) of most of the commonly used ML models, such as linear models, CART, gradient boosting, DBSCAN and artificial neural networks. These implementations are based on extensive personal research of available online resources and are validated against the relevant benchmarks, such as Scikit-Learn, TensorFlow and SciPy. In all cases, the manual implementations have matched or outperformed the open-source Python libraries mentioned above. These include the following:
- Statistical Inference from scratch - Manual implementation of popular hypothesis tests like the t-test, Kruskal-Wallis and Friedman's test for higher visibility into their inner workings, benchmarked against the available SciPy & statsmodels versions (a minimal sketch follows the item below)
- Hypothesis testing from scratch
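As an example of the approach, a minimal sketch of a from-scratch two-sample (Welch) t-test checked against `scipy.stats.ttest_ind` might look as follows; only the t-distribution survival function is borrowed from SciPy to turn the statistic into a p-value, and the data are synthetic:

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Two-sample Welch t-test (unequal variances) implemented with numpy."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    va, vb = a.var(ddof=1), b.var(ddof=1)
    se2 = va / na + vb / nb                      # squared standard error of the mean difference
    t = (a.mean() - b.mean()) / np.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    p = 2 * stats.t.sf(abs(t), df)               # two-sided p-value
    return t, p

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.5, 1.5, 60)

print(welch_t_test(a, b))                        # manual implementation
print(stats.ttest_ind(a, b, equal_var=False))    # SciPy benchmark
```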
- Supervised ML from scratch - Manual implementation of popular classical ML algorithms used for classification & regression, benchmarked against the available Scikit-Learn (or equivalent) versions. These include the following (a minimal KNN example is sketched after the list):
- K-Nearest Neighbors (KNN) from scratch
- Linear Regression (+Lasso +Ridge Regressions) from scratch
- Logistic Regression (with Stochastic Gradient Descent & Regularization) from scratch
- Decision Tree (~CART) for classification from scratch
- Bagging Ensemble for classification from scratch
- Random Forest ensemble for classification from scratch
- Stacking ensemble for classification from scratch
- Gradient Boosting Machine (GBM) ensemble for regression from scratch
- Gradient Boosting Machine (GBM) ensemble for classification from scratch
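A minimal sketch of the from-scratch versus Scikit-Learn comparison, using K-Nearest Neighbors as the simplest example; the dataset and value of k are illustrative assumptions, not the notebook code itself:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict labels by majority vote among the k nearest training points (Euclidean distance)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest points
        preds.append(np.bincount(nearest).argmax())   # majority vote
    return np.array(preds)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

manual_acc = (knn_predict(X_tr, y_tr, X_te, k=5) == y_te).mean()
sklearn_acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
print(manual_acc, sklearn_acc)   # the two accuracies should match (up to tie-breaking)
```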
- Unsupervised ML from scratch - Manual implementation of popular classical ML algorithms used for data exploration & pattern-finding exercises, benchmarked against the available Scikit-Learn (or equivalent) versions. These include the following (a minimal PCA example is sketched after the list):
- Principal Component Analysis (PCA) from scratch
- K-Means Clustering from scratch
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from scratch
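For instance, a from-scratch PCA via eigendecomposition of the covariance matrix can be checked against Scikit-Learn's SVD-based implementation; the projections agree up to a sign flip per component. The dataset below is an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

def pca_fit_transform(X, n_components=2):
    """Project X onto its top principal components via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                    # centre the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

X, _ = load_iris(return_X_y=True)
manual = pca_fit_transform(X, n_components=2)
reference = PCA(n_components=2).fit_transform(X)

# Principal directions are only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(manual), np.abs(reference), atol=1e-6))
```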
- Artificial Neural Networks (ANN) from scratch - Manual implementation of deep neural networks with configurable hyper-parameters, analogous to what the TensorFlow packages offer, benchmarked against the available TensorFlow/Keras (or equivalent) versions. These include the following (a minimal training-loop sketch follows the list):
- ANN with L-layers & SGD from scratch
- ANN with regularization & SGD from scratch
- ANN with regularization, SGD & Dropout from scratch
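A minimal numpy sketch of the core idea: a single-hidden-layer network trained with mini-batch SGD on synthetic data. The L-layer, regularized and dropout variants in the notebooks generalize this loop, and all sizes and hyper-parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)]).reshape(-1, 1)

# One hidden layer of 16 units; weights initialised with small random values.
W1 = rng.normal(0, 0.1, (2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 1)); b2 = np.zeros(1)

lr, epochs, batch = 0.1, 200, 32
for epoch in range(epochs):
    idx = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch):
        sl = idx[start:start + batch]
        xb, yb = X[sl], y[sl]

        # Forward pass
        h = relu(xb @ W1 + b1)
        p = sigmoid(h @ W2 + b2)

        # Backward pass (sigmoid output + binary cross-entropy gives a simple output delta)
        d_out = (p - yb) / len(xb)
        dW2 = h.T @ d_out;  db2 = d_out.sum(axis=0)
        d_h = (d_out @ W2.T) * (h > 0)          # ReLU derivative
        dW1 = xb.T @ d_h;   db1 = d_h.sum(axis=0)

        # SGD update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

acc = ((sigmoid(relu(X @ W1 + b1) @ W2 + b2) > 0.5) == y).mean()
print(f"training accuracy: {acc:.3f}")
```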