Implementations of ML and DL solutions from my learning experience

Machine-Learning-Journey

This repository contains machine learning and deep learning projects that I have developed, and continue to develop, on my path to becoming a professional ML engineer. It includes links to the source code I have published, as well as related work such as articles published in Towards Data Science and Geek Culture. My knowledge comes partly from work experience, such as my internship at INRIA, where for my thesis project I investigated pruning methods for neural network compression in Julia. Moreover, out of personal passion for these topics, I have studied independently from several books, such as:

I also regularly follow machine learning material, such as the following YouTube channels:

Finally, I should mention the most common learning platforms: Udemy, where I enrolled in Deep Learning Course with Flutter & Python, a course on implementing deep learning algorithms on mobile devices, and Coursera, where I learned the basics from Andrew Ng's courses.

I have also started studying low-level programming with CUDA in order to boost deep learning performance, since most frameworks, such as TensorFlow and PyTorch, are built on kernel launches. Check out my Cuda Programming Repo.

In my A.I Art repository I have started publishing scripts about GANs that can generate art in the form of pictures, audio, text, etc.

Machine Learning

Recent Article 1

  • Binary Classification using scikit-learn

    I used the MNIST dataset, modifying it slightly so that I could train a binary classifier that recognizes whether a digit is "5" or "not 5". I evaluated the algorithms involved (random forest, SGD, and a dummy classifier) using cross-validation to get a reliable estimate. In this case the accuracy metric was not of much help: even the dummy classifier reached a very high accuracy, because there are many more "not 5" images than "5" images. So I delved into metrics like precision, recall, and F1. I used them to plot a PR curve and compared different decision thresholds to find the best operating point for classification. I also studied the ROC curve and the area under it (AUC). I used these metrics to compare the algorithms and concluded that the random forest performed best.
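
    The sketch below illustrates why accuracy is misleading on an imbalanced "5 vs not 5" task. It is not the notebook's code: as an assumption, it uses the small 8x8 digits dataset bundled with scikit-learn (`load_digits`) as an offline stand-in for MNIST, and the model settings are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Assumption: bundled 8x8 digits dataset used as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)
y_is_5 = (y == 5)  # binary target: True for "5", False for "not 5"

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),  # always predicts "not 5"
    "sgd": SGDClassifier(random_state=42),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # Out-of-fold predictions: every sample is predicted by a model
    # that did not see it during training.
    pred = cross_val_predict(model, X, y_is_5, cv=3)
    results[name] = {
        "accuracy": accuracy_score(y_is_5, pred),
        "precision": precision_score(y_is_5, pred, zero_division=0),
        "recall": recall_score(y_is_5, pred, zero_division=0),
        "f1": f1_score(y_is_5, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

    The dummy classifier never predicts "5", so its recall and F1 are zero even though its accuracy is high, which is the effect described above.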

  • Ensemble Learning: Bagging, Pasting, Boosting and Stacking

    Ensemble learning: combine a few good predictors (decision trees, SVMs, etc.) to obtain one more accurate predictor.

    Bagging and Pasting:

    These approaches use the same training algorithm for every predictor, but train each one on a different random subset of the training set. When sampling is performed with replacement, the method is called bagging; otherwise it is called pasting. Random Forest, one of the most powerful algorithms in ML, is an example of bagging applied to decision trees. It can also be used for feature selection.

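    Out-of-bag evaluation: during bootstrapping, some instances may be sampled several times for a given predictor, while others may not be sampled at all. The latter are called out-of-bag instances, and since the predictor never saw them during training, they can be used to evaluate it without a separate validation set.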
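
    A minimal sketch of bagging, pasting, and out-of-bag evaluation with scikit-learn's `BaggingClassifier`. The synthetic `make_moons` data and all hyperparameters are assumptions chosen for illustration, not the values used in the notebook.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: small synthetic dataset, used only for illustration.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Bagging: each tree is trained on a bootstrap sample (drawn with replacement).
# oob_score=True evaluates each tree on the instances it never sampled.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
bagging.fit(X, y)
print("out-of-bag accuracy:", round(bagging.oob_score_, 3))

# Pasting: same idea, but subsets are drawn without replacement.
# max_samples=0.5 gives every tree a different half of the training set.
pasting = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.5,
    bootstrap=False,
    random_state=42,
)
pasting.fit(X, y)
pasting_train_accuracy = pasting.score(X, y)
print("pasting training accuracy:", round(pasting_train_accuracy, 3))
```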

    Boosting: this is another ensemble approach. The most famous boosting methods are AdaBoost and Gradient Boosting.

    AdaBoost: each new predictor (model) in the ensemble focuses on correcting the instances that its predecessor underfitted, by increasing the weights of the misclassified instances. Boosting cannot be parallelized, because each predictor has to wait for the previous one. In scikit-learn, the "SAMME" algorithm is used for multiclass AdaBoost, while "SAMME.R" relies on class probabilities instead of predictions and usually performs better (note that recent scikit-learn releases deprecate "SAMME.R").
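
    As a rough sketch of AdaBoost, the example below boosts decision stumps with scikit-learn's `AdaBoostClassifier` and compares the result with a single stump. The synthetic data and hyperparameters are assumptions, and the `algorithm` argument is left at its default so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: small synthetic dataset, used only for illustration.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# Baseline: a single decision stump (a tree with one split).
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X, y)
stump_accuracy = stump.score(X, y)

# AdaBoost trains stumps sequentially; each one gives more weight to the
# instances the previous stumps misclassified.
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X, y)
ada_accuracy = ada.score(X, y)

print("single stump training accuracy:", round(stump_accuracy, 3))
print("AdaBoost training accuracy:", round(ada_accuracy, 3))
```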

    Gradient Boosting: similar to AdaBoost, but instead of adjusting instance weights, each new predictor is fit to the residual errors of the previous predictor.
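
    The residual-fitting idea can be sketched by hand with a few regression trees. This is a simplified illustration (squared-error loss, no learning rate) on synthetic data invented for the example; scikit-learn's `GradientBoostingRegressor` provides a full implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumption: noisy quadratic data generated only for this illustration.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

trees = []
residual = y.copy()
for _ in range(3):
    # Each new tree is trained on what the previous trees failed to explain.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)
    residual = residual - tree.predict(X)
    trees.append(tree)


def ensemble_predict(X_new):
    """Sum the predictions of all trees in the ensemble."""
    return sum(tree.predict(X_new) for tree in trees)


mse_first = np.mean((y - trees[0].predict(X)) ** 2)
mse_all = np.mean((y - ensemble_predict(X)) ** 2)
print("training MSE with 1 tree:", mse_first)
print("training MSE with 3 trees:", mse_all)
```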

    Stacking: this is the last ensemble method. Instead of aggregating the predictors with trivial rules like majority voting, we train a model to perform the aggregation. Each predictor outputs its own prediction, and a final predictor, called the blender or meta-learner, takes these predictions as input and outputs the final value.
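
    A minimal stacking sketch using scikit-learn's `StackingClassifier`. The choice of base predictors, the logistic regression blender, and the synthetic data are all assumptions made for the example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: small synthetic dataset, used only for illustration.
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    # The blender (meta-learner) is trained on the base predictors'
    # cross-validated outputs instead of using a fixed voting rule.
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print("stacking test accuracy:", round(test_accuracy, 3))
```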

  • Dimensionality Reduction

    Projection and manifold learning. PCA, Kernel PCA, and Incremental PCA, used to speed up computation and for data visualization.
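
    The three PCA variants can be sketched as follows. The bundled digits dataset, the 95% variance target, the batch count, and the RBF kernel settings are assumptions picked for the example rather than values taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

# Assumption: bundled 8x8 digits dataset (64 features per image).
X, _ = load_digits(return_X_y=True)

# PCA: keep the smallest number of components that preserves 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
n_kept = pca.n_components_
print("components kept:", n_kept, "out of", X.shape[1])

# Incremental PCA: fit in mini-batches, useful when the data does not fit in memory.
inc_pca = IncrementalPCA(n_components=n_kept)
for batch in np.array_split(X, 10):
    inc_pca.partial_fit(batch)
X_incremental = inc_pca.transform(X)

# Kernel PCA: nonlinear projection, here down to 2D for visualization.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.001)
X_2d = kpca.fit_transform(X)
print("2D embedding shape:", X_2d.shape)
```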

  • Solution to Kaggle's Titanic - Machine Learning from Disaster challenge

    Kaggle challenge: https://www.kaggle.com/c/titanic/overview
    In this code I developed a custom sklearn transformer.
    I used a pipeline to preprocess the data.
    Model selection was run using both randomized search and Optuna.
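
    The sketch below shows how a custom transformer, a preprocessing pipeline, and a randomized search can fit together. It is not the repository's code: `FamilySizeAdder` is a hypothetical transformer, the DataFrame is synthetic with Titanic-like column names, and the Optuna part is omitted to keep the example dependency-free.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds FamilySize = SibSp + Parch + 1."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame
        X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        return X


# Assumption: synthetic data with Titanic-like columns, not the Kaggle files.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Fare": rng.uniform(5, 100, n),
    "Sex": rng.choice(["male", "female"], n),
})
df.loc[df.index % 10 == 0, "Age"] = np.nan  # simulate missing ages
# Synthetic label: mostly determined by Sex, with 20% of the labels flipped.
y = np.logical_xor(df["Sex"].to_numpy() == "female", rng.random(n) < 0.2).astype(int)

numeric_columns = ["Age", "SibSp", "Parch", "Fare", "FamilySize"]
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
pipeline = Pipeline([
    ("family", FamilySizeAdder()),
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Randomized search samples a few hyperparameter combinations at random.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "clf__n_estimators": randint(20, 100),
        "clf__max_depth": randint(2, 8),
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(df, y)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

    Keeping the feature engineering inside the pipeline means it is re-applied on every cross-validation fold, so the search evaluates the whole preprocessing and model chain together.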

  • Basic operations in PyTorch

  • Linear and Logistic Regression in PyTorch

Deep Learning