Implementation of Machine Learning Algorithm from Scratch

Learn Machine Learning from the basics to advanced topics, and develop Machine Learning models from scratch in Python

1. WHAT YOU WILL LEARN

  • Obtain a solid understanding of machine learning, from basic to advanced
  • Complete tutorials on core packages like NumPy and Pandas
  • Data Preprocessing and Data Visualization
  • Understand Machine Learning and how to apply it in your own programs
  • Understand the concepts behind the algorithms
  • Know how to optimize the hyperparameters of your models
  • Learn how to develop models based on the requirements of your business
  • Potential for a new job in the future

2. DESCRIPTION

Are you interested in Data Science and Machine Learning, but you don't have any background and find the concepts confusing?
Are you interested in programming in Python, but you have always been afraid of coding?

😊I think this repo is for you!😊

Even if you are already familiar with machine learning, this repo can help you review all the techniques and understand the concept behind each term. The repo is fully categorized, and I don't start from the middle! I start with the concept behind every term, and then I implement it in Python step by step. The structure of the repo is as follows:

3. WHO THIS REPO IS FOR:

  • Anyone from any background who is interested in Data Science and Machine Learning, with at least high-school (+2) knowledge of mathematics
  • Beginner, intermediate, and even advanced students in Artificial Intelligence (AI), Data Science (DS), and Machine Learning (ML)
  • College students looking to secure their future jobs
  • Students who want to excel in their Final Year Project by learning Machine Learning
  • Anyone who is afraid of coding in Python but is interested in Machine Learning concepts
  • Anyone who wants to create new insights from different datasets using machine learning
  • Students who want to apply machine learning models in their projects

4. CONTENTS

Useful Resources

| Title | Repository |
| --- | --- |
| USEFUL GIT COMMANDS FOR EVERYDAY USE | 🔗 |
| MOST USEFUL LINUX COMMANDS EVERYONE SHOULD KNOW | 🔗 |
| AWESOME ML TOOLBOX | 🔗 |

Installation

| Title | Repository |
| --- | --- |
| INSTALL THE ANACONDA PYTHON ON WINDOWS AND LINUX | 🔗 |

Reality vs Expectation

| Title | Repository |
| --- | --- |
| IS AI OVERHYPED? REALITY VS EXPECTATION | 🔗 |

Machine Learning from Beginner to Advanced

| Title | Repository |
| --- | --- |
| HISTORY OF MATHEMATICS, AI & ML - HISTORY & MOTIVATION | 🔗 |
| INTRODUCTION TO ARTIFICIAL INTELLIGENCE & MACHINE LEARNING | 🔗 |
| KEY TERMS USED IN MACHINE LEARNING | 🔗 |
| PERFORMANCE METRICS IN MACHINE LEARNING CLASSIFICATION MODEL | 🔗 |
| PERFORMANCE METRICS IN MACHINE LEARNING REGRESSION MODEL | 🔗 |

Scratch Implementation

| Title | Repository |
| --- | --- |
| LINEAR REGRESSION FROM SCRATCH | 🔗 |
| LOGISTIC REGRESSION FROM SCRATCH | 🔗 |
| NAIVE BAYES FROM SCRATCH | 🔗 |
| DECISION TREE FROM SCRATCH | 🔗 |
| RANDOM FOREST FROM SCRATCH | 🔗 |
| K NEAREST NEIGHBOUR | 🔗 |
| NAIVE BAYES | 🔗 |
| K MEANS CLUSTERING | 🔗 |

Mathematical Implementation

| Title | Repository |
| --- | --- |
| CONFUSION MATRIX FOR YOUR MULTI-CLASS ML MODEL | 🔗 |
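As a taste of what the linked notebook walks through: a multi-class confusion matrix is just a count table indexed by (true class, predicted class). Below is a minimal sketch, not the notebook's code; the function name is illustrative and only NumPy is assumed.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
print(confusion_matrix(y_true, y_pred, 3))
# [[1 1 0]
#  [0 1 0]
#  [1 0 2]]
```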

Machine Learning Interview Questions with Answers

| Title | Repository |
| --- | --- |
| 50 QUESTIONS ON STATISTICS & MACHINE LEARNING – CAN YOU ANSWER? | 🔗 |

Essential Machine Learning Formulas

| Title | Repository |
| --- | --- |
| MOSTLY USED MACHINE LEARNING FORMULAS | 🔗 |

Practice Guide for Data Science Learning

| Title | Repository |
| --- | --- |
| Research Guide for FYP | 🔗 |
| The Intermediate Guide to 180 Days Data Science Learning Plan | 🔗 |

Algorithm Pros and Cons

  • K-Nearest Neighbors (KNN) (sketch below)
    ✔ Simple, No training phase, No assumptions about the data, Easy to implement, New data can be added seamlessly, Only one hyperparameter
    ✖ Doesn't work well in high dimensions, Sensitive to noisy data, missing values, and outliers, Doesn't work well with large datasets (the cost of calculating distances is high), Needs feature scaling, Doesn't work well on imbalanced data

  • Decision Tree (sketch below)
    ✔ Doesn't require standardization or normalization, Easy to implement, Can handle missing values, Automatic feature selection
    ✖ High variance, Higher training time, Can become complex, Can easily overfit

  • Random Forest
    ✔ Left-out data can be used for testing, High accuracy, Provides feature importance estimates, Can handle missing values, Doesn't require feature scaling, Good performance on imbalanced datasets, Can handle large datasets, Outliers have little impact, Less overfitting
    ✖ Less interpretable, Needs more computational resources, High prediction time

  • Linear Regression (sketch below)
    ✔ Simple, Interpretable, Easy to implement
    ✖ Assumes a linear relationship between features and target, Sensitive to outliers

  • Logistic Regression (sketch below)
    ✔ Doesn't assume a linear relationship between the independent and dependent variables, Output can be interpreted as a probability, Robust to noise
    ✖ Requires more data, Only effective when classes are (close to) linearly separable

  • Lasso Regression (L1) (sketch below)
    ✔ Prevents overfitting, Selects features by shrinking coefficients to exactly zero
    ✖ Selected features will be biased, Prediction can be worse than Ridge

  • Ridge Regression (L2) (sketch below)
    ✔ Prevents overfitting
    ✖ Increases bias, Less interpretability

  • AdaBoost
    ✔ Fast, Reduced bias, Little need for tuning
    ✖ Vulnerable to noise, Can overfit

  • Gradient Boosting
    ✔ Good performance
    ✖ Harder to tune hyperparameters

  • XGBoost
    ✔ Less feature engineering required, Outliers have little impact, Can output feature importance, Handles large datasets, Good model performance, Less prone to overfitting
    ✖ Difficult to interpret, Harder to tune as there are numerous hyperparameters

  • SVM (sketch below)
    ✔ Performs well in higher dimensions, Excellent when classes are separable, Outliers have less impact
    ✖ Slow, Poor performance with overlapping classes, Selecting an appropriate kernel function can be tricky

  • Naïve Bayes (sketch below)
    ✔ Fast, Simple, Requires less training data, Scalable, Insensitive to irrelevant features, Good performance with high-dimensional data
    ✖ Assumes independence of features

  • Deep Learning
    ✔ Superb performance with unstructured data (images, video, audio, text)
    ✖ (Very) long training time, Many hyperparameters, Prone to overfitting
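The minimal sketches below illustrate several of the algorithms in the list above. They are simplified teaching versions rather than the repo's notebook implementations: every function name is made up for illustration, and only NumPy is assumed.

K-Nearest Neighbors needs no training phase at all; prediction is simply "find the k closest training points and take a majority vote", which is also why the distance computation becomes expensive on large datasets:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each row of X_test by majority vote among its k nearest
    training points (Euclidean distance)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training point
        nearest = np.argsort(dists)[:k]              # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])      # majority vote
    return np.array(preds)

# Toy usage: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[1.1, 0.9], [8.1, 7.9]])))  # [0 1]
```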
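A decision tree grows by repeatedly choosing the split that most reduces impurity; a common impurity measure is the Gini index. Here is a sketch of the impurity and of the gain a candidate split achieves (the full tree-growing loop lives in the linked notebook):

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum(p_c^2); 0 for a pure node, 0.5 for a 50/50 binary mix."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(y, y_left, y_right):
    """Impurity reduction achieved by splitting labels y into y_left and y_right."""
    n = len(y)
    weighted = (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
    return gini(y) - weighted

y = np.array([0, 0, 1, 1])
print(gini(y))                      # 0.5: maximally mixed
print(split_gain(y, y[:2], y[2:]))  # 0.5: a perfect split removes all impurity
```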
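Linear regression can be fit from scratch by gradient descent on the mean squared error (a closed-form solution also exists); a sketch:

```python
import numpy as np

def linear_regression_gd(X, y, lr=0.5, epochs=2000):
    """Fit y ≈ X·w + b by gradient descent on the mean squared error."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        err = X @ w + b - y                # residuals
        w -= lr * (2.0 / n) * (X.T @ err)  # dMSE/dw
        b -= lr * (2.0 / n) * err.sum()    # dMSE/db
    return w, b

# Toy data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.01, size=100)
w, b = linear_regression_gd(X, y)
print(w, b)  # close to [3.] and 1.0
```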
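Logistic regression squashes a linear score through the sigmoid so the output can be read as a probability, and fits the weights by gradient descent on the log loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(X·w + b) by gradient descent on the log loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)         # predicted probabilities
        w -= lr * (X.T @ (p - y)) / n  # gradient of the mean log loss w.r.t. w
        b -= lr * (p - y).mean()       # ... and w.r.t. b
    return w, b

# Toy usage: two separable clusters on a line
X = np.array([[0.0], [0.5], [4.5], [5.0]])
y = np.array([0, 0, 1, 1])
w, b = logistic_regression_gd(X, y)
print(sigmoid(X @ w + b).round(2))  # near 0 for class 0, near 1 for class 1
```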
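Ridge (L2) adds a lam·‖w‖² penalty to the squared error, which preserves a convenient closed-form solution; lasso (L1) has no closed form and is typically fit by coordinate descent, which is what lets it drive some coefficients exactly to zero. A closed-form ridge sketch (for brevity the bias column is regularized along with the weights, which a production implementation usually avoids):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam·I)^{-1} X^T y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column of ones
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ y)

# Toy data with true weights [2, 0, -1] and zero bias
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0.0, 0.1, size=50)
print(ridge_fit(X, y, lam=0.1).round(2))  # roughly [ 2.  0. -1.  0.] (last entry is the bias)
```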
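A linear soft-margin SVM can be trained by (sub)gradient descent on the regularized hinge loss. Kernels (the tricky selection mentioned above) are omitted here; the linear case fits in a few lines:

```python
import numpy as np

def linear_svm_sgd(X, y, lam=0.01, lr=0.01, epochs=5000):
    """Linear soft-margin SVM via subgradient descent on
    mean hinge loss + (lam/2)·‖w‖²; labels must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1  # points violating the margin
        w -= lr * (lam * w - (X[mask].T @ y[mask]) / n)
        b -= lr * (-y[mask].sum() / n)
    return w, b

# Toy usage: two separable clusters
X = np.array([[1.0, 1.0], [2.0, 1.5], [7.0, 8.0], [8.0, 7.5]])
y = np.array([-1, -1, 1, 1])
w, b = linear_svm_sgd(X, y)
print(np.sign(X @ w + b))  # expected: [-1. -1.  1.  1.]
```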
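Gaussian naïve Bayes only has to estimate a class prior plus per-class, per-feature means and variances; the "naive" independence assumption is exactly why it is so fast and needs so little training data:

```python
import numpy as np

def gnb_fit(X, y):
    """Per class: prior P(c) and per-feature mean/variance
    (features are assumed independent within each class)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior
                     Xc.mean(axis=0),        # per-feature means
                     Xc.var(axis=0) + 1e-9)  # per-feature variances (smoothed)
    return params

def gnb_predict(params, X):
    """Pick the class with the highest log posterior for each row of X."""
    preds = []
    for x in X:
        scores = {
            c: np.log(prior)
               - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            for c, (prior, mu, var) in params.items()
        }
        preds.append(max(scores, key=scores.get))
    return np.array(preds)

X = np.array([[1.0, 2.0], [1.2, 1.8], [6.0, 6.5], [5.8, 6.8]])
y = np.array([0, 0, 1, 1])
print(gnb_predict(gnb_fit(X, y), np.array([[1.1, 2.1], [6.1, 6.4]])))  # [0 1]
```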



AI/ML Datasets

| Source | Link |
| --- | --- |
| Google Dataset Search – A search engine for datasets | 🔗 |
| IBM's collection of datasets for enterprise applications | 🔗 |
| Kaggle Datasets | 🔗 |
| Huggingface Datasets – A Python library for loading NLP datasets | 🔗 |
| A large list organized by application domain | 🔗 |
| Computer Vision Datasets (a really large list) | 🔗 |
| Datasetlist – Datasets by domain | 🔗 |
| OpenML – A search engine for curated datasets and workflows | 🔗 |
| Papers with Code – Datasets with benchmarks | 🔗 |
| Penn Machine Learning Benchmarks | 🔗 |
| VisualDataDiscovery (for Computer Vision) | 🔗 |
| UCI Machine Learning Repository | 🔗 |
| Roboflow Public Datasets for computer vision | 🔗 |