/Machine-Learning-in-Python

My learnings on different Machine Learning algorithms with Python.

Primary language: Jupyter Notebook. License: MIT.

Machine Learning in Python

This repository contains Machine Learning projects written in Python. All of the projects are implemented in Jupyter Notebooks.

Libraries Required

The following libraries are required to successfully implement the projects.

  • Python 3.6+
  • NumPy (for Linear Algebra)
  • Pandas (for Data Preprocessing)
  • Scikit-learn (for ML models)
  • Matplotlib (for Data Visualization)
  • Seaborn (for statistical data visualization)

The projects are divided into the categories listed below.

Supervised Learning

  • Linear Regression

  • Logistic Regression : In this project, I train a binary Logistic Regression classifier to predict whether a student will be selected, based on mid-semester and end-semester marks.

  • Support Vector Machine : In this project, I build a Support Vector Machine classifier on the Social Network Ads dataset. Given a user's age and estimated salary, it predicts whether the user will buy the product after watching the ads. It uses the Radial Basis Function (RBF) kernel of SVM.

  • K Nearest Neighbours : K Nearest Neighbours, or KNN, is one of the simplest machine learning algorithms. In this project, I build a KNN classifier on the Iris Species Dataset, which predicts the three species of Iris from four features: sepal_length, sepal_width, petal_length and petal_width.

  • Naive Bayes : In this project, I build a Naive Bayes Classifier to classify messages into their categories, using the fetch_20newsgroups dataset from sklearn.

  • Decision Tree Classification : In this project, I used the Iris Dataset and trained a Decision Tree Classifier, which gives an accuracy of 96.7%, lower than KNN (98.3%).

  • Random Forest Classification : In this project, I used a Random Forest Classifier and a Random Forest Regressor on the Social Network Ads dataset.
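
As a rough illustration of the Iris projects above, the following sketch trains a KNN classifier and a Decision Tree on scikit-learn's built-in copy of the Iris data and prints their test accuracy. The split ratio, `random_state`, and `n_neighbors=5` are assumptions made for this example, not the settings used in the notebooks, so the numbers it prints need not match the accuracies reported in this README.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Four features per flower: sepal length/width and petal length/width.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing (assumed split, stratified by species).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameters below are illustrative assumptions.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Swapping in another classifier, such as `sklearn.svm.SVC(kernel="rbf")`, only requires adding one more entry to the `models` dictionary.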

Unsupervised Learning

  • K Means Clustering : K-Means clustering is used to find intrinsic groups within an unlabelled dataset and draw inferences from them. This is one of the most detailed projects. In it, I implement K-Means Clustering on the Credit Card Dataset to cluster credit card users based on their features. I scaled the data using StandardScaler, which standardizes each feature to zero mean and unit variance (it does not rescale to the range 0 to 1), because K-Means is distance-based and features on comparable scales improve convergence. I also implemented the Elbow Method to search for the best number of clusters. To visualize the dataset, I used PCA (Principal Component Analysis) for dimensionality reduction, as the dataset has a large number of features. Finally, I used the Silhouette Score to evaluate the quality of the clustering. It ranges from -1 to 1, and I got a score of 0.203.
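
The pipeline described above (scaling, Elbow Method, final clustering, Silhouette Score, PCA for plotting) can be sketched roughly as follows. The credit card data is not bundled with scikit-learn, so this example substitutes synthetic data from `make_blobs`; the sample count, feature count, range of `k`, and the final `best_k` are all assumptions for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card features (assumed: 500 rows, 8 columns).
X, _ = make_blobs(n_samples=500, n_features=8, centers=4, random_state=0)

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Elbow Method: record the within-cluster sum of squares (inertia) for each k.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)
print("Inertia per k:", np.round(inertias, 1))

# Hypothetical choice; in practice, read it off the bend in the inertia curve.
best_k = 4
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

# Silhouette Score lies in [-1, 1]; higher means better-separated clusters.
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score: {score:.3f}")

# Project to two principal components so the clusters can be drawn in 2D.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```

The `X_2d` array can then be passed to a Matplotlib or Seaborn scatter plot, colored by `labels`.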

NLP (Natural Language Processing)

  • Text Analytics : An introductory project on Text Analytics in NLP. I performed the following important steps -

    • Tokenization
    • Removal of Special Characters
    • Lower Case
    • Removing StopWords
    • Stemming
    • Count Vectorizer (which can perform all the steps mentioned above except Stemming; in scikit-learn, stop-word removal must be enabled explicitly)
    • DTM (Document Term Matrix)
    • TF-IDF (Term Frequency Inverse Document Frequency)
  • Sentiment Analysis : I applied sentiment analysis to the MovieReview dataset (from the nltk library) and the RestaurantReview dataset to predict positive and negative reviews. I used a Naive Bayes Classifier (78.8%) and Logistic Regression (84.3%) to build the models and make predictions.

Data Cleaning and Preprocessing

  • Data Preprocessing : I perform various data preprocessing and cleaning methods, which are listed below -
    • Label Encoding : It converts each category into a unique integer ranging from 0 to n-1, where n is the number of unique categories in the column.
    • Ordinal Encoding : Categories to ordered numerical values.
    • One Hot Encoding : It creates a separate dummy variable with value 0 or 1 for each category, so n extra columns are created, where n is the number of unique values in the column.
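
The three encodings above can be compared side by side on a small example. This is a sketch using scikit-learn's encoders; the `sizes` column and the small < medium < large ordering are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column with three unique values.
sizes = np.array(["small", "large", "medium", "small"])

# Label Encoding: integers 0..n-1, assigned in alphabetical order of the categories.
label_codes = LabelEncoder().fit_transform(sizes)
print(label_codes)  # [2 0 1 2]  (large=0, medium=1, small=2)

# Ordinal Encoding: integers that follow an order we state explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()
print(ordinal_codes)  # [0. 2. 1. 0.]

# One Hot Encoding: one 0/1 column per category (columns: large, medium, small).
one_hot = OneHotEncoder().fit_transform(sizes.reshape(-1, 1)).toarray()
print(one_hot)
```

Note that Label Encoding imposes an arbitrary alphabetical order, which is why Ordinal Encoding is preferable when the categories have a meaningful ranking.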

Some Comparisons on Datasets

Social Network Ads         Accuracy
Support Vector Machine     90.83%
Random Forest Classifier   90.0%
Random Forest Regressor    61.8%

Iris Dataset               Accuracy
KNN                        98.3%
Decision Tree              96.7%

Kaggle

[Image: screenshot taken 2021-08-05 06:34:18]