Data Science Portfolio

Repository containing a portfolio of data science projects I completed for academic, self-learning, and hobby purposes. The projects are presented as Jupyter notebooks and R Markdown files (published on RPubs).

For a more visually pleasing way to browse the portfolio, check out sajalsharma.com.

The R portfolio is located here.

Note: Data used in the projects (found in the data directory) is for demonstration purposes only.

Instructions for Running Python Notebooks Locally

Install the dependencies listed in requirements.txt (for example, with pip install -r requirements.txt).
Run the notebooks as usual with a Jupyter notebook server, VS Code, etc.

Machine Learning
- Predicting Boston Housing Prices: A model to predict the value of a given house in the Boston real estate market using various statistical analysis tools. Uses machine learning to identify the best price at which a client can sell their house.
- Supervised Learning: Finding Donors for CharityML: Testing out several different supervised learning algorithms to build a model that accurately predicts whether an individual makes more than $50,000, to identify likely donors for a fictional non-profit organisation.
- Unsupervised Learning: Creating Customer Segments: Analyzing a dataset of customers' annual spending amounts (reported in monetary units) across diverse product categories to discover internal structure and patterns.
- Reinforcement Learning: Training a Smartcab to Drive: Creating an optimized Q-Learning driving agent that will navigate a Smartcab through its environment towards a goal.
- Deep Learning: Digit Sequence Recognition using CNNs: Designing and implementing a Convolutional Neural Network that learns to recognize sequences of digits using synthetic data generated by concatenating images from MNIST.
Tools: scikit-learn, Pandas, Seaborn, Matplotlib, Pygame
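
For readers unfamiliar with the Q-Learning approach named in the Smartcab project above, the core update rule can be sketched in a few lines. This is a minimal illustration only, not the project's code: the corridor environment, the `step` and `train` functions, and all hyperparameter values are hypothetical and chosen for brevity.

```python
import random
from collections import defaultdict

# Hypothetical toy environment (not the Smartcab simulator):
# a 1-D corridor with cells 0..4, where cell 4 is the goal.
ACTIONS = (-1, 1)  # move left or move right
GOAL = 4


def step(state, action):
    """Apply an action, clip to the corridor, and reward reaching the goal."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def greedy(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q, state, rng)
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()
# Greedy policy for every non-goal state.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
print(policy)
```

After training, the greedy policy moves right in every cell, which is the shortest route to the goal in this toy setting.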
Natural Language Processing
- Disaster Message Classifier: A multilabel classification model to predict the categories of a disaster message. Includes an ETL pipeline for data processing, an ML pipeline to train the model, and a web app, with visualizations, where the model can be used to classify messages. Tools: NLTK, scikit-learn, XGBoost, Flask, Plotly
- 3-way Sentiment Analysis for Tweets: 3-way polarity (positive, negative, neutral) classification system for tweets, without using NLTK's sentiment analysis engine.
- Cross-language Information Retrieval: A cross-language information retrieval (CLIR) system which, given a query in German, searches text documents written in English.
Tools: NLTK, scikit-learn
Data Analysis and Visualisation
- Python
  - Scalable Walkability Analysis of Melbourne: Analysis of walkability of suburbs in Melbourne, Victoria and its implications.
  - Titanic Dataset - Exploratory Analysis: Exploratory analysis of the passengers on board the RMS Titanic using Pandas and Seaborn visualisations.
  - Stock Market Analysis for Tech Stocks: Analysis of technology stocks including change in price over time, daily returns, and stock behaviour prediction.
  - 2016 US General Election Poll Data Analysis: Very simple analysis of 2016 US General Election Poll data.
  - 911 Calls - Exploratory Analysis: Exploratory Data Analysis of the 911 calls dataset hosted on Kaggle. Demonstrates extraction of useful features from different variables.
Tools: Pandas, Folium, Seaborn and Matplotlib
- R
  - Behavioral Risk Factor Surveillance System (BRFSS) 2013: Exploratory Data Analysis: Exploratory analysis of the BRFSS-2013 data set, investigating the relationships between education and eating habits, between sleep and mental health, and between smoking, drinking and a person's general health.
  - Inferential Statistics: Do men or women oppose sex education?: Using the GSS (General Social Survey) dataset to infer whether, in 2012, men aged 18 or above in the United States were more likely than women to oppose sex education in public schools.
  - Data Visualization: Corruption and Human Development: A scatter plot for the relationship between the 'Human Development Index' and the 'Corruption Perceptions Index' of countries.
  - Moneyball: Analysing and replacing lost players: Exploration of baseball data for the year 2001 to look at replacements for key players lost by the Oakland A's in 2001. Inspired by the book/movie: Moneyball.
Micro Projects:
- Python
  - ML with Logistic Regression: Using Logistic Regression to predict whether an internet user clicked an ad or not.
  - ML with K Nearest Neighbours: Using KNN to classify instances from a fake dataset into two target classes, while choosing the best value for K using the elbow method.
  - ML with Decision Trees and Random Forests: Using Decision Trees and Random Forests to predict whether a borrower will pay their loan back. Uses publicly available data from LendingClub.com.
  - Movie Recommendations using Recommender Systems: A micro project to build a recommendation system that makes movie recommendations based on user review similarities.
- R
  - ML Logistic Regression: Predicting salary class of a person using logistic regression.
  - ML Decision Trees and Random Forests: Using Decision Trees and Random Forests to classify schools as Private or Public.
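
The elbow method mentioned in the K Nearest Neighbours micro project above amounts to computing the test error for a range of K values and picking the point where the curve flattens. The sketch below is a rough, dependency-free illustration on made-up data; it is not the notebook's code, and `make_data`, `knn_predict`, and `error_rate` are hypothetical helper names.

```python
import random
from collections import Counter


def make_data(n_per_class=60, seed=0):
    """Generate a made-up two-class dataset of 2-D points."""
    rng = random.Random(seed)
    points = []
    for label, centre in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
            points.append((point, label))
    rng.shuffle(points)
    return points


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, test, k):
    """Fraction of test points misclassified for a given k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)


data = make_data()
train, test = data[:80], data[80:]

# Odd values of k avoid tied votes in a two-class problem.
errors = {k: error_rate(train, test, k) for k in range(1, 21, 2)}
for k, err in errors.items():
    print(f"k={k:2d}  error={err:.3f}")
```

Plotting `errors` against K and choosing the smallest K after which the error stops improving noticeably gives the "elbow". In practice a library such as scikit-learn would replace the hand-written classifier.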

I also dabble in all other kinds of technology. You can find a general portfolio here.

If you liked what you saw, or want to chat with me about the portfolio, work opportunities, or collaboration, shoot an email to contact@sajalsharma.com.

Support My Work

If this project inspired you, gave you ideas for your own portfolio or helped you, please consider buying me a coffee ❤️.
