Jithendra's Data Science Portfolio

Here is an exhaustive list of my projects:

Stand-alone projects

FPGA Neural Network Accelerator

We designed a neural network accelerator for the Darknet Reference Model (which is 2.9 times faster than AlexNet and attains the same top-1 and top-5 performance as AlexNet with 1/10th the parameters) for image classification on the ImageNet dataset, targeting an Intel Cyclone V SoC FPGA. I worked on this as a part-time undergraduate researcher under the guidance of Prof. Vinod Kumar Jain. When connected to an ARM Cortex-A9 processor using the OpenCL framework, the accelerator achieved around 300% faster inference than the CPU.

Contributor: Tirumal Naidu

AWS SageMaker - Fraud Detection service

The goal of this project is to understand the complete machine learning workflow (data collection, data storage, data preprocessing, model selection, training, and finally model deployment) using AWS SageMaker. I built an end-to-end fraud detection service using services provided by AWS. I ran a training job and deployed the model using SageMaker, created an endpoint that can be invoked by Lambda, and created an API with API Gateway in order to send requests to a Flask application. I deployed the application in an AWS Cloud9 environment and finally integrated it with SNS to alert the client by email when fraud is detected.

Early detection of Autism in toddlers

Studied various approaches to identifying autism spectrum disorder (ASD) traits in toddlers. Designed a system that analyses gaze patterns for early detection of autism. The system correctly predicted whether a child has autism 62% of the time. Studied the pros and cons of using gaze as a measure for autism screening in toddlers.

Classification problems

Heartbeat anomaly detection

Detected anomalies in heartbeats using an LSTM autoencoder. The dataset contains 5000 time series sequences, each with 140 time steps, obtained by ECG and corresponding to heartbeats from a single patient. Trained and evaluated the autoencoder, chose a threshold for anomaly detection, and finally classified unseen examples as normal or anomalous.
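The thresholding step can be illustrated with a minimal sketch. It assumes the autoencoder has already produced a reconstruction for each sequence. The mean-plus-two-standard-deviations rule and the toy error values are assumptions made for illustration, not necessarily what the project used.

```python
from statistics import mean, stdev


def reconstruction_error(original, reconstructed):
    """Mean absolute difference per time step between a sequence and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


def choose_threshold(normal_errors, num_std=2.0):
    """Hypothetical rule: mean + num_std * std of the errors seen on normal training data."""
    return mean(normal_errors) + num_std * stdev(normal_errors)


def classify(error, threshold):
    """Label a sequence by comparing its reconstruction error with the threshold."""
    return "anomaly" if error > threshold else "normal"


# Toy errors standing in for autoencoder output on normal heartbeats (made-up values).
normal_errors = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10]
threshold = choose_threshold(normal_errors)

# Two unseen examples: one reconstructed well, one reconstructed poorly.
labels = [classify(e, threshold) for e in (0.11, 0.45)]
```

The idea is that an autoencoder trained only on normal heartbeats reconstructs them well, so a large reconstruction error suggests an anomaly.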

Credit card fraud detection

Identified fraudulent credit card transactions in a highly imbalanced dataset using an oversampling method (SMOTE) and an ensemble learning model (Random Forest).
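As a rough illustration of the oversampling step only, here is a simplified sketch of the interpolation idea behind SMOTE: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. This is not the library implementation used in the project. The function name `smote_sketch` and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority-class samples with two made-up features.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_sketch(minority, n_new=6)
```

In practice the oversampling should be applied to the training split only, so that synthetic points do not leak into the evaluation data.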

Titanic: Machine Learning from Disaster

Titanic: Machine Learning from Disaster is a knowledge competition on Kaggle. Like many others, I started practicing machine learning with it. Various versions of notebooks and approaches can be found in this GitHub repo.

Quantitative Analysis

Buy or Sell Stocks? - Dual Moving Average Crossover (DMAC) trading strategy

Predicted when to buy or sell stocks using a simple dual moving average crossover strategy, then backtested it over 5 years of stock data. I used the Yahoo! Finance data downloader to download the stock data of Maruti Suzuki (MARUTI.NS). The DMAC strategy, with short and long windows of 13 and 48 days respectively, gave an estimated return of 113% over 5 years. However, it is up to the trader to choose the number of days for each of the two moving averages. This should be done after testing and evaluating the system thoroughly, using the trader's own method.
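The crossover rule itself is simple: buy when the short moving average crosses above the long one, and sell when it crosses below. Here is a minimal sketch in plain Python. The function names and the toy price series are hypothetical, and the windows are shortened to 2 and 4 so the example stays small.

```python
def moving_average(prices, window):
    """Simple moving average; None until `window` prices are available."""
    averages = []
    for i in range(len(prices)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(prices[i + 1 - window : i + 1]) / window)
    return averages


def dmac_signals(prices, short_window, long_window):
    """Return (index, 'buy' | 'sell') pairs where the two averages cross."""
    short_ma = moving_average(prices, short_window)
    long_ma = moving_average(prices, long_window)
    signals = []
    for i in range(1, len(prices)):
        if long_ma[i - 1] is None:
            continue  # not enough history for the long average yet
        prev_diff = short_ma[i - 1] - long_ma[i - 1]
        curr_diff = short_ma[i] - long_ma[i]
        if prev_diff <= 0 < curr_diff:
            signals.append((i, "buy"))   # short average crossed above the long one
        elif prev_diff >= 0 > curr_diff:
            signals.append((i, "sell"))  # short average crossed below the long one
    return signals


# Toy closing prices (made-up values, not MARUTI.NS data).
prices = [10, 9, 8, 7, 8, 10, 12, 13, 11, 8, 6, 5]
signals = dmac_signals(prices, short_window=2, long_window=4)
```

A backtest would then walk through these signals, entering and exiting positions at the corresponding prices to estimate the return.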

Regression Problems

Predicting chance of admission for MS applications

It was almost admission season, and a couple of my friends were preparing for the GRE. This made me wonder: what could their chance of admission be, and how might it vary with other parameters? In this notebook, I used the dataset mentioned in this paper and tried to predict the chance of admission based on different parameters. Before modeling and predicting, I performed exploratory data analysis on the data to get some insights into MS admissions. I used PyCaret for modeling.

Time Series Problems

Forecasting Air pollution

In this project, I explored various models for forecasting time series. I then compared the performance of the models on two different metrics. I forecasted the amount of air pollution based on historical pollution data. I used the public Beijing pollution dataset, which contains data from 2010 to 2014 along with extra weather features such as temperature, wind speed, and pressure.

Stock data analysis and forecasting

This is a playground project where I explored time series data of historical stock prices of some publicly listed companies. Stock data was collected using Pandas Datareader with the help of the Tiingo API. I experimented with Long Short-Term Memory (LSTM) networks and Facebook's Prophet to forecast the stock prices.

Forecasting Monthly robberies

This is a playground project where I experimented with an ARIMA model for forecasting monthly robberies in Boston. I manually configured ARIMA, then grid-searched the ARIMA parameters, and also experimented with data transformations.

NLP projects

Using News headlines to predict stock movements

Predicted whether the Dow Jones Industrial Average (DJIA) Adj. Close value rises or falls based on the sentiment of top news headlines. The dataset I used for this analysis is from Kaggle. I used the VADER sentiment analysis package, which is a lexicon and rule-based sentiment analysis tool. I calculated the polarity and subjectivity of each day's news headlines and used them as features along with the stock features. Then I modeled the data with various classifiers, and the Linear Discriminant Analysis (LDA) classifier gave the best accuracy.

Computer Vision Projects

Reading Captchas

This is a playground computer vision project where I experimented with LeNet using Keras to detect and read the numbers in captcha images.

Data Analysis

Analysis of my personal Spotify streaming history

I know that my music taste has changed a lot in the past few years, so I wanted to see how it changed over time. I collected my streaming history using Spotipy, a lightweight client for extracting many features from Spotify's Web API. Analyzing my Spotify streaming history helped me find my top songs, artists, and genres, and the time I spend on each of them.

Analysis of my favorite artist's discography

I analyzed Tyler, the Creator's discography to see how his music varied from album to album. This helped me find which album is more energetic, which one is more danceable, and many other cool insights. I repeated the same analysis for Black Sabbath too.

WhatsApp group chat analysis

In this fun project, I decided to try my hand at text data for the first time. I analysed the text data from a WhatsApp group of my friends! WhatsApp has an option to export a chat to a .txt file, which I used to extract the group chat messages. I then performed some basic cleaning followed by analysis.

Unsupervised Problems

Clustering songs based on features

The goal of this project is to automatically divide a whole playlist of songs into two playlists of different moods, such as energetic songs and relaxing songs. I extracted the whole discography of my favorite artists into a CSV file using Spotipy, then used the KMeans algorithm to cluster all the songs of an artist into two clusters, Relaxing and Energetic, using 3 features: Energy, Danceability and Loudness. I then added the clusters back to my Spotify library as separate playlists.
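To show what the clustering step does, here is a minimal pure-Python k-means sketch with two clusters over the three features. The project itself used a KMeans implementation. The toy feature values, the first-k-points initialisation, and the assumption that loudness has been rescaled to the 0 to 1 range are all illustrative choices of this sketch.

```python
import math


def kmeans(points, k=2, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy songs as (energy, danceability, scaled loudness); values are made up.
songs = [
    (0.90, 0.80, 0.85),
    (0.20, 0.30, 0.25),
    (0.85, 0.75, 0.90),
    (0.15, 0.35, 0.20),
    (0.80, 0.70, 0.80),
    (0.25, 0.40, 0.30),
]
labels, centroids = kmeans(songs, k=2)

# Name the cluster whose centroid has the higher energy "Energetic".
energetic = max(range(2), key=lambda c: centroids[c][0])
playlists = ["Energetic" if label == energetic else "Relaxing" for label in labels]
```

Scaling matters here because Spotify reports loudness in decibels, which would otherwise dominate the distance over the 0 to 1 energy and danceability values.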

Recommendation systems

Simple Content-Based Song Recommender

There are several approaches to building recommendation systems, and one of them is the content-based approach. This notebook demonstrates a simple content-based recommender for songs.
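One common way to build a content-based recommender is to describe each song by a feature vector and recommend the songs whose vectors are most similar to one the listener already likes. Here is a minimal sketch using cosine similarity. The catalog, song titles, and feature values are hypothetical, and the notebook may use different features or a different similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(catalog, liked_title, top_n=2):
    """Return the top_n titles most similar to the liked song, excluding the song itself."""
    liked = catalog[liked_title]
    scored = [
        (cosine_similarity(liked, features), title)
        for title, features in catalog.items()
        if title != liked_title
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_n]]


# Hypothetical catalog: title -> (energy, danceability); values are made up.
catalog = {
    "Song A": (0.90, 0.80),
    "Song B": (0.85, 0.75),
    "Song C": (0.10, 0.90),
    "Song D": (0.20, 0.30),
}
recommendations = recommend(catalog, "Song A", top_n=1)
```

Unlike collaborative filtering, this approach needs no listening data from other users, only features of the songs themselves.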

jithendray/data-science-portfolio