Data Science Bootcamp
My notebooks, learnings and results from a 12-week Data Science course at Spiced Academy.
Week 1: Data Wrangling
- Machine Learning Workflow: Steps for approaching a new dataset
- Data wrangling (pandas)
- Technical and design aspects of plotting data (matplotlib, seaborn)
- Descriptive statistics
- Pivoting / Wide and narrow data (pandas)
Project: Recreation of the famous animated scatterplot by Hans Rosling.
Presentation: Essentials of clean code inspired by Uncle Bob (Robert C. Martin).
Week 2: Classification Problem
- Data exploration, cleaning, imputation
- Feature engineering
- Encoding strategies, e.g. one-hot and ordinal
- Polynomial and interaction terms
- Designing preprocessing pipelines (sklearn)
- Logistic regression in math and application (sklearn)
- Evaluation of classifiers: Cross-Validation, Precision, Recall, F1-score
Project: Titanic survival prediction (Kaggle competition) using basic feature engineering and logistic regression.
Presentation: The different kinds of correlation coefficients (Pearson, Kendall and Spearman).
Week 3: Regression Problem
- Math and implementation of the gradient descent algorithm for linear regression
- Linear regression
- Regularization strategies: Lasso (L1), Ridge (L2), ElasticNet
- Model evaluation: R² score
- Feature expansion
- Hyperparameter optimization (sklearn)
Project: Prediction of bike sharing demand (Kaggle competition), including a submission to Kaggle.
Presentation: 3D surface plot using matplotlib to get an intuition on loss functions and the gradient descent approach.
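The gradient descent approach for linear regression covered this week can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook code from the course; the function name `fit_line`, the learning rate, the step count and the toy data are all assumptions chosen for the example.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Residuals of the current model on every data point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2) w.r.t. w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

On this noise-free data the fitted `w` and `b` should approach the slope 2 and intercept 1 used to generate it. A learning rate that is too large would make the updates diverge instead.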
Week 4: Naive Bayes Classification and NLP
- Naive Bayes classification
  - Theory
  - Application (sklearn)
- Natural Language Processing (NLP)
  - Vectorization of text: Bag-of-words, TF-IDF
- Class balancing strategies
- Web scraping, parsing, regular expressions, scrapy
Project: Classification (Multinomial Naive Bayes) of a text phrase to a musical artist based on their lyrics.
Presentation: How I gathered lyrics data with my own scraper built on scrapy.
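The idea behind the project, a Multinomial Naive Bayes classifier on bag-of-words counts, can be sketched without any library. This is a simplified illustration and not the sklearn-based project code; the functions `train` and `predict`, the add-one smoothing, the phrases and the artist labels are assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (text, label). Returns class counts, word counts, vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        # Log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed log likelihood of the word.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up phrases and placeholder artist names, not real lyrics data.
samples = [
    ("love me tender love me true", "artist_a"),
    ("my heart my love forever", "artist_a"),
    ("ride the highway through the night", "artist_b"),
    ("thunder on the highway tonight", "artist_b"),
]
model = train(samples)
prediction = predict("love forever", model)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a class.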
Week 5: Dashboards, Cloud & Databases
- Relational databases (PostgreSQL), Data modeling, SQL (Python SQLAlchemy)
- Cloud computing on AWS
- Unix administration basics
- Setting up PostgreSQL on AWS
- Setting up a Metabase dashboard on AWS (EC2)
Project: Metabase dashboard deployed on AWS.
Presentation: I used the lyrics data from week 4 to create a clustered map of songs with the ForceAtlas2 algorithm available in Gephi, which visualizes a graph by physically modelling it as masses and springs. The approach failed, most likely due to the curse of dimensionality.
Week 6: ETL Pipeline, Sentiment Analysis of Tweets
- Basics of Sentiment Analysis (NLP)
- ETL (Extract, Transform, Load) pipeline with Docker Compose
- MongoDB NoSQL database
- Web APIs (Twitter, Slack)
Project: Processing pipeline that fetches tweets in real time from Twitter for a given keyword and performs sentiment analysis on them. Pipeline: Fetch (Extract) tweets -> Store in MongoDB -> Load new tweets -> Perform sentiment analysis (Transform) -> Store tweet along with sentiment score in PostgreSQL.
Presentation: Debugging: the challenges I faced during the project and how I solved them.
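The pipeline stages described for this project can be sketched with in-memory stand-ins. This is only an illustration of the data flow, not the project code: the list `staging` stands in for MongoDB, the dict `warehouse` for PostgreSQL, the hard-coded `stream` for the Twitter API, and the tiny word lexicon is an assumed placeholder for a real sentiment model.

```python
# Assumed toy lexicon; a real pipeline would use a proper sentiment model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def extract(stream, keyword):
    """Extract: keep only tweets that mention the keyword."""
    return [t for t in stream if keyword.lower() in t["text"].lower()]


def sentiment_score(text):
    """Transform: count of positive words minus count of negative words."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def run_pipeline(stream, keyword, staging, warehouse):
    """Run one pass of the pipeline and return the number of new tweets stored."""
    # Extract tweets and stage the raw documents (MongoDB in the project).
    staging.extend(extract(stream, keyword))
    # Load only staged tweets that have not been processed yet.
    new_tweets = [t for t in staging if t["id"] not in warehouse]
    for tweet in new_tweets:
        # Store the tweet with its score (PostgreSQL in the project).
        warehouse[tweet["id"]] = {
            "text": tweet["text"],
            "sentiment": sentiment_score(tweet["text"]),
        }
    return len(new_tweets)


# Hypothetical tweets standing in for the live Twitter stream.
stream = [
    {"id": 1, "text": "I love python it is great"},
    {"id": 2, "text": "python packaging is awful"},
    {"id": 3, "text": "nice weather today"},
]
staging, warehouse = [], {}
processed = run_pipeline(stream, "python", staging, warehouse)
```

Keeping the raw tweets in a separate staging store means the transform step can be rerun or changed later without fetching the data again.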
Week 7: Time Series
- ARIMA Model, (Partial) autocorrelation function
- Evaluating Forecasts
- Statistical distribution functions
Project: Manual step-by-step ARIMA modelling of weather data. Frequency analysis using signal processing techniques (Fourier transform).
Presentation: Introduction to sonification of time series data.
Week 8: Markov Chain Monte Carlo (MCMC)
- Markov Chains & Monte Carlo simulations
- Linear Algebra
- OpenCV
- Software Design, OOP & Code Style
Project: Simulation of an average day in a supermarket based on an analysis of its customer data. Animated visualization of the simulated customers using OpenCV.
Presentation: Software design of the project.
Week 9: Artificial Neural Networks & Deep Learning
- Feed-Forward Neural Networks
- Convolutional Neural Networks for Image Processing
- Backpropagation
- Transfer Learning
- Recurrent Neural Networks
- Keras (TensorFlow Interface)
Project: Image classification using transfer learning on a pretrained network (MobileNetV2).
Presentation: Understanding the backpropagation algorithm through a graph-based view of the chain rule.
Week 10: Unsupervised Learning & Recommender Systems
- Principal Component Analysis (PCA)
- t-SNE
- Non-negative Matrix Factorization (NMF)
- Nearest neighbour approaches
- Clustering
- Cosine similarity
- Web App
  - Python Flask, Jinja & Bootstrap
  - Deployment on Heroku
Project: Movie recommender based on a cosine similarity approach and a movie rating dataset. See the deployed web app at mega-movie-recommender.herokuapp.com
Presentation: Details of my approach for the movie recommender.
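A cosine similarity recommender can be sketched as follows. This is one possible user-based variant under stated assumptions, not necessarily the approach used in the deployed app; the function names, the rating scale, the convention that 0 means "not rated" and all ratings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user without ratings is similar to nobody
    return dot / (norm_a * norm_b)


def recommend(user, ratings, movies):
    """Suggest the unseen movie with the highest similarity-weighted rating."""
    target = ratings[user]
    scores = [0.0] * len(movies)
    for other, vector in ratings.items():
        if other == user:
            continue
        # Weight every other user's ratings by their similarity to the target.
        weight = cosine_similarity(target, vector)
        for i, rating in enumerate(vector):
            scores[i] += weight * rating
    # Only consider movies the user has not rated yet (rating 0).
    unseen = [i for i, rating in enumerate(target) if rating == 0]
    best = max(unseen, key=lambda i: scores[i])
    return movies[best]


# Hypothetical ratings on a 1-5 scale; 0 means "not rated".
movies = ["Alien", "Blade Runner", "Amelie", "Notting Hill"]
ratings = {
    "ann": [5, 0, 1, 0],
    "bob": [5, 5, 1, 1],
    "cat": [1, 1, 5, 5],
}
suggestion = recommend("ann", ratings, movies)
```

Here `ann` rates like `bob`, so `bob`'s favourites dominate the weighted scores. The sketch assumes the user has at least one unrated movie; a real app would handle the empty case.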
Week 11-12: Final Project
Topic Modelling on Bundestag Speeches (German Parliament). See github.com/raphaelw/nlp-bundestag