Data Science Bootcamp

My notebooks, learnings and results from a 12-week Data Science course at Spiced Academy

Week 1: Data Wrangling

  • Machine Learning Workflow: Steps for approaching a new dataset
  • Data wrangling (pandas)
  • Technical and design aspects of plotting data (matplotlib, seaborn)
  • Descriptive statistics
  • Pivoting / Wide and narrow data (pandas)
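
The wide and narrow formats mentioned in the list can be illustrated with a minimal pandas sketch. The data frame below is a made-up toy example (the column names `country`, `year` and `life_expectancy` are assumptions for illustration, not the course data):

```python
import pandas as pd

# Hypothetical toy data in wide format: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [70.1, 65.3],
    "2001": [70.4, 65.9],
})

# Wide -> narrow (long): one row per (country, year) observation.
narrow = wide.melt(id_vars="country", var_name="year", value_name="life_expectancy")

# Narrow -> wide: pivot restores one column per year, indexed by country.
wide_again = narrow.pivot(index="country", columns="year", values="life_expectancy")
```

The narrow format is usually the more convenient input for plotting libraries such as seaborn, while the wide format is easier to read as a table.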

Project: Recreation of the famous animated scatterplot by Hans Rosling.

Presentation: Essentials of clean code inspired by Uncle Bob (Robert C. Martin).

Week 2: Classification Problem

  • Data exploration, cleaning, imputation
  • Feature engineering
    • Encoding strategies, e.g. one-hot, ordinal
    • Polynomial and interaction terms
  • Designing preprocessing pipelines (sklearn)
  • Logistic regression in math and application (sklearn)
  • Evaluation of classifiers: Cross-Validation, Precision, Recall, F1-score

Project: Titanic survival prediction (Kaggle competition) using basic feature engineering and logistic regression.
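
A preprocessing pipeline of the kind listed above might look roughly like the following sketch. It runs on randomly generated toy data, not on the Kaggle data set; the column names, the noisy toy target and all parameter values are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data loosely shaped like passenger records.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["female", "male"], n),
})
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan  # simulate missing values

# Toy target for illustration only: follows "sex", with 10% of labels flipped.
is_female = (df["sex"] == "female").to_numpy()
y = np.logical_xor(is_female, rng.random(n) < 0.1).astype(int)

# Numeric columns: impute missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation, scored with the F1 measure.
scores = cross_val_score(model, df, y, cv=5, scoring="f1")
```

Keeping the imputation and encoding inside the pipeline means they are re-fitted on each training fold, so no information from the validation fold leaks into preprocessing.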

Presentation: The different kinds of correlation coefficients (Pearson, Kendall and Spearman).

Week 3: Regression Problem

  • Math and implementation of the gradient descent algorithm for linear regression
  • Linear regression
    • Regularization strategies: Lasso (L1), Ridge (L2), ElasticNet
    • Model evaluation: R² score
  • Feature expansion
  • Hyperparameter optimization (sklearn)

Project: Prediction of bike sharing demand (Kaggle competition). Submission to Kaggle.

Presentation: 3D surface plot using matplotlib to get an intuition on loss functions and the gradient descent approach.
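
Gradient descent for linear regression can be sketched in a few lines of NumPy. The example assumes a one-dimensional model `y = w * x + b` with a mean squared error loss; the synthetic data, learning rate and step count are illustrative choices:

```python
import numpy as np

# Synthetic data for illustration: y = 2x + 1 plus a little noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, 200)


def gradient_descent(x, y, learning_rate=0.1, n_steps=2000):
    """Fit y ~ w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        error = w * x + b - y
        # Partial derivatives of mean(error ** 2) with respect to w and b.
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        # Step against the gradient, i.e. downhill on the loss surface.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = gradient_descent(x, y)
```

The fitted `w` and `b` should end up close to the slope 2 and intercept 1 used to generate the data.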

Week 4: Naive Bayes Classification and NLP

  • Naive Bayes classification
    • Theory
    • Application (sklearn)
  • Natural Language Processing (NLP)
    • Vectorization of text: Bag-of-words, TF-IDF
  • Class balancing strategies
  • Web scraping, parsing, regular expressions, scrapy

Project: Classification (Multinomial Naive Bayes) of a text phrase to a musical artist based on their lyrics.
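
The combination of TF-IDF vectorization and Multinomial Naive Bayes can be sketched with scikit-learn as follows. The lyrics lines and the artist labels are made-up placeholders, not the scraped data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up lines standing in for scraped lyrics; artist names are placeholders.
lyrics = [
    "love you baby love tonight",
    "baby hold my heart tonight",
    "dusty road and an old truck",
    "whiskey on the dusty road",
]
artists = ["artist_a", "artist_a", "artist_b", "artist_b"]

# Vectorize the text with TF-IDF, then classify with Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(lyrics, artists)

# Words the vectorizer has never seen (here "me") are simply ignored.
prediction = model.predict(["hold me tonight baby"])[0]
```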

Presentation: How I gathered lyrics data with my own scraper based on scrapy.

Week 5: Dashboards, Cloud & Databases

  • Relational databases (PostgreSQL), Data modeling, SQL (Python SQLAlchemy)
  • Cloud computing on AWS
    • Unix administration basics
    • Setting up PostgreSQL on AWS
    • Setting up a Metabase dashboard on AWS (EC2)

Project: Metabase dashboard deployed on AWS.

Presentation: Used the lyrics data from week 4 to create a clustered map of songs using the ForceAtlas2 algorithm available in Gephi, which visualizes a graph by physically modelling masses and springs. The approach failed, most likely due to the curse of dimensionality.

Week 6: ETL Pipeline, Sentiment Analysis of Tweets

  • Basics of Sentiment Analysis (NLP)
  • ETL (Extract, Transform, Load) pipeline with Docker Compose
    • MongoDB NoSQL database
    • Web APIs (Twitter, Slack)

Project: Processing pipeline that fetches tweets from Twitter in real time for a given keyword and performs a sentiment analysis on them. Pipeline: Fetch (Extract) tweets -> Store in MongoDB -> Load new tweets -> Perform sentiment analysis (Transform) -> Store each tweet along with its sentiment score in PostgreSQL.

Presentation: Debugging. What challenges I faced during the project and how I solved them.

Week 7: Time Series

  • ARIMA Model, (Partial) autocorrelation function
  • Evaluating Forecasts
  • Statistical distribution functions

Project: Manual step-by-step ARIMA modelling of weather data. Frequency analysis using signal processing techniques (Fourier transform).
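
A frequency analysis with the Fourier transform can be sketched as below. The series is synthetic (a yearly sine cycle plus noise on a daily grid), standing in for real weather data; amplitude, noise level and length are assumptions:

```python
import numpy as np

# Synthetic daily temperatures for illustration: a yearly cycle plus noise.
rng = np.random.default_rng(0)
n_days = 3650  # ten years of 365 days each
t = np.arange(n_days)
temperature = 10.0 + 8.0 * np.sin(2 * np.pi * t / 365.0) + rng.normal(0, 2.0, n_days)

# Remove the mean so the zero-frequency bin does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(temperature - temperature.mean()))
freqs = np.fft.rfftfreq(n_days, d=1.0)  # frequencies in cycles per day

# The strongest peak corresponds to the dominant seasonality.
dominant = freqs[np.argmax(spectrum)]
period_days = 1.0 / dominant
```

For this synthetic series the strongest peak sits at a period of 365 days, which is the kind of seasonality one would remove or model before fitting an ARIMA model.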

Presentation: Introduction to sonification of time series data.

Week 8: Markov Chain Monte Carlo (MCMC)

  • Markov Chains & Monte Carlo simulations
  • Linear Algebra
  • OpenCV
  • Software Design, OOP & Code Style

Project: Simulation of an average day in a supermarket based on the data analysis of its customer data. Animated visualization using OpenCV of simulated customers.
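
The core of such a simulation is a Markov chain over store locations. The sketch below uses hypothetical section names and made-up transition probabilities; in the project these would be estimated from the customer data:

```python
import numpy as np

states = ["entrance", "dairy", "fruit", "spices", "checkout"]

# Hypothetical transition probabilities, one row per current state (rows sum to 1).
P = np.array([
    [0.0, 0.4, 0.4, 0.2, 0.0],  # entrance
    [0.0, 0.5, 0.2, 0.1, 0.2],  # dairy
    [0.0, 0.2, 0.5, 0.1, 0.2],  # fruit
    [0.0, 0.2, 0.2, 0.4, 0.2],  # spices
    [0.0, 0.0, 0.0, 0.0, 1.0],  # checkout (absorbing state)
])


def simulate_customer(rng, max_steps=1000):
    """Walk one customer from the entrance until checkout is reached."""
    path = [0]  # index of "entrance"
    while states[path[-1]] != "checkout" and len(path) < max_steps:
        # The next state depends only on the current one (Markov property).
        path.append(rng.choice(len(states), p=P[path[-1]]))
    return [states[i] for i in path]


rng = np.random.default_rng(1)
path = simulate_customer(rng)
```

Repeating `simulate_customer` for many customers is the Monte Carlo part: aggregate statistics such as the average number of visited sections emerge from the random walks.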

Presentation: Software design of the project.

Week 9: Artificial Neural Networks & Deep Learning

  • Feed-Forward Neural Networks
  • Convolutional Neural Networks for Image Processing
  • Backpropagation
  • Transfer Learning
  • Recurrent Neural Networks
  • Keras (TensorFlow Interface)

Project: Image classification using transfer learning on a pretrained network (MobileNetV2).

Presentation: Understanding the backpropagation algorithm by using a graph approach of the chain rule.
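
The graph view of the chain rule can be made concrete with a tiny network of one sigmoid hidden unit and one linear output. The architecture, the squared error loss and the input values are assumptions picked to keep the sketch short:

```python
import math


def forward_backward(x, y, w1, w2):
    """Return the loss and its gradients with respect to w1 and w2."""
    # Forward pass through the graph: x -> z1 -> a1 -> y_hat -> loss
    z1 = w1 * x
    a1 = 1.0 / (1.0 + math.exp(-z1))  # sigmoid activation
    y_hat = w2 * a1
    loss = (y_hat - y) ** 2

    # Backward pass: multiply the local derivatives along each edge (chain rule).
    dloss_dyhat = 2.0 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * a1             # d y_hat / d w2 = a1
    dloss_da1 = dloss_dyhat * w2             # d y_hat / d a1 = w2
    dloss_dz1 = dloss_da1 * a1 * (1.0 - a1)  # derivative of the sigmoid
    dloss_dw1 = dloss_dz1 * x                # d z1 / d w1 = x
    return loss, dloss_dw1, dloss_dw2


loss, grad_w1, grad_w2 = forward_backward(x=1.5, y=0.5, w1=0.8, w2=-1.2)
```

A quick way to validate such hand-derived gradients is to compare them against finite differences of the loss.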

Week 10: Unsupervised Learning & Recommender Systems

  • Principal Component Analysis (PCA)
  • t-SNE
  • Non-negative Matrix Factorization (NMF)
  • Nearest neighbour approaches
    • Clustering
    • Cosine similarity
  • Web App
    • Python Flask, Jinja & Bootstrap
    • Deployment on Heroku

Project: Movie recommender based on a cosine similarity approach and a movie rating data set. See deployed web app at mega-movie-recommender.herokuapp.com
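
One way to build a recommender on cosine similarity is a user-based scheme, sketched below. This is an assumption about the general technique, not necessarily the variant used in the deployed app; the rating matrix and movie names are made up, and unrated movies are simply treated as zeros:

```python
import numpy as np

# Hypothetical user x movie rating matrix (0 = not rated); not the project's data.
movies = ["movie_a", "movie_b", "movie_c", "movie_d"]
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],  # user 0
    [4.0, 5.0, 1.0, 0.0],  # user 1
    [1.0, 0.0, 5.0, 4.0],  # user 2
])


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def recommend(new_user, ratings, movies):
    """Recommend the unseen movie with the highest similarity-weighted rating."""
    sims = np.array([cosine_similarity(new_user, row) for row in ratings])
    # Similarity-weighted average of the known users' ratings.
    scores = sims @ ratings / sims.sum()
    scores[new_user > 0] = -np.inf  # never recommend an already rated movie
    return movies[int(np.argmax(scores))]


# A new user who has only rated movie_a, and rated it highly.
new_user = np.array([5.0, 0.0, 0.0, 0.0])
best = recommend(new_user, ratings, movies)
```

Here the new user is most similar to users 0 and 1, who both rated `movie_b` highly, so `movie_b` gets the top score.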

Presentation: Details of my approach for the movie recommender.

Week 11-12: Final Project

Topic Modelling on Bundestag Speeches (German Parliament). See github.com/raphaelw/nlp-bundestag