epfl-dslab: A Jupyter Notebook repository from yehchunhung

Description

This hands-on course teaches the tools and methods used by data scientists, from researching solutions to scaling up prototypes to Spark clusters. It exposes students to the entire data science pipeline, from data acquisition to extracting valuable insights, applied to real-world problems.

Students work in groups of four on big data science problems of the kind typically faced in industry. There are four graded homeworks and a final project with a video presentation.

Questions

Questions and discussions about the course are gathered on Slack: https://epfl-dslab2020.slack.com. You will receive an invitation to join the workspace.

Final Project

Final assignment: description

Lab Sessions

Week 1 - 19.02.2020

module 1
- Jupyter Notebooks
- Python 3.x
- NumPy, Pandas, Matplotlib, Scikit-Learn
slides (pdf):
- lab
- industry
exercises
- week 1

Week 2 - 26.02.2020

module 1
- Reproducible data science
- Git, Docker, Renku
slides (pdf):
- lab
exercises (EPFL access required)
- week 1 solutions
- week 2

Week 3 - 04.03.2020

module 2
- Introduction to big data, best practices and guidelines
- Loading & querying data with Hadoop
- HDFS, Hive
slides (pdf):
- lab
- industry
exercises
- week 2 solutions
- week 3

Week 4 - 11.03.2020

module 2
- Data wrangling with Hadoop
slides (pdf):
exercises
- week 3 solutions
- week 4
assessed project
- homework 1 instructions

Week 5 - 18.03.2020

module 2
- Introduction to distributed computing and the Spark runtime architecture
- Python on Spark
- Basic RDD manipulations
slides:
- lab web
- lab pdf
- No industry talk this week
exercises
- week 4 solutions
- week 5

Week 6 - 25.03.2020

module 3
- Spark data frames
slides (pdf):
- lab web
- industry
exercises
- week 5 solutions
- week 6
assessed project
- homework 1 due before 00:00 CET
- homework 1 solutions
- homework 2 instructions

Week 7 - 01.04.2020

module 3
- Advanced Spark, optimizations and partitioning
slides (pdf):
- lab
- No industry talk this week
exercises
- week 6 solutions
- No explicit exercise this week; however, you can extend the covid demo project and do some basic data science on an important topic!

Week 8 - 08.04.2020

module 3
- Advanced Spark, optimizations and partitioning
- Practical exercises with Twitter, SBB data and partitioning
slides:
- Lab is in the form of an exercise notebook
- industry
exercises
- week 8
assessed project
- homework 1 grades
- homework 2 due before 00:00 CEST
- homework 2 solutions
- homework 3 instructions

Easter break! - 15.04.2020

Week 9 - 22.04.2020

module 4
- Introduction to data stream processing
- Apache Kafka for stream processing
slides:
exercises
- week 8 solutions
- week 9

Week 10 - 29.04.2020

module 4
- Advanced data stream processing concepts on Spark with Kafka
slides:
- lab
- industry
exercises
- week 9 solutions
assessed project
- homework 2 grades
- homework 3 due before 23:59 CEST
- homework 3 solutions
- homework 4 instructions

Week 11 - 06.05.2020

module 4
- Data in motion and data at rest
slides (pdf):
- lab
exercises
- week 10 solutions
- week 11

Week 12 - 13.05.2020

final assignment
- Useful tips and hints
slides (pdf):
- lab
exercises
- week 11 solutions
assessed project
- homework 3 grades
- homework 4 due before 00:00 CEST
- homework 4 solutions
- final assignment presentation

Week 13 - 20.05.2020

final assignment
- Q&A office hours

Week 14 - 24.05.2020 - 27.05.2020

final assignment (25.05 noon)
- 7 min (max) video and notebook due by midnight
final assignment (27.05)
- Oral Q&A (video calls of 6 min per group)
assessed project (27.05)
- homework 4 grades available
