This hands-on course teaches the tools and methods used by data scientists, from researching solutions to scaling prototypes up to Spark clusters. It exposes students to the entire data science pipeline, from data acquisition to extracting valuable insights for real-world problems.
Students work in groups of four on big data science problems of the kind typically faced in industry. There are four graded homework assignments and a final project with a video presentation.
Questions and discussions about the course are gathered on Slack: https://epfl-dslab2020.slack.com. You will receive an invitation to join the workspace.
- Final assignment: description
- module 1
- Jupyter Notebooks
- Python 3.x
- NumPy, Pandas, Matplotlib, Scikit-Learn
- slides (pdf):
- exercises
- module 1
- Reproducible data science
- Git, Docker, Renku
- slides (pdf):
- exercises (EPFL access required)
- module 2
- Introduction to big data, best practices and guidelines
- Loading & querying data with Hadoop
- HDFS, Hive
- slides (pdf):
- exercises
- module 2
- Data wrangling with Hadoop
- slides (pdf):
- exercises
- assessed project
- homework 1 instructions
- module 2
- Introduction to distributed computing and the Spark runtime architecture
- Python on Spark
- Basic RDD manipulations
- slides:
- exercises
- module 3
- Spark data frames
- slides (pdf):
- exercises
- assessed project
- homework 1 due before 00:00 CET
- homework 1 solutions
- homework 2 instructions
- module 3
- Advanced Spark, optimizations and partitioning
- slides (pdf):
- lab
- No industry talk this week
- exercises
- week 6 solutions
- No explicit exercise this week; however, you can extend the COVID demo project and do some basic data science on an important topic!
- module 3
- Advanced Spark, optimizations and partitioning
- Practical exercises with Twitter, SBB data and partitioning
- slides:
- Lab is in the form of an exercise notebook
- industry
- exercises
- assessed project
- homework 1 grades
- homework 2 due before 00:00 CEST
- homework 2 solutions
- homework 3 instructions
- module 4
- Introduction to data stream processing
- Apache Kafka for stream processing
- slides:
- exercises
- module 4
- Advanced data stream processing concepts on Spark with Kafka
- slides:
- exercises
- assessed project
- homework 2 grades
- homework 3 due before 23:59 CEST
- homework 3 solutions
- homework 4 instructions
- module 4
- Data in motion and data at rest
- slides (pdf):
- exercises
- final assignment
- Useful tips and hints
- slides (pdf):
- exercises
- assessed project
- homework 3 grades
- homework 4 due before 00:00 CEST
- homework 4 solutions
- final assignment presentation
- final assignment
- Q&A office hours
- final assignment (25.05 noon)
- 7 min (max) video and notebook due by midnight
- final assignment (27.05)
- Oral Q&A (video calls of 6 min per group)
- assessed project (27.05)
- homework 4 grades available