Data engineering - AI22

The aim of this course is to learn data engineering concepts both theoretically and through implementations using various technologies. Data engineering is an important field focused on ingesting data into a data pipeline and transforming it in different ways to serve and enable downstream work. Ideally, roles such as data scientist, data analyst and business intelligence analyst should not have to worry about where to find data or how to transform it, but can instead work within their respective specialist domains.
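To make the ingest-transform-serve idea concrete, here is a minimal ETL sketch in Python using pandas (which this course builds on). The data and column names are made up for illustration; a real pipeline would extract from a source system and load into a database or file rather than printing.

```python
import pandas as pd

# Extract: raw records as they might arrive from a source system
# (hypothetical example data)
raw = pd.DataFrame({
    "city": ["Stockholm", "Göteborg", "Stockholm"],
    "temp_c": [3.0, 5.5, 4.0],
})

# Transform: aggregate to a tidy result, so downstream analysts
# do not have to repeat this cleaning step themselves
cleaned = raw.groupby("city", as_index=False)["temp_c"].mean()

# Load/serve: in a real pipeline this would be written to e.g. postgres;
# here we just show the result
print(cleaned)
```

The point of the pattern is the separation of concerns: the pipeline owns extraction and transformation, and downstream roles only consume `cleaned`.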

This course builds upon previous skills in:

  • pandas, numpy
  • data visualisation tools: matplotlib, seaborn and plotly
  • git and github

This course repo contains all lecture code, lecture slides and exercises.


Schedule

The schedule is divided into two parts: pre-summer and post-summer. The first part focuses heavily on theory and core knowledge; in the second part you apply what you have learnt in a realistic project.

Week  Content
21    course intro, why data engineering, Linux, data pipelines, docker intro, dockerfile, study visit Ericsson
22    containers, docker-compose, data engineering lifecycle, ETL, Airflow intro, DAGs, operators, tasks, study visit Olsaro
23    orchestrating data pipelines, xcom, connecting to postgres, serving a downstream dashboard and ML
33    continued ETL and ELT pipeline orchestration with Airflow, guest lecture on agile theory
34    project start, data version control (DVC), github actions CI/CD, pre-commit
35    project, intro to Azure, deploying a pipeline to Azure
36    project, ethics and data security - guest lecture?
37    project - presentation and report
38    inspiration: spark, databricks, data lake, data factory, ETL pipeline on Azure, modern data stack - study visit Knowit

Note that this schedule is an overview and will be updated during the course.