Data engineering - AI22

The aim of this course is to learn data engineering concepts both theoretically and through implementations using various technologies. Data engineering is an important field focused on ingesting data into a data pipeline and transforming it in different ways to serve and enable downstream work. Ideally, roles such as data scientist, data analyst and business intelligence analyst should not have to worry about where to find data or how to transform it, but can instead work within their respective specialist domains.
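To make the ingest-transform-serve idea concrete, here is a minimal ETL sketch in Python using pandas (which this course builds on). The data and column names are made up for illustration; a real pipeline would extract from a source system and load into a database or file rather than printing.

```python
import pandas as pd

# Extract: raw records as they might arrive from a source system
# (hypothetical example data)
raw = pd.DataFrame({
    "city": ["Stockholm", "Göteborg", "Stockholm"],
    "temp_c": [3.0, 5.5, 4.0],
})

# Transform: aggregate to a tidy result, so downstream analysts
# do not have to repeat this cleaning step themselves
cleaned = raw.groupby("city", as_index=False)["temp_c"].mean()

# Load/serve: in a real pipeline this would be written to e.g. postgres;
# here we just show the result
print(cleaned)
```

The point of the pattern is the separation of concerns: the pipeline owns extraction and transformation, and downstream roles only consume `cleaned`.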

This course builds upon previous skills in:

  • pandas, numpy
  • data visualisation tools: matplotlib, seaborn and plotly
  • git and github

This course repo contains all lecture code, lecture slides and exercises.


Schedule

The schedule is divided into two parts: pre-summer and post-summer. The first part focuses heavily on theory and core knowledge; in the second part you apply what you have learnt in a realistic project.

Week  Content
21    course intro, why data engineering, Linux, data pipelines, docker intro, dockerfile, study visit Ericsson
22    containers, docker-compose, data engineering lifecycle, ETL, Airflow intro, DAGs, operators, tasks, study visit Olsaro
23    orchestrating data pipelines, xcom, connecting to postgres, serving a downstream dashboard and ML
33    continued ETL and ELT pipeline orchestration with Airflow, guest lecture on agile theory
34    project start, data version control (DVC), github actions CI/CD, pre-commit
35    project, intro to Azure, deploying a pipeline to Azure
36    project, ethics and data security - guest lecture?
37    project - presentation and report
38    inspiration: spark, databricks, data lake, data factory, ETL pipeline on Azure, modern data stack - study visit Knowit

Note that this schedule is an overview and will be updated during the course.