Created by: Kelvin Oyanna
Email: dotkelplus@gmail.com
LinkedIn: https://www.linkedin.com/in/oyannakelvin/
Twitter: @kelvinoyanna
Data engineering is a specialization in the data field concerned with building scalable data infrastructure and pipelines that aggregate data from multiple sources and consolidate it into an analytics data warehouse, supporting organization-wide analytics and reports used by data analysts, data scientists, and the BI team.
Python:
SQL:
Database/Data modeling:
- Get the book: The Data Warehouse Toolkit by Ralph Kimball and Margy Ross.
Cloud Infrastructure:
- Learn cloud fundamentals (Google Cloud or AWS)
- Practice your knowledge of cloud engineering with a cloud sandbox: https://kodekloud.com/
Building ETL pipelines using:
- Open-source tools (Python, SQL & Apache Airflow). See an example walkthrough here: https://www.startdataengineering.com/ (a minimal Python sketch follows below).
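To make the ETL idea concrete, here is a minimal, illustrative Python sketch using only the standard library and SQLite; the file name, column names, and table name are placeholders for your own sources.

```python
# Minimal ETL sketch: extract rows from a CSV file, apply a small
# transformation, and load them into a SQLite "warehouse" table.
# File, column, and table names are illustrative placeholders.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Read raw records from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Clean and reshape records before loading."""
    return [
        (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
        for row in rows
        if row.get("amount")  # drop rows with no amount
    ]

def load(records: list[tuple], db_path: str = "warehouse.db") -> None:
    """Insert the transformed records into the target table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```

Real pipelines swap the CSV and SQLite pieces for APIs, databases, and a proper warehouse, but the extract/transform/load structure stays the same.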
Automating & monitoring data pipelines using:
- Cron jobs, or Apache Airflow (most recommended). A scheduled DAG sketch follows below.
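Below is a minimal scheduling sketch, assuming a recent Airflow 2.x install (where the schedule argument accepts a cron expression); the task bodies and DAG name are placeholders.

```python
# Minimal Airflow 2.x DAG sketch: run a pipeline every day at 02:00 using a
# cron expression. Task bodies are placeholders for real extract/load logic.
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    schedule="0 2 * * *",            # cron syntax: every day at 02:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["example"],
)
def daily_sales_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull data from a source system or API.
        return [{"order_id": "1", "amount": 10.0}]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write the rows to the warehouse.
        print(f"loaded {len(rows)} rows")

    load(extract())

daily_sales_pipeline()
```

Airflow's UI then gives you run history, retries, and alerting, which is the monitoring piece plain cron jobs lack.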
Building ELT Data Pipelines:
- Get started with data ingestion using Airbyte: https://docs.airbyte.com/using-airbyte/getting-started/
- Learn data transformation using dbt: https://docs.getdbt.com/guides/manual-install?step=1 (an orchestration sketch follows below).
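As an illustrative sketch (not the official setup) of how the ELT pieces fit together: assume Airbyte has already synced raw data into the warehouse, then an Airflow DAG runs dbt to transform it in place. The project path, schedule, and DAG id below are placeholders.

```python
# Illustrative ELT orchestration sketch: ingestion (EL) is handled by Airbyte
# outside this DAG; dbt then transforms the raw tables inside the warehouse (T).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elt_dbt_transform",
    schedule="0 3 * * *",            # run after the nightly Airbyte sync
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Transformation step: dbt compiles SQL models and runs them in the warehouse.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/my_project && dbt run",
    )

    # Validate the transformed models with dbt's built-in tests.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/my_project && dbt test",
    )

    dbt_run >> dbt_test
```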
Learn big data processing frameworks (optional for beginners):
- Apache Spark (for big data transformation & building streaming data pipelines). Get the book: Spark: The Definitive Guide. See the PySpark sketch after this list.
- Apache Kafka (for building large-scale streaming data pipelines). See the producer sketch after this list.
- Docker for containerizing your data pipeline.
- Git - For version control & remote collaboration.
- Kubernetes for data pipeline deployment.
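A minimal PySpark batch-transformation sketch, assuming pyspark is installed; the source/target paths and column names are placeholders for your own data.

```python
# Minimal PySpark batch transformation sketch: read raw CSV data, aggregate,
# and write the result as Parquet. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_aggregation").getOrCreate()

orders = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("s3a://raw-bucket/orders/*.csv")      # placeholder source path
)

daily_sales = (
    orders
    .groupBy("order_date")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
)

daily_sales.write.mode("overwrite").parquet("s3a://curated-bucket/daily_sales/")

spark.stop()
```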
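And a minimal streaming-ingestion sketch using the kafka-python client (one of several Kafka client libraries); the broker address, topic name, and events are placeholders.

```python
# Minimal Kafka producer sketch (kafka-python client): publish JSON events to a
# topic that downstream consumers (e.g. Spark Structured Streaming) can read.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one event per record.
for event in [{"order_id": "1", "amount": 10.0}, {"order_id": "2", "amount": 7.5}]:
    producer.send("orders", value=event)

producer.flush()   # block until all buffered messages are delivered
producer.close()
```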
Structured learning path: https://www.dataquest.io/path/data-engineer/
Follow this link to find data engineering projects to work on: https://www.ssp.sh/brain/open-source-data-engineering-projects/
Join the r/dataengineering community: https://www.reddit.com/r/dataengineering/
Follow & watch the videos on this Data Engineering Bootcamp:
https://github.com/DataTalksClub/data-engineering-zoomcamp