Created by: Kelvin Oyanna
Email: dotkelplus@gmail.com
LinkedIn: https://www.linkedin.com/in/oyannakelvin/
Twitter: @kelvinoyanna
Data engineering is a specialization in the data field concerned with building scalable data infrastructure and pipelines that aggregate data from multiple sources and consolidate it into an analytics data warehouse, supporting organization-wide analytics and reports used by data analysts, data scientists, and the BI team.
Python:
SQL:
Database/Data modeling:
- Get the book: The Data Warehouse Toolkit by Ralph Kimball and Margy Ross.
Cloud Infrastructure:
- Learn cloud fundamentals (Google Cloud or AWS)
- Practice your knowledge of cloud engineering with a cloud sandbox: https://kodekloud.com/
Building ETL pipelines using:
- Open-source tools (Python, SQL & Apache Airflow). See an example walkthrough here: https://www.startdataengineering.com/ (a minimal Python sketch follows below).
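To make the ETL idea concrete, here is a minimal, illustrative Python sketch using only the standard library and SQLite; the file name, column names, and table name are placeholders for your own sources.

```python
# Minimal ETL sketch: extract rows from a CSV file, apply a small
# transformation, and load them into a SQLite "warehouse" table.
# File, column, and table names are illustrative placeholders.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Read raw records from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Clean and reshape records before loading."""
    return [
        (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
        for row in rows
        if row.get("amount")  # drop rows with no amount
    ]

def load(records: list[tuple], db_path: str = "warehouse.db") -> None:
    """Insert the transformed records into the target table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```

Real pipelines swap the CSV and SQLite pieces for APIs, databases, and a proper warehouse, but the extract/transform/load structure stays the same.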
Automating & monitoring data pipelines using:
- Cron jobs, or Apache Airflow (most recommended). A scheduled DAG sketch follows below.
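Below is a minimal scheduling sketch, assuming a recent Airflow 2.x install (where the schedule argument accepts a cron expression); the task bodies and DAG name are placeholders.

```python
# Minimal Airflow 2.x DAG sketch: run a pipeline every day at 02:00 using a
# cron expression. Task bodies are placeholders for real extract/load logic.
from datetime import datetime

from airflow.decorators import dag, task

@dag(
    schedule="0 2 * * *",            # cron syntax: every day at 02:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["example"],
)
def daily_sales_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull data from a source system or API.
        return [{"order_id": "1", "amount": 10.0}]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write the rows to the warehouse.
        print(f"loaded {len(rows)} rows")

    load(extract())

daily_sales_pipeline()
```

Airflow's UI then gives you run history, retries, and alerting, which is the monitoring piece plain cron jobs lack.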
Building ELT Data Pipelines:
- Get started with data ingestion using Airbyte: https://docs.airbyte.com/using-airbyte/getting-started/
- Learn data transformation using dbt: https://docs.getdbt.com/guides/manual-install?step=1 (an orchestration sketch follows below).
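As an illustrative sketch (not the official setup) of how the ELT pieces fit together: assume Airbyte has already synced raw data into the warehouse, then an Airflow DAG runs dbt to transform it in place. The project path, schedule, and DAG id below are placeholders.

```python
# Illustrative ELT orchestration sketch: ingestion (EL) is handled by Airbyte
# outside this DAG; dbt then transforms the raw tables inside the warehouse (T).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elt_dbt_transform",
    schedule="0 3 * * *",            # run after the nightly Airbyte sync
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Transformation step: dbt compiles SQL models and runs them in the warehouse.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/my_project && dbt run",
    )

    # Validate the transformed models with dbt's built-in tests.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/my_project && dbt test",
    )

    dbt_run >> dbt_test
```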
Learn big data processing frameworks (optional for beginners):
- Apache Spark (for big data transformation & building streaming data pipelines). Get the book: Spark: The Definitive Guide. See the PySpark sketch after this list.
- Apache Kafka (for building large-scale streaming data pipelines). See the producer sketch after this list.
- Docker for containerizing your data pipeline.
- Git - For version control & remote collaboration.
- Kubernetes for data pipeline deployment.
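A minimal PySpark batch-transformation sketch, assuming pyspark is installed; the source/target paths and column names are placeholders for your own data.

```python
# Minimal PySpark batch transformation sketch: read raw CSV data, aggregate,
# and write the result as Parquet. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_aggregation").getOrCreate()

orders = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("s3a://raw-bucket/orders/*.csv")      # placeholder source path
)

daily_sales = (
    orders
    .groupBy("order_date")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
)

daily_sales.write.mode("overwrite").parquet("s3a://curated-bucket/daily_sales/")

spark.stop()
```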
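And a minimal streaming-ingestion sketch using the kafka-python client (one of several Kafka client libraries); the broker address, topic name, and events are placeholders.

```python
# Minimal Kafka producer sketch (kafka-python client): publish JSON events to a
# topic that downstream consumers (e.g. Spark Structured Streaming) can read.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one event per record.
for event in [{"order_id": "1", "amount": 10.0}, {"order_id": "2", "amount": 7.5}]:
    producer.send("orders", value=event)

producer.flush()   # block until all buffered messages are delivered
producer.close()
```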
Structured learning path: https://www.dataquest.io/path/data-engineer/
Follow this link to find data engineering projects to work on: https://www.ssp.sh/brain/open-source-data-engineering-projects/
Join the r/dataengineering community: https://www.reddit.com/r/dataengineering/
Follow & watch the videos on this Data Engineering Bootcamp:
https://github.com/DataTalksClub/data-engineering-zoomcamp