Data Engineer Resources

πŸ’‘ Hey there! Welcome to our Data Engineering Resources!

We've got you covered whether you're just starting out or you're already a pro.

In this guide, we'll take you through the basics of Data Engineering before moving on to more advanced topics. So whether you're a complete newbie or an expert looking to brush up on your skills, this guide is for you. Let's get started!


🏁 Let's start with the Basics! (Beginner level)


πŸš€ Introduction To Big Data Analysis

Data Engineers come into play when companies need to handle large amounts of data in a reproducible way, so that it can then be served to data analysts and scientists, who use it to understand past trends and make predictions.

But first of all, what is Big Data?

Let's check some Big Data Analysis Resources! πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog10/

When working with large amounts of data, it can become impossible to perform any form of analysis on a single machine. That's why using hardware accelerators (e.g. GPUs, IPUs, etc.) or parallelizing execution across a cluster of machines is fundamental.

Gpu Accelerated Data Analytics & Machine Learning πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog12/
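Before reaching for a cluster, the idea of parallel execution can be previewed on a single machine. Below is a minimal sketch (function names are made up for illustration) that uses Python's standard multiprocessing module to spread a CPU-bound function across several worker processes:

```python
from multiprocessing import Pool

def cpu_heavy(n):
    # Stand-in for an expensive per-record computation.
    return sum(i * i for i in range(n))

def process_in_parallel(workloads, workers=4):
    # Fan the workloads out across a pool of worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(cpu_heavy, workloads)

if __name__ == "__main__":
    results = process_in_parallel([100_000] * 8)
    print(len(results))  # 8 results, computed across 4 processes
```

The same map-over-a-pool shape is what cluster frameworks generalize to many machines.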

Once you have become familiar with the different kinds of acceleration techniques, you can create ad hoc benchmarks to evaluate which approach might be best for your own use case.

Benchmarking Machine Learning Execution Speed πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog33/
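As a starting point for such benchmarks, Python's built-in timeit module can time alternative implementations of the same task. A small sketch (the functions are illustrative, not taken from the article above):

```python
import timeit

def squares_loop(n):
    # Build the result with an explicit loop.
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def squares_comprehension(n):
    # Same result via a list comprehension.
    return [i * i for i in range(n)]

# Time each implementation over many repetitions and compare.
loop_t = timeit.timeit(lambda: squares_loop(10_000), number=200)
comp_t = timeit.timeit(lambda: squares_comprehension(10_000), number=200)
print(f"loop: {loop_t:.3f}s  comprehension: {comp_t:.3f}s")
```

The same pattern scales up to benchmarking CPU vs. GPU or single-machine vs. cluster runs: fix the workload, vary only the execution strategy.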

Good resources for gaining a bird's-eye view of the Data Engineering landscape are β€œFundamentals of Data Engineering” by Joe Reis and Matt Housley and the β€œData Engineering Zoomcamp” by Alexey Grigorev.

Fundamentals of Data Engineering πŸ‘‰πŸ» https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/

Data Engineering Zoomcamp (DataTalksClub) πŸ‘‰πŸ» https://github.com/DataTalksClub/data-engineering-zoomcamp


🐍 Python

Nowadays, Python is indubitably one of the most ubiquitous programming languages for working with data. Therefore it is vital to have some familiarity with it in order to work in the data space:

Learn Python - Free Interactive Python Tutorial πŸ‘‰πŸ» https://www.learnpython.org/

Once you've learned the basics, you can start focusing on how to write efficient Python code:

Efficient Python Tricks and Tools for Data Scientists β€” Effective Python for Data Scientists πŸ‘‰πŸ» https://khuyentran1401.github.io/Efficient_Python_tricks_and_tools_for_data_scientists/README.html

In order to create reproducible code and avoid library versioning issues, it is then necessary to understand how environments can be created and managed in Python and when using Anaconda:

Python Environments Management πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/Python-Environments-Management/

Anaconda Management πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/Anaconda-Management/
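As a rough illustration of what these tools automate, Python's standard venv module can create an isolated environment programmatically (the stdlib equivalent of running `python -m venv` on the command line):

```python
import tempfile
import venv
from pathlib import Path

# Create an isolated interpreter environment in a throwaway directory.
env_dir = Path(tempfile.mkdtemp()) / "demo-env"
venv.create(env_dir, with_pip=False)  # with_pip=True also bootstraps pip

# The environment records which interpreter it was built from.
print((env_dir / "pyvenv.cfg").exists())  # True
```

Each project getting its own environment like this is what keeps one project's library versions from breaking another's.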


πŸ’Ύ SQL

As a Data Engineer, you might act as an interface between Database Engineers and the Data Analyst/Scientist teams, so a good knowledge of SQL for interacting with databases is expected.

SQL For Data Science πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog22/

If you are new to SQL, you can learn about all the basics here: SQLBolt - Learn SQL - Introduction to SQL πŸ‘‰πŸ» https://sqlbolt.com/

There are many different types of database systems (e.g. MySQL, PostgreSQL), each with its own specific dialect, so cheatsheets summarizing the key commands can be of great help when switching from one system to another.

PostgreSQL Commands πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/PostgreSQL-Commands/
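If you want to practise SQL without installing a server, Python ships with SQLite. The sketch below (toy table and data) runs a basic GROUP BY query entirely in memory:

```python
import sqlite3

# In-memory SQLite database: a zero-setup way to practise SQL from Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Ada", "UK"), ("Grace", "US"), ("Alan", "UK")],
)
rows = conn.execute(
    "SELECT country, COUNT(*) FROM users GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('UK', 2), ('US', 1)]
```

SQLite's dialect differs from PostgreSQL's or MySQL's in places, but SELECT/GROUP BY/JOIN fundamentals carry over directly.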

BONUS POINT:

If you want to dig deeper on how SQL works, why not create your own dialect?

Check Out GomorraSQLβ€Šβ€”β€ŠA Library To Write Queries In Neapolitan πŸ‘‰πŸ» https://betterprogramming.pub/check-out-gomorrasql-a-library-to-write-queries-in-neapolitan-3e85568dddb4


Now that you know how to use SQL to retrieve and manipulate data from databases, what if you are asked to design one or to create new content?

There are different database design practices and considerations to take into account.

You can learn more about them here:

Database design basics - Microsoft Support πŸ‘‰πŸ» https://support.microsoft.com/en-us/office/database-design-basics-eb2159cf-1e30-401a-8084-bd4f9c9ca1f5

Designing your database schemaβ€Šβ€”β€Šbest practices πŸ‘‰πŸ» https://towardsdatascience.com/designing-your-database-schema-best-practices-31843dc78a8d
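One core design principle from those guides, normalization with foreign keys, can be sketched with SQLite (toy schema, illustrative names only): customer details live in exactly one table, and orders reference them instead of duplicating them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires enabling this explicitly

# Two normalised tables: orders reference customers through a foreign key.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")

try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 5.0)")  # unknown customer
except sqlite3.IntegrityError:
    print("foreign key enforced")
```

The database itself now rejects orders for customers that don't exist, which is exactly the kind of integrity guarantee good schema design buys you.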

Additionally, databases can be not just relational but also non-relational, which offers more flexibility for storing different kinds of data, although each approach comes with its own advantages and limitations.

Relational VS Non Relational Database πŸ‘‰πŸ» https://www.youtube.com/watch?v=E9AgJnsEvG4&ab_channel=IBMTechnology

Relational VS Nonrelational Databases – the Difference Between a SQL DB and a NoSQL DB πŸ‘‰πŸ» https://www.freecodecamp.org/news/relational-vs-nonrelational-databases-difference-between-sql-db-and-nosql-db/
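The trade-off can be sketched in a few lines: a relational table fixes the schema up front, while a document store (the model behind many NoSQL systems) lets each record carry its own fields. A toy illustration, not tied to any particular NoSQL product:

```python
import json
import sqlite3

# Relational: a fixed schema that every record must fit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

# Document-style: each record is a self-describing document,
# so fields can vary row by row with no schema migration.
documents = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace", "languages": ["COBOL"]},  # extra field, no migration
]
serialized = [json.dumps(d) for d in documents]
```

The relational side gives you integrity and rich joins; the document side gives you flexibility, at the cost of pushing consistency checks into your application code.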


⚑️ Apache Spark

Apache Spark is one of the most popular frameworks for working with big data and is therefore used by a large number of big organizations.

Apache Spark supports several programming interfaces, including Python, SQL, Java, and Scala.

Getting Started With Apache Spark πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog73/

Apache Spark Optimization Techniques πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog74/

For a complete guide, Databricks also offers the full β€œLearning Spark” book for free πŸ‘‰πŸ» https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
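Conceptually, much of Spark boils down to distributing a map β†’ shuffle β†’ reduce pipeline across machines. The toy word count below shows the same shape in plain single-machine Python (a teaching sketch, not PySpark code):

```python
from functools import reduce

lines = ["spark makes big data simple", "big data big results"]

# map: emit a (word, 1) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group the pairs by key.
grouped = {}
for word, count in mapped:
    grouped.setdefault(word, []).append(count)

# reduce: combine each key's counts into a total.
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in grouped.items()}
print(counts["big"])  # 3
```

In Spark, the map and reduce steps run in parallel on different workers and the shuffle moves data between them over the network; the logical structure stays the same.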

☁️ Cloud Computing

As more companies move from on-premises resources to the cloud, it has become increasingly important to learn about Cloud Computing.

Cloud Foundations For Data Scientists πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog51/

If you are interested in learning more about Docker and Kubernetes, cheatsheets are available below:

Docker Commands πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/Docker-Commands/

Kubernetes Commands πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/Kubernetes-Commands/

While working with cloud-based systems, there are different paradigms for storing data, such as Databases, Data Warehouses, and Data Lakes. You can learn more about them in the following article:

Data Stack for Machine Learning - Made With ML πŸ‘‰πŸ» https://madewithml.com/courses/mlops/data-stack/

The three most popular cloud computing platforms are Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP).

Each of these providers offers a wide range of free courses and documentation to get you started with your learning. Some examples are:

  1. Microsoft Learn for Azure πŸ‘‰πŸ» https://learn.microsoft.com/en-us/training/azure/
  2. AWS Training πŸ‘‰πŸ» https://www.aws.training/
  3. Google Cloud Skills Boost πŸ‘‰πŸ» https://www.cloudskillsboost.google/

πŸš€ Let's increase the difficulty (Medium level)

Batch vs Streaming Pipelines

Depending on the project you are working on, new data might either arrive constantly (stream) or arrive in condensed chunks at regular intervals (batch). It is important to understand the difference between these two scenarios and how they can affect the design of your data pipeline.

What is a data pipeline | IBM πŸ‘‰πŸ» https://www.ibm.com/topics/data-pipeline

Intro To Batch Vs Stream Processing - With Examples πŸ‘‰πŸ» https://www.montecarlodata.com/blog-stream-vs-batch-processing/

Batch Processing vs Stream Processing πŸ‘‰πŸ» https://www.youtube.com/watch?v=A3Mvy8WMk04&ab_channel=TechPrimers
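The difference can be sketched in a few lines of Python: the batch function needs the whole dataset before it starts, while the streaming one emits results record by record (toy functions for illustration):

```python
def batch_process(records):
    # Batch: wait until the whole dataset is available, then process it in one go.
    return [r.upper() for r in records]

def stream_process(record_stream):
    # Streaming: handle each record the moment it arrives, without waiting.
    for record in record_stream:
        yield record.upper()

events = ["login", "click", "purchase"]
print(batch_process(events))
print(list(stream_process(iter(events))))  # same output, different timing model
```

Real streaming systems add delivery guarantees, windowing, and fault tolerance on top, but the core distinction is exactly this: process everything at once, or one record at a time.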

ETL vs ELT

Depending on whether you are working on a transactional or analytical workflow, among other considerations, you might have to decide whether to transform your data first (ETL) or load it into a staging area first (ELT). You can learn more about these two approaches in the following articles:

Data Engineering: Transactional vs Analytical Workloads πŸ‘‰πŸ» https://medium.com/@guxie/data-engineering-transactional-vs-analytical-workloads-ab1a03832b2c

ETL vs ELT: Key Differences, Side-by-Side Comparisons, & Use Cases πŸ‘‰πŸ» https://rivery.io/blog/etl-vs-elt/
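In miniature, the two approaches differ only in when the transform step runs (toy data and function names for illustration):

```python
raw = [{"name": " Ada ", "amount": "99.5"}, {"name": "Grace", "amount": "12.0"}]

def transform(rows):
    # Clean whitespace and cast amounts to numbers.
    return [{"name": r["name"].strip(), "amount": float(r["amount"])} for r in rows]

# ETL: transform in flight, load only the cleaned rows into the warehouse.
warehouse_etl = transform(raw)

# ELT: load the raw rows into a staging area first, then transform later
# using the warehouse's own compute.
staging = list(raw)
warehouse_elt = transform(staging)

print(warehouse_etl == warehouse_elt)  # True -- same result, different order of steps
```

In practice the choice matters because ELT keeps the raw data around for reprocessing, while ETL keeps the warehouse lean and only ever stores cleaned rows.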


Git & DevOps

As part of the role of a Data Engineer, it is fundamental to be able to create reproducible workflows.

To ensure this at the code repository level, Git is a fundamental tool. You can learn more about it in the following guide:

What is Git | Atlassian Git Tutorial πŸ‘‰πŸ» https://www.atlassian.com/git/tutorials/what-is-git

A summary of some of the key commands is available here:

Basic Git Workflow πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/Basic-Git-Workflow/

Taking this a step further, some of the same principles can also be applied to versioning and governing your own data:

Data Version Control Β· DVC πŸ‘‰πŸ» https://dvc.org/

Finally, DevOps can be used to align development and operations teams to improve the overall quality and delivery of your projects.

What is DevOps? | Atlassian πŸ‘‰πŸ» https://www.atlassian.com/devops


Testing

Another key aspect of software development is ensuring your solution really does what you expect and that there are no hidden bugs within a service or in the interactions between services. There are different testing techniques that can be used to identify these kinds of issues; some of the most popular ones are outlined here:

Software Development Testing πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/tips/Software-Development-Testing/

For a more in-depth guide on testing in Python, this tutorial can be of help:

Getting Started With Testing in Python – Real Python πŸ‘‰πŸ» https://realpython.com/python-testing/
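As a flavour of what such tests look like, here is a small unittest example built around a hypothetical deduplication helper (both the helper and the tests are illustrative):

```python
import unittest

def dedupe(records):
    # Drop duplicate records while preserving first-seen order.
    seen, out = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

class DedupeTests(unittest.TestCase):
    def test_removes_duplicates(self):
        self.assertEqual(dedupe([1, 2, 2, 3, 1]), [1, 2, 3])

    def test_empty_input(self):
        self.assertEqual(dedupe([]), [])

# Run the tests with an explicit runner (works as a script or interactively).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DedupeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a data pipeline, unit tests like these on your transformation functions are the cheapest way to catch regressions before they corrupt downstream tables.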

Feature Preparation / Management

Once a solid Data Engineering pipeline is designed, different end users might require specific columns (features) for their analysis, so it can be useful for Data Engineers to know how Data Scientists prepare their data to be fed into Machine Learning (ML) models.

Feature Engineering Techniques πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog30/

Feature Extraction Techniques πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog29/

Feature Selection Techniques πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog27/
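Two of the most common feature preparation steps, standardizing a numeric column and one-hot encoding a categorical one, can be sketched with the standard library alone (toy data for illustration):

```python
from statistics import mean, stdev

ages = [22, 35, 58, 41]

# Standardisation: rescale a numeric feature to mean 0 and unit variance.
mu, sigma = mean(ages), stdev(ages)
ages_scaled = [(a - mu) / sigma for a in ages]

# One-hot encoding: turn a categorical feature into binary columns.
countries = ["UK", "US", "UK"]
categories = sorted(set(countries))
one_hot = [[int(c == cat) for cat in categories] for c in countries]
print(one_hot)  # [[1, 0], [0, 1], [1, 0]]
```

Libraries like scikit-learn wrap these steps in reusable transformers, but the underlying arithmetic is exactly this.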

Once an ML model reaches the production stage, Data Engineers and Machine Learning Engineers have to design a way to reliably process new data as it arrives, in order to make predictions and retrain the ML model on a regular basis. Feature Stores are one of the most common approaches for creating reusable solutions for this scenario:

Getting Started With Feature Stores πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog67/


Data Visualization

To check the quality of our data pipelines and communicate with non-technical stakeholders, it can be quite useful for Data Engineers to be able to design easy-to-understand data visualizations that convey key insights about the data or the system architecture we are working with.

Interactive Data Visualization πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog11/

Interactive Dashboards For Data Science πŸ‘‰πŸ» https://pierpaolo28.github.io/blog/blog21/


🎱 On-the-job tools (Advanced level)

Platforms

Some of the most common data platforms which are used in the industry nowadays are:

1. Databricks
2. Snowflake
3. Palantir Foundry
4. Firebolt
5. Dremio
6. Native cloud technologies: AWS (Redshift, Glue), Azure (Synapse, Data Factory), GCP (e.g. BigQuery, Dataflow, Dataproc)

Becoming familiar with these kinds of tools can also make you more employable.


Libraries & Tools

Some additional examples of data-engineering-specific libraries and tools that are good to know are:

1. dbt
2. Great Expectations
3. Apache Airflow
4. Prefect
5. Dagster

A good resource for keeping up to date with the latest libraries and tools in the data space is the Seattle Data Guy newsletter. πŸ‘‰πŸ» https://seattledataguy.substack.com/

Untitled πŸ‘‰πŸ» https://s3-us-west-2.amazonaws.com/secure.notion-static.com/0ac4f834-343f-4b48-9a8a-6a6d0cd23c2c/Untitled.png

Source πŸ‘‰πŸ» https://lakefs.io/blog/the-state-of-data-engineering-2022/


Interview Preparation

Theoretical Questions: Data Engineering Interview Questions πŸ‘‰πŸ» https://github.com/OBenner/data-engineering-interview-questions/blob/master/content/full.mdc

SQL/Python Coding Questions:

StrataScratch πŸ‘‰πŸ» https://platform.stratascratch.com/coding?code_type=1

Seattle Data Guy πŸ‘‰πŸ» https://www.youtube.com/@SeattleDataGuy

Data Engineering Study Guide πŸ‘‰πŸ» https://docs.google.com/spreadsheets/d/1GOO4s1NcxCR8a44F0XnsErz5rYDxNbHAHznu4pJMRkw/edit#gid=0