Awesome-Analytics-Engineering

An awesome Analytics Engineering repository for learning the craft and applying it to real-world problems.


Welcome to the "Awesome Analytics Engineering" repository! This curated collection of learning resources is designed to provide you with a structured learning path to excel as an Analytics Engineer.

Whether you're a beginner looking to enter the field or an experienced professional seeking to enhance your skills, this repository will guide you through various levels of analytics engineering expertise.

Table of Contents

  1. Analytics Engineering
  2. SQL
  3. Python
  4. dbt (data build tool)
  5. Apache Airflow
  6. Cloud Computing Fundamentals
  7. Data Modeling and Warehousing
  8. Great Expectations
  9. Data Visualization
  10. Fivetran
  11. Prefect
  12. Airbyte
  13. Talend
  14. Debugging
  15. Testing
  16. Version Control Systems and Data Control Systems

1). Analytics Engineering

  • Analytics Engineering combines data engineering and analytics to transform raw data into actionable insights.

It involves designing and implementing data pipelines, data modeling, and building analytics solutions. Analytics Engineers collaborate with data scientists, analysts, and stakeholders to deliver data-driven solutions.

In this repository you will find hand-picked learning resources and a structured learning path to excel as an Analytics Engineer.

Resources:

  • Text Tutorial:

Introduction to Analytics Engineering by Advanced Analytics

Analytics Engineer: The Newest Data Career Role by Madison Schott

What Is An Analytics Engineer? Everything You Need to Know by DataCamp

Also check this blog by Learn Analytics Engineering

  • Video Tutorial:

2). SQL

  • SQL (Structured Query Language) is a fundamental skill for analytics engineers. Here, you will find resources that cover SQL basics, advanced query techniques, data manipulation, and optimization. Mastering SQL is crucial for efficiently extracting, transforming, and analyzing data from relational databases.
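
To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module and a made-up `orders` table; it shows the kind of aggregation query analytics engineers write every day (table name, columns, and values are invented for illustration):

```python
import sqlite3

# In-memory database with a tiny, made-up orders table for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'alice', 120.0), (2, 'bob', 80.0), (3, 'alice', 45.5);
""")

# A typical analytics query: order count and total spend per customer.
rows = conn.execute("""
    SELECT customer,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer
    ORDER BY total_spend DESC
""").fetchall()

for customer, order_count, total_spend in rows:
    print(customer, order_count, total_spend)
```

The same GROUP BY / aggregate pattern carries over directly to warehouse engines such as BigQuery, Snowflake, or Redshift.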

Resources:

3). Python

  • Python is a versatile programming language widely used in analytics engineering. This level provides learning materials and examples to develop your Python skills for data manipulation, analysis, and visualization. You'll explore libraries like Pandas, NumPy, and Matplotlib to handle data effectively.
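
As a small example (made-up sales data, assuming Pandas, NumPy, and Matplotlib are installed), the snippet below builds a DataFrame, rolls it up to monthly totals, and plots the result:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up daily sales data for illustration.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=90, freq="D"),
    "sales": np.random.default_rng(42).integers(50, 200, size=90),
})

# Typical manipulation: aggregate daily rows into monthly totals.
monthly = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()

# Quick visual check with Matplotlib.
monthly.plot(kind="bar", title="Monthly sales (sample data)")
plt.tight_layout()
plt.show()
```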

Resources:

  • Text Tutorial:

Python Documentation

Tutorial Net Python Tutorial

W3 School Python Tutorial

  • Video Tutorial:

4). dbt (data build tool)

  • dbt (data build tool) is a popular open-source tool for transforming and orchestrating data pipelines. In this section, you'll learn about dbt's features, how to define transformations, manage data models, and maintain a reliable data infrastructure.
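
As a rough sketch (not dbt's Python API): dbt models live as SQL SELECT statements under a project's models/ directory, and runs are usually triggered from the command line. The snippet below simply shells out to the standard `dbt run` and `dbt test` commands for a hypothetical project directory, assuming dbt-core is installed and on your PATH:

```python
import subprocess

# Hypothetical dbt project directory; adjust to your own project.
PROJECT_DIR = "my_dbt_project"

def run_dbt(*args: str) -> None:
    """Invoke the dbt CLI (assumes dbt-core is installed and on PATH)."""
    subprocess.run(["dbt", *args, "--project-dir", PROJECT_DIR], check=True)

# Build the models defined as SELECT statements under models/,
# then run the schema tests declared in the project's YAML files.
run_dbt("run")
run_dbt("test")
```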

5). Apache Airflow

  • Apache Airflow is a powerful workflow management platform. Here, you will discover resources to understand Airflow's concepts, how to define and schedule tasks, manage dependencies, and create robust data pipelines. You'll also learn about best practices for orchestrating data workflows.
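
For a concrete taste, here is a minimal DAG sketch (hypothetical task names, assuming Airflow 2.x) that defines two Python tasks and wires a dependency between them:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("clean and model the raw data")

with DAG(
    dag_id="example_daily_pipeline",   # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Dependency: transform runs only after extract succeeds.
    extract_task >> transform_task
```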

Resources:

6). Cloud Computing Fundamentals

  • Cloud computing has revolutionized the way data is stored and processed. This level introduces you to cloud computing fundamentals, including the concepts of Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). You'll explore popular cloud platforms and their offerings for analytics engineering.
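
As one small illustration of working with a cloud provider's SDK (assuming the boto3 library is installed and AWS credentials are already configured), the snippet below lists the S3 buckets visible to your account:

```python
import boto3

# Assumes boto3 is installed and AWS credentials are configured
# (e.g. via environment variables or ~/.aws/credentials).
s3 = boto3.client("s3")

# List the S3 buckets visible to these credentials.
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(bucket["Name"])
```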

Resources:


7). Data Modeling and Warehousing

  • Data modeling and warehousing are essential components of analytics engineering. In this section, you'll delve into the principles of data modeling, relational and dimensional modeling, and learn about data warehousing concepts and technologies. Understanding these concepts will enable you to design efficient and scalable data architectures.
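
To make dimensional modeling concrete, here is a minimal star-schema sketch in SQLite (all table and column names are made up): one fact table keyed to two dimension tables.

```python
import sqlite3

# Illustrative star schema: one fact table plus two dimensions
# (table and column names are invented for this example).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT,
        country       TEXT
    );

    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     TEXT,
        year      INTEGER
    );

    CREATE TABLE fact_orders (
        order_key    INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        order_amount REAL
    );
""")
print("star schema created")
```

Facts hold measurable events (here, order amounts), while dimensions hold the descriptive context used to slice them.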

Resources:


8). Great Expectations

  • Great Expectations is an open-source Python library that helps you define, document, and validate your data pipelines. With Great Expectations, you can set expectations about your data, such as its structure, type, and statistical properties. It allows you to check if your data meets these expectations, providing data quality assurance and helping to identify issues early in your pipeline.
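
To illustrate the idea without tying the example to a specific Great Expectations version, the sketch below expresses a few "expectations" as plain pandas checks on a made-up DataFrame; the library itself wraps this style of validation in a declarative API with documentation and reporting on top:

```python
import pandas as pd

# Made-up pipeline output to validate.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [120.0, 80.0, 45.5],
})

# Expectations about structure, nullability, and value ranges --
# the same kinds of checks Great Expectations lets you declare and document.
assert list(df.columns) == ["order_id", "amount"], "unexpected columns"
assert df["order_id"].notna().all(), "order_id should never be null"
assert df["amount"].between(0, 10_000).all(), "amount outside expected range"

print("all expectations passed")
```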

Resources:


9). Data Visualization

  • Data visualization is the process of presenting data in visual formats such as charts, graphs, and maps. It is a powerful tool for exploring and communicating insights from data. Effective data visualization allows you to identify patterns, trends, and outliers, enabling better decision-making.

There are numerous data visualization tools available that can help you create compelling visualizations. Some popular options include Tableau, Power BI, matplotlib, seaborn, and Plotly. These tools provide a wide range of features and customization options to create interactive and informative visualizations.

When working with data visualization, it's important to consider the target audience and the story you want to convey. Choosing appropriate visual representations, colors, and labels can significantly enhance the understanding and impact of your data visualizations.
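
As a tiny example (made-up revenue figures, assuming Matplotlib is installed), the snippet below builds a simple labeled line chart:

```python
import matplotlib.pyplot as plt

# Made-up monthly revenue figures for illustration.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12.1, 13.4, 12.9, 15.2, 16.8, 18.3]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue (sample data)")
ax.set_ylabel("Revenue ($k)")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```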

Resources:


10). Fivetran

  • Fivetran is a cloud-based data integration platform that simplifies the process of data ingestion from various sources into a data warehouse. It provides pre-built connectors for a wide range of data sources, including databases, cloud applications, event streams, and more.

With Fivetran, you can set up automated data pipelines to extract data from the source systems, transform it if necessary, and load it into your preferred data warehouse. It handles schema changes, incremental updates, and data type conversions, reducing the manual effort required for data integration.

Fivetran supports popular data warehouses such as Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure Synapse Analytics. By leveraging Fivetran's connectors and automated processes, you can streamline your data ingestion workflow and ensure that your analytics infrastructure stays up-to-date with the latest data.

Resources:

  • Text Tutorial:

Fivetran Documentation

What Is Fivetran and Why You Should Use It by Seattle Data Guy

A beginner’s guide to setting up ELT data pipelines by Fivetran Team

  • Video Tutorial:

Fivetran Tutorial

Udemy Fivetran Bootcamp: Zero to Mastery 2023

How to automate your data pipelines using Fivetran and dbt on Snowflake


11). Prefect

  • Prefect is an open-source workflow management system designed for building, scheduling, and monitoring data pipelines. It provides a Python-based infrastructure to define and execute complex workflows with dependencies, retries, and error handling.

With Prefect, you can create workflows as code, expressing the dependencies and relationships between tasks. It supports various task types, including Python functions, external commands, and API calls. You can also define triggers, schedule workflows to run at specific times or intervals, and monitor their execution through a web-based dashboard.

Prefect integrates with popular data engineering tools and frameworks, such as Apache Airflow, Dask, and Kubernetes. It offers features like distributed execution, fault tolerance, and dynamic task scaling, making it suitable for handling large-scale data processing pipelines.
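
Here is a minimal sketch of a Prefect flow (assuming Prefect 2.x; the task and flow names are hypothetical) with two tasks, a dependency between them, and a retry policy:

```python
from prefect import flow, task

@task(retries=2)  # Prefect retries this task up to twice on failure
def extract() -> list[int]:
    return [1, 2, 3]

@task
def transform(raw: list[int]) -> list[int]:
    return [x * 10 for x in raw]

@flow(name="example-etl")  # hypothetical flow name
def etl():
    raw = extract()
    clean = transform(raw)
    print(clean)

if __name__ == "__main__":
    etl()
```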

Resources:


12). Airbyte

  • Airbyte is an open-source data integration platform that helps you replicate and sync data from various sources to your data warehouse or data lake. It provides connectors for a wide range of data sources, including databases, cloud applications, APIs, and more.

With Airbyte, you can configure and orchestrate data pipelines to extract data from source systems, transform it if needed, and load it into your desired destination. It supports both batch and real-time data synchronization, allowing you to keep your analytics infrastructure up-to-date with the latest data changes.

Airbyte is designed to be extensible and scalable. You can create custom connectors or contribute to the growing list of community-maintained connectors. It also provides features like schema mapping and incremental syncs.

Resources:

13). Talend

  • Talend is an enterprise data integration platform that provides a comprehensive set of tools for building and managing data integration workflows. It offers a visual interface for designing data pipelines and supports a wide range of data sources and destinations. Talend includes features like data mapping, transformation, scheduling, and monitoring. It also provides support for data quality checks, data profiling, and data governance. Talend helps streamline data integration processes and enables organizations to implement complex data integration scenarios.

Resources:

14). Debugging

  • Debugging is the process of identifying and fixing errors, bugs, or issues in software or data pipelines. In the context of analytics engineering, debugging is crucial for ensuring the correctness and reliability of data transformations, ETL processes, and data analysis workflows.

When debugging data pipelines, it's important to have proper logging and monitoring mechanisms in place. This allows you to track the flow of data, identify potential bottlenecks, and capture relevant information for troubleshooting purposes. Tools like logging frameworks, observability platforms, and error tracking systems can assist in the debugging process.

Additionally, data profiling and exploration techniques can help identify data quality issues, inconsistencies, or unexpected patterns in your data. By analyzing the intermediate results of your data pipeline, you can pinpoint potential problems and apply appropriate fixes.
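
As a small sketch of the logging idea (the transformation function and records are made up), the snippet below logs each step and flags malformed rows instead of failing silently:

```python
import logging

# Basic logging setup so each pipeline step leaves a trace.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("pipeline")

def transform(records: list[dict]) -> list[dict]:
    logger.info("transform started with %d records", len(records))
    cleaned = []
    for record in records:
        try:
            record["amount"] = float(record["amount"])
            cleaned.append(record)
        except (KeyError, ValueError):
            # Log the bad record instead of dropping it silently.
            logger.warning("dropping malformed record: %r", record)
    logger.info("transform finished with %d clean records", len(cleaned))
    return cleaned

transform([{"amount": "10.5"}, {"amount": "oops"}, {}])
```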

Resources:

15). Testing

  • Testing is a critical aspect of software development and data engineering. It involves systematically verifying that your code, data transformations, and pipelines work as expected and produce accurate results. Testing helps identify bugs, prevent regressions, and ensure the reliability and quality of your data processes.

In data engineering, testing can involve various types of tests, such as unit tests, integration tests, and end-to-end tests. Unit tests focus on testing individual components or functions in isolation. Integration tests verify the interaction between different components, ensuring that they work together correctly. End-to-end tests validate the entire data pipeline, from data ingestion to the final output.

Tools like pytest, unittest, and Apache Beam's testing utilities can assist in writing and running tests for your data engineering code. It's important to establish a comprehensive testing strategy that covers different aspects of your data pipelines, including data validation, error handling, and edge cases.
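
Below is a minimal pytest-style sketch (the transformation function and test names are hypothetical) covering a happy path, an empty input, and an edge case:

```python
# test_transformations.py -- run with `pytest` (hypothetical example)
import pytest

def normalize_amounts(amounts: list[float]) -> list[float]:
    """Toy transformation under test: scale values into the 0-1 range."""
    if not amounts:
        return []
    peak = max(amounts)
    return [a / peak for a in amounts]

def test_normalize_amounts_scales_to_unit_range():
    result = normalize_amounts([10.0, 5.0, 20.0])
    assert max(result) == 1.0
    assert all(0 <= value <= 1 for value in result)

def test_normalize_amounts_handles_empty_input():
    assert normalize_amounts([]) == []

def test_normalize_amounts_single_value_edge_case():
    # Edge case: a single value normalizes to exactly 1.0.
    assert normalize_amounts([3.5]) == pytest.approx([1.0])
```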

Resources:

16). Version Control Systems and Data Control Systems

  • Version Control Systems (VCS) and Data Control Systems (DCS) are essential tools for managing code, configuration, and data assets in analytics engineering projects.

Version Control Systems, such as Git, provide a way to track changes, collaborate with others, and maintain a history of modifications to your codebase. They allow you to create branches, merge changes, and revert to previous versions if needed. VCS also facilitate team collaboration by enabling concurrent work on different features or bug fixes.

Data Control Systems, on the other hand, focus on managing data assets and ensuring data integrity. They provide mechanisms to track data lineage, enforce access controls, and maintain data versioning. DCS help you maintain a complete audit trail of data transformations and enable reproducibility of data processes.

Tools like DVC (Data Version Control) and Delta Lake combine the concepts of VCS and DCS, providing versioning capabilities for data assets. They enable data scientists and engineers to track changes to datasets, collaborate on data workflows, and ensure consistency across different stages of data processing.

By utilizing both Version Control Systems and Data Control Systems, you can establish robust governance practices, ensure reproducibility, and maintain a clear history of changes in your analytics projects.
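
As a rough sketch of how the two fit together (assuming git and dvc are installed, and using a hypothetical data/raw.csv file), the snippet below drives a typical "code in Git, data in DVC" workflow from Python:

```python
import subprocess

def sh(*cmd: str) -> None:
    """Run a shell command and fail loudly (assumes git and dvc are installed)."""
    subprocess.run(cmd, check=True)

# Typical workflow: Git versions the code, DVC versions the data file.
sh("git", "init")
sh("dvc", "init")

# Track a hypothetical dataset with DVC; this writes data/raw.csv.dvc,
# a small pointer file that Git can version instead of the raw data.
sh("dvc", "add", "data/raw.csv")
sh("git", "add", "data/raw.csv.dvc", "data/.gitignore")
sh("git", "commit", "-m", "Track raw dataset with DVC")
```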

Resources:


- Community

Airflow Community

dbt (Data Build Tool) Community

- Forums

- Conferences

- Podcasts

Inspired by the awesome list created by Vinta Chen, Staff Software Engineer at Perpetual Protocol.


Analytics Engineering Glossary by dbt Labs: https://docs.getdbt.com/glossary


Contributing

We welcome and appreciate contributions! You can find more information in the LuxDevHQ Code of Conduct and contribution guidelines.

⬆ Back To Top