/data_engineering_tools

A catalog of data engineering tools, comparing open source options with enterprise and cloud services

Data Engineering Tools

This is meant to serve as a reference for folks who want to try data engineering tools and frameworks without relying on cloud computing resources and services. Data engineering is a challenging field to get into because it is hard to get hands-on practice with data tooling before you have work experience in data. My hope is to give you the resources you need, starting with a list of tools that you can try on your own. The open source tools and their enterprise or cloud counterparts in each category work in basically the same way, so what you learn on one carries over to the other.

| Functionality | Description | Open Source | Enterprise/Cloud Service |
| --- | --- | --- | --- |
| Data ingestion | The process of importing data from external sources into a data storage system. | Apache Nifi, Apache Flume, Meltano, Airbyte | AWS Glue, Azure Data Factory, Meltano, Airbyte |
| Data transformation | The process of cleaning, normalizing, and converting data into a format suitable for analysis. | Apache Beam, Apache Spark, dbt | AWS Glue, Azure Databricks, dbt Enterprise |
| Data storage | The process of storing data in a structured or unstructured manner, often in a distributed system. | Apache Hadoop (HDFS), Apache Cassandra | AWS S3, Azure Blob Storage |
| Data processing | The process of executing computations on data, such as aggregations, filtering, and machine learning algorithms. | Apache Flink, Apache Spark | AWS EMR, Azure HDInsight |
| Data visualization | The process of creating visual representations of data, such as charts, graphs, and maps. | Apache Superset | Tableau, Google Data Studio |
| Data warehousing | The process of storing and organizing data in a central location for reporting and analysis. | Apache Hive, Apache Druid | AWS Redshift, Azure Synapse Analytics, Google BigQuery |
| Data governance | The process of managing and governing data within an organization, including security, privacy, and compliance. | Apache Ranger, Apache Atlas | AWS Glue, Azure Purview, Google Cloud Datapolicy Manager |
| Data lineage | The process of tracking the origin and movement of data within an organization. | Apache Atlas, Apache Falcon | AWS Glue, Azure Purview, Google Cloud Datapolicy Manager |
| Data quality | The process of ensuring that data is accurate, complete, and consistent. | Apache DataFu, Talend | Talend, Informatica, Google Cloud Data Quality Services |
| Data catalog | A system for organizing and storing metadata about data assets within an organization. | Apache Hive Metastore, Apache Atlas | AWS Glue, Azure Purview, Google Cloud Datapolicy Manager |
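
To get a feel for how a few of these categories fit together before installing any of the tools above, here is a minimal local sketch using only the Python standard library. It does not use any tool from the table. The sample data, the `orders` table, and all function names are made up for illustration, and in-memory SQLite stands in for a real storage or warehouse layer.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an external source system.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,3.25
3,alice,
"""


def ingest(text):
    """Data ingestion: read rows from an external source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Data transformation: normalize names and convert amounts to floats."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # drop rows with a missing amount
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(amount),
        })
    return cleaned


def store(rows, conn):
    """Data storage: write the cleaned rows to a table."""
    conn.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()


def check_quality(conn):
    """Data quality: confirm no stored row has an empty customer or null amount."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE customer = '' OR amount IS NULL"
    ).fetchone()[0]
    return bad == 0


def totals_by_customer(conn):
    """Data processing: aggregate the order amount per customer."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


def run_pipeline(text=RAW_CSV):
    """Run ingest -> transform -> store -> quality check -> aggregate."""
    conn = sqlite3.connect(":memory:")
    store(transform(ingest(text)), conn)
    ok = check_quality(conn)
    totals = totals_by_customer(conn)
    conn.close()
    return ok, totals


if __name__ == "__main__":
    print(run_pipeline())
```

Each function maps loosely to one row of the table, so you could swap a single step for a real tool (for example, replacing `transform` with a dbt model or `store` with a warehouse) while keeping the rest as is.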