A catalog of data engineering tools, comparing open source options with enterprise and cloud services
This is meant to serve as a reference for folks who want to try data engineering tools and frameworks without relying on cloud computing resources and services. Data engineering is a challenging field to get into because it is hard to get hands-on practice with data before you have work experience in data. My hope is to give you the resources you need, starting with a list of the tools that you can try on your own. The underlying concepts are basically the same across the open source and enterprise/cloud options!
Functionality | Description | Open Source | Enterprise/Cloud Service |
---|---|---|---|
Data ingestion | The process of importing data from external sources into a data storage system. | Apache NiFi, Apache Flume, Meltano, Airbyte | AWS Glue, Azure Data Factory, Meltano, Airbyte |
Data transformation | The process of cleaning, normalizing, and converting data into a format suitable for analysis. | Apache Beam, Apache Spark, dbt | AWS Glue, Azure Databricks, dbt Cloud |
Data storage | The process of storing data in a structured or unstructured manner, often in a distributed system. | Apache Hadoop (HDFS), Apache Cassandra | AWS S3, Azure Blob Storage |
Data processing | The process of executing computations on data, such as aggregations, filtering, and machine learning algorithms. | Apache Flink, Apache Spark | AWS EMR, Azure HDInsight |
Data visualization | The process of creating visual representations of data, such as charts, graphs, and maps. | Apache Superset | Tableau, Looker Studio (formerly Google Data Studio) |
Data warehousing | The process of storing and organizing data in a central location for reporting and analysis. | Apache Hive, Apache Druid | AWS Redshift, Azure Synapse Analytics, Google BigQuery |
Data governance | The process of managing and governing data within an organization, including security, privacy, and compliance. | Apache Ranger, Apache Atlas | AWS Glue, Microsoft Purview (formerly Azure Purview), Google Cloud Dataplex |
Data lineage | The process of tracking the origin and movement of data within an organization. | Apache Atlas, Apache Falcon | AWS Glue, Microsoft Purview (formerly Azure Purview), Google Cloud Dataplex |
Data quality | The process of ensuring that data is accurate, complete, and consistent. | Apache DataFu, Talend | Talend, Informatica, Google Cloud Dataplex |
Data catalog | A system for organizing and storing metadata about data assets within an organization. | Apache Hive Metastore, Apache Atlas | AWS Glue, Microsoft Purview (formerly Azure Purview), Google Cloud Dataplex |
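
To get a feel for how a few of these functions fit together before picking a tool, here is a minimal sketch of ingestion, transformation, a quality check, storage, and processing using only the Python standard library. It is not how any of the tools above work internally. The sample data, the `orders` table, and every function name are made up for illustration, and SQLite stands in for a real storage system.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for an external source.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def ingest(text):
    """Ingestion: read rows from a source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim and lowercase names, convert amounts to floats."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(amount) if amount else None,
        })
    return cleaned


def check_quality(rows):
    """Quality: separate complete rows from rows that are missing an amount."""
    valid = [r for r in rows if r["amount"] is not None]
    rejected = [r for r in rows if r["amount"] is None]
    return valid, rejected


def store(rows, conn):
    """Storage: write the valid rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()


def total_amount(conn):
    """Processing: run a simple aggregation over the stored data."""
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


def run_pipeline(conn, text=RAW_CSV):
    """Run ingestion, transformation, quality check, and storage in order."""
    valid, rejected = check_quality(transform(ingest(text)))
    store(valid, conn)
    return valid, rejected


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    kept, dropped = run_pipeline(connection)
    print(f"stored {len(kept)} rows, rejected {len(dropped)}")
    print(f"total amount: {total_amount(connection)}")
```

Each function maps to one row of the table, so you can swap a step for a real tool once you are comfortable with the idea, for example replacing `ingest` with an Airbyte or Meltano sync, or `transform` with a dbt model.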