/modern-data-tools

A list of tools frequently used in "the modern data stack". Mostly SaaS or open-source apps to work with data and get value from it.

List of Modern Data Tools

What is "the modern data stack", actually?

That's a tough one! There is no single, definitive modern data stack. However, there are certain ideas, tools, and dos and don'ts that can be combined into something that is both modern and a data stack. These tools tend to be easy to use, open, and cloud-native or even SaaS. Go Data Driven has a blog article that describes this pretty well. Are you now asking yourself, "So that's it, I'll just switch all my tools to the ones listed here and then I've got it?" Far from it! As always, it's all about the people ...

WORK IN PROGRESS:

Next things to add

  • more tools
  • maybe best practices / reference architectures?

DISCLAIMER

The author is in no way affiliated with the companies listed here and receives no compensation. The listing serves as an example and does not claim to be complete.

Contents

data-platforms

Answer the question: where to put the data?

  • Amazon S3: Typical cloud-based object storage that integrates with a multitude of different data lake tools. Offered by AWS.
  • Google Cloud Storage: Typical cloud-based object storage that integrates with a multitude of different data lake tools. Offered by Google.
  • Azure Data Lake Storage Gen2: Typical cloud-based object storage that integrates with a multitude of different data lake tools. Offered by Microsoft.
  • Delta Lake: An open-source storage layer built on top of Apache Parquet that brings high performance, ACID transactions, updates and deletes, and much more to your data lake. Compatible with Spark, Databricks, etc.

data-integration

Answer the question: how to get the data from the source systems into our platform?

  • Meltano: open-source data integration using the Singer specification.
  • Airbyte: open-source data integration with a rich UI, an API, and a CLI.
  • Fivetran: fully managed SaaS data integration tool with several prebuilt connectors.
  • Stitch Data: fully managed SaaS data integration tool with several prebuilt connectors and an SDK (Singer) to easily add more integrations.
  • Azure Data Factory: cloud-based data integration, transformation, and orchestration in Azure.
  • Kafka Connect: When working with Kafka, use Kafka Connect to integrate your data natively into your data platform.
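
Several of the tools above build on the Singer specification, in which an extractor (a "tap") writes JSON messages to standard output, one per line, and a loader (a "target") reads them. The sketch below emits the three message types (SCHEMA, RECORD, STATE) for a hypothetical `users` stream. The stream name, the fields, the bookmark layout, and the helper names are assumptions made for illustration and are not taken from any real connector.

```python
import json


def singer_messages(rows, stream="users"):
    """Yield Singer-style messages for a hypothetical stream (illustrative only)."""
    # SCHEMA describes the shape of the records that follow.
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    }
    # One RECORD message per extracted row.
    for row in rows:
        yield {"type": "RECORD", "stream": stream, "record": row}
    # STATE is a free-form bookmark; this layout is an assumption.
    last_id = max((row["id"] for row in rows), default=None)
    yield {"type": "STATE", "value": {"bookmarks": {stream: {"last_id": last_id}}}}


def to_lines(messages):
    """Serialize messages as JSON lines, the way a tap would print them."""
    return [json.dumps(message) for message in messages]


if __name__ == "__main__":
    example_rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    for line in to_lines(singer_messages(example_rows)):
        print(line)
```

Because the messages are plain JSON lines, any tap can be piped into any target, which is what makes the connectors interchangeable.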

data-transformation

Answer the question: how can we make this data consumable?

  • dbt: SQL-first data transformations in your data warehouse. The "T" in ELT. Extends SQL with Jinja templates for more powerful operations. Open-source or SaaS (offered by dbt Labs).
  • AWS Glue: Serverless Apache Spark for big data transformations. Enriched by AWS Glue tools and integrations.
  • Databricks: Managed Apache Spark for big data transformations in Python, Scala, Java, SQL, or R. Enriched by Databricks Utils, integrations, and a whole ecosystem of different tools (e.g. Delta Lake).
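
The "T" in ELT means that raw data is loaded first and transformed afterwards with SQL inside the warehouse. Here is a minimal sketch of that idea, using Python's built-in `sqlite3` module as a stand-in for a cloud warehouse. The table and column names (`raw_orders`, `orders_by_customer`) are hypothetical, and a tool like dbt would manage such a SELECT as a versioned, templated model rather than an inline string.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows, then transform them with SQL inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse

    # "L": load the raw data unchanged.
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount_cents INTEGER)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_rows)

    # "T": build a consumable table from the raw one, using only SQL.
    conn.execute(
        """
        CREATE TABLE orders_by_customer AS
        SELECT customer, SUM(amount_cents) / 100.0 AS total_amount
        FROM raw_orders
        GROUP BY customer
        """
    )

    result = conn.execute(
        "SELECT customer, total_amount FROM orders_by_customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_elt([("alice", 1250), ("bob", 300), ("alice", 750)]))
```

Keeping the transformation in SQL means it runs where the data already lives, so no rows have to leave the warehouse.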

data-orchestration

Answer the question: which data comes in from where and when?

  • Dagster: open-source data orchestration tool to build DAGs and run them. Provides a UI, a CLI, and an API.
  • Prefect: data orchestration SaaS tool with API, UI and CLI.
  • Airflow: open-source platform to orchestrate and monitor workflows with several integrations and a powerful UI.
  • Google Cloud Composer: Cloud-based, managed version of Airflow in GCP.
  • AWS Step Functions: Serverless orchestration tool for AWS services.
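
The core idea shared by these orchestrators is the DAG (directed acyclic graph): each task declares which tasks must finish before it, and the orchestrator derives a valid run order. A minimal sketch using only the standard library's `graphlib` (Python 3.9+) follows. The task names are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top of this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
# All task names are hypothetical.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_raw": {"extract_orders", "extract_customers"},
    "transform": {"load_raw"},
    "refresh_dashboard": {"transform"},
}


def run_pipeline(dag, run_task):
    """Run every task after its upstream tasks and return the execution order."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        executed.append(task)
    return executed


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda task: print(f"running {task}"))
```

If the dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`, which is the "acyclic" part of a DAG being enforced.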

data-analytics

Answer the question: how can we access the data?

  • Snowflake: data warehouse of the cloud era. Sits on top of your data platform and offers you a fully fledged data warehouse.
  • Google BigQuery: Cloud DWH from Google. Serverless and dynamically scalable to adjust to any workload.
  • Azure Synapse Analytics: Cloud DWH from Azure (SQL pools). Use it serverless (on top of the data lake) or with provisioned instances in the Azure cloud.
  • AWS Redshift: Cloud DWH from Amazon. Analyze your data in a relational SQL DWH that looks and feels like PostgreSQL, or directly in the data lake with Redshift Spectrum.
  • Dremio: SQL lakehouse solution. Build your DWH on top of your data lake to reduce data movement.
  • Databricks SQL Analytics: SQL lakehouse from Databricks, powered by the Delta Engine, which allows you to use your data lake just like a DWH.
  • Trino (formerly PrestoSQL): Open-source, SQL-based federated query engine to query files in your data lake as well as distributed databases.

data-visualization

Answer the question: what does the data tell us?

  • Tableau: Dashboards and BI from Tableau (Salesforce). Self-hosted or SaaS.
  • Looker: Google Cloud-based SaaS BI platform. Supports AWS deployment as well.
  • Power BI: Dashboards and BI from Microsoft. Self-hosted or SaaS in Azure.
  • ThoughtSpot: SaaS, search-driven BI platform that lets users create visualizations by asking natural-language questions.
  • Mode: BI platform that integrates self-service dashboards with more advanced analytics tools such as Python and R notebooks.
  • Lightdash: Open-source alternative to Looker (early development stage) with native dbt integration.

data-quality

Answer the question: can we trust that data?

  • Datafold: SaaS-based data observability platform, with features such as Data Diff to regression-test your ETL code.
  • Monte Carlo Data: SaaS-based data observability platform to detect data downtime and ensure data reliability.
  • Bigeye: ML-powered data quality analytics SaaS for the whole data platform. Connects to your DWH and your alerting solutions easily.
  • Great Expectations: A Python package to help you test and validate your data in the ETL process.
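
To make the idea of regression-testing ETL code with a data diff more concrete, here is a rough sketch in plain Python. It compares the output of a pipeline before and after a code change by primary key and reports which rows were added, removed, or changed. The function name, the `id` key column, and the row layout are assumptions for illustration; this is not how any of the listed products is implemented.

```python
def data_diff(before, after, key="id"):
    """Compare two lists of row dicts by key and summarize the differences."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present on only one side are additions or removals.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    # Keys present on both sides count as changed if any value differs.
    changed = sorted(
        k
        for k in before_by_key.keys() & after_by_key.keys()
        if before_by_key[k] != after_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    old_output = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
    new_output = [{"id": 2, "total": 25}, {"id": 3, "total": 30}]
    print(data_diff(old_output, new_output))
```

An empty diff after a refactoring is a cheap signal that the change did not alter the data; a non-empty one shows exactly which keys to inspect.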

data-explorability

Answer the question: what data do we actually have?

  • Atlan: A modern data catalog for the modern data stack. Integrate and expose all your data assets in a curated and centralized repository.
  • Tableau Data Management Add-On: Data catalog and lineage for Tableau Online or Tableau Server deployments.
  • dbt Docs: Catalog from the open-source data transformation tool dbt. Auto-generated from your dbt transformations and extensible via YAML.
  • Amundsen: open-source data catalog developed by Lyft and used by multiple other companies. Integrates with common data stores like Redshift, Snowflake, BigQuery, etc.

machine-learning

Answer the question: what can the data tell us about the future?

  • Amazon SageMaker: Build, train, deploy, and monitor machine learning models with this platform in AWS.
  • Databricks Managed MLflow: Managed version of MLflow, an open-source platform for the machine learning lifecycle, running on Databricks.
  • GCP Vertex AI: Managed MLOps platform in GCP.
  • Kubeflow: Kubernetes-based, framework-agnostic, cutting-edge machine learning platform.
  • TensorFlow Extended (TFX): Open-source framework to build production-ready ML pipelines with TensorFlow.

languages

Answer the question: how to do all this?

  • SQL: Database query language, the must-have for all data people.
  • Python: the best programming language on earth (author's opinion 😜).
  • Terraform: Infrastructure as code (IaC) for multiple cloud providers. Uses its own language (HCL).
  • Pulumi: IaC for multiple cloud providers. Uses mainstream languages like JavaScript and Python.
  • AWS CDK: IaC for AWS Cloud written in TypeScript with APIs for multiple programming languages.