
A curated list of useful data-related tools to keep an eye on.

License: Creative Commons Zero v1.0 Universal (CC0-1.0)

Awesome Data Tools

My take after several years of working in big data

Computers get faster every day. For simplicity's sake, try to build single-box applications that can serve multiple users. Avoid distributed apps unless you really need them.


  • SQLite: SQL database without the hassle. No external service is needed, since you are just using a library and a file. Good enough for most needs unless you have to scale beyond a single machine. See DHH's take

  • CockroachDB: distributed and resilient SQL database

  • Materialize: fast analytics on constantly changing data. Built on top of differential dataflow
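To illustrate the "just a library and a file" point about SQLite, here is a minimal sketch using Python's standard-library `sqlite3` module. The table name `tool_votes`, the helper functions, and the sample rows are hypothetical, chosen only for this example.

```python
import sqlite3
import tempfile
from pathlib import Path


def record_votes(db_path: Path, votes: list[tuple[str, int]]) -> None:
    """Create the table if needed and store (tool, votes) rows in one transaction."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS tool_votes ("
                "tool TEXT PRIMARY KEY, votes INTEGER NOT NULL)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO tool_votes (tool, votes) VALUES (?, ?)",
                votes,
            )
    finally:
        conn.close()


def top_tools(db_path: Path, limit: int) -> list[tuple[str, int]]:
    """Return the most-voted tools, highest first."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT tool, votes FROM tool_votes ORDER BY votes DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()


# The whole database is a single file; no server process is involved.
with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "tools.db"
    record_votes(db_file, [("DuckDB", 12), ("SQLite", 30), ("Trino", 5)])
    best = top_tools(db_file, limit=2)

print(best)
```

In a real application the file would live at a stable path rather than in a temporary directory; the temporary directory only keeps the sketch self-cleaning.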

Queues (streaming)

  • Redpanda: Kafka alternative

Key Value

  • ScyllaDB: Cassandra alternative

Object Store

  • SeaweedFS: open source distributed object storage with O(1) file access, tiered storage and an S3 API.

Time Series

  • InfluxDB: open source time series DB for metrics, events and real-time analytics.

  • ClickHouse: column-oriented DB for metrics, events and real-time analytics, queried with SQL.



  • Ibis: run on multiple data processing backends using the same API

Single node engines

  • DuckDB: query TBs of data with its bigger-than-memory capabilities (tweaking might be needed)

Distributed engines

  • Apache Spark: trusted, mature software, though you need a place to run it.

  • Differential dataflow: a different computation paradigm for real-time updates on data

  • DataFusion

  • Trino: distributed SQL engine (what AWS Athena is based on), though maintaining its clusters can be challenging

Graph engines


  • Apache Airflow: The good old cron with vitamins that just works

  • Astro CLI: easy Airflow dev environment on your local machine using Docker

  • Challenges: a bit rough on the deployment side. You can't deploy multiple code branches to the same instance, as other orchestrators allow

  • Mage: new contender

Graphical User Interfaces (Dashboards, BI…)

  • Evidence: open-source markdown-to-dashboard library

  • Taipy: open-source Python framework for turning data pipelines into web apps

  • Gradio: open-source Python web app library for ML and AI. Used by multiple companies: if you've tested a model on Hugging Face, you've used Gradio.

  • Streamlit: Python web app builder. You can deploy apps on their cloud for free.


  • Hugging Face: everything open-source AI: community, model repository, libraries, free GPUs…

  • LangChain: develop apps that use LLMs

  • LangFlow: build agents and RAG applications visually

  • Gumloop: automate your tasks with LLMs, with lots of connectors (Gmail, Google Calendar, Drive…)


  • Poetry: for dependency management

  • Ruff: fast linter and code formatter

  • Pydantic: better dataclasses with schema validation

  • FastAPI: best API library for Python, with automated docs and schema validation

  • Typer: for CLI applications

  • OpenTelemetry: standard observability framework

  • Logfire: observability platform built by the Pydantic team
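As a rough illustration of what "dataclasses with schema validation" means for Pydantic, the sketch below declares a small model and shows both a successful coercion and a rejected value. The `ToolEntry` model and its fields are hypothetical; the example assumes `pydantic` is installed and relies only on `BaseModel` and `ValidationError`.

```python
from pydantic import BaseModel, ValidationError


class ToolEntry(BaseModel):
    """Hypothetical record describing one entry of a tool list."""

    name: str
    category: str
    stars: int = 0


# In Pydantic's default (lax) mode a numeric string is coerced to the declared int type.
entry = ToolEntry(name="ruff", category="linting", stars="1200")

# A value that cannot be interpreted as an int is rejected with a ValidationError.
error_count = 0
try:
    ToolEntry(name="ruff", category="linting", stars="lots")
except ValidationError as exc:
    error_count = len(exc.errors())

print(entry.stars, error_count)
```

FastAPI builds on the same kind of model declarations for request and response validation, which is where its automated docs and schema validation come from.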

Python test setup

  • pytest

  • pytest-cov: coverage reporting, with the minimum coverage threshold set at 80%

  • hypothesis: for property-based testing (generates test data) + hypothesis-auto

  • pytest-watch: to continuously run tests in the background

  • Memray (memory profiler) + pytest-memray to set limits on memory consumed in individual tests
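To make "property-based testing (generates test data)" concrete, here is a minimal sketch of a Hypothesis test, assuming `hypothesis` is installed. The run-length encoding helpers `rle_encode` and `rle_decode` are hypothetical functions written only for this example; the property under test is that decoding an encoded string returns the original.

```python
from itertools import groupby

from hypothesis import given, strategies as st


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, run_length) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


@given(st.text())
def test_round_trip(text: str) -> None:
    # Hypothesis generates many strings, including empty and unusual Unicode ones.
    assert rle_decode(rle_encode(text)) == text
```

pytest collects `test_round_trip` like any other test. Instead of hand-picked inputs, Hypothesis supplies the arguments and, on failure, reports a minimized counterexample.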