License: Creative Commons Zero v1.0 Universal (CC0-1.0)

Awesome Data Tools

A curated list of useful data-related tools to keep an eye on.

ℹ️
My take after several years of working in big data

Computers get faster every day. For simplicity’s sake, try to build single-box applications that can serve multiple users. Avoid distributed apps unless they are really needed.

Databases

  • SQLite: SQL database without the hassle. No external service is needed, since you are just using a library and a file. Good enough for most needs unless you have to scale beyond a single machine. See DHH’s take

  • CockroachDB: distributed and resilient SQL database

  • Materialize: fast analytics on constantly changing data. Built on top of Differential Dataflow
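
As a minimal sketch of the "just a library and a file" point in the SQLite entry above, the snippet below uses Python's standard-library `sqlite3` module. The table name `events`, the file name `demo.db`, and the helper `count_events` are made up for illustration. Each call opens the same file, adds one row, and returns the row count, which shows that the data persists without any server process.

```python
import sqlite3
import tempfile
from pathlib import Path


def count_events(db_path: Path) -> int:
    """Open (or create) the database file, record one event, return the total."""
    # sqlite3.connect creates the file if it does not exist; no server is involved.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS events "
                "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
            )
            conn.execute("INSERT INTO events (name) VALUES (?)", ("page_view",))
        (total,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()
        return total
    finally:
        conn.close()


# Hypothetical file location, used only for this illustration.
demo_db = Path(tempfile.mkdtemp()) / "demo.db"
first = count_events(demo_db)   # creates the file and the table
second = count_events(demo_db)  # reopens the same file; the first row is still there
```

The whole database is the single `demo.db` file, so backing it up or moving it to another machine is a file copy.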

Queues (streaming)

  • Redpanda: Kafka alternative

Key Value

  • ScyllaDB: Cassandra alternative

Object Store

  • SeaweedFS: open-source distributed object storage with O(1) file access, tiered storage, and an S3 API.

Time Series

  • InfluxDB: open-source time series DB for metrics, events, and real-time analytics.

  • ClickHouse: column-oriented DB for metrics, events, and real-time analytics, queried with SQL.

Graph

Processing

  • Ibis: run the same code against multiple data processing backends through a single API

Single node engines

  • DuckDB: query TBs of data with its bigger-than-memory capabilities (tweaking might be needed)

Distributed engines

  • Apache Spark: trusted, mature software, though you need a place to run it.

  • Differential Dataflow: a different computation paradigm for real-time updates on data

  • DataFusion

  • Trino: distributed SQL engine (what AWS Athena is based on), though maintaining its clusters can be challenging

Graph engines

Orchestration

  • Apache Airflow: the good old cron with vitamins that just works

  • Astro CLI: easy Airflow dev environment on your local machine using Docker

  • Challenges: a bit rough on the deployment side. You can’t deploy multiple code branches to the same instance, as other orchestrators allow

  • Mage: new contender

Graphical User Interfaces (Dashboards, BI…)

  • Evidence: open-source markdown-to-dashboard library

  • Taipy: open-source Python framework for turning data pipelines into web apps

  • Gradio: open-source Python web app framework for ML and AI. Used by multiple companies: if you’ve tested a model on Hugging Face, you’ve used Gradio.

  • Streamlit: Python web app builder. You can deploy apps on their cloud for free.

AI

  • Hugging Face: open-source AI everything (community, model repository, libraries, free GPUs…)

  • LangChain: develop apps that use LLMs

  • LangFlow: build agents and RAG applications in a visual way

  • Gumloop: automate your tasks with LLMs, with lots of connectors (Gmail, Google Calendar, Drive…)

Tools

  • poetry: for dependency management

  • ruff: fast linter and code formatter

  • Pydantic: better dataclasses with schema validation

  • FastAPI: best API library for Python, with automated docs and schema validation

  • Typer: for CLI applications

  • OpenTelemetry: standard observability library

  • Logfire: observability platform built by the Pydantic team

Python test setup

  • pytest

  • pytest-cov, with the minimum coverage threshold set at 80%

  • hypothesis for property-based testing (generates test data) + hypothesis-auto

  • pytest-watch to continuously run tests in the background

  • Memray (memory profiler) + pytest-memray to set limits on memory consumed in individual tests