A curated list of useful data-related tools to keep an eye on.
ℹ️ My take after several years of working in big data: computers get faster every day. So, for simplicity’s sake, try to develop single-box applications that can serve multiple users. Avoid distributed apps unless really needed.
-
SQLite: SQL database without the hassle. No external service is needed, since you are just using a library and a file. Good enough for most of your needs unless you have to scale beyond a single machine. See DHH’s take
-
CockroachDB: distributed and resilient SQL database
-
Materialize: fast analytics on constantly changing data. Built on top of differential dataflow.
-
ManticoreSearch: Elasticsearch alternative
-
SeaweedFS: open-source distributed object storage with O(1) file access, tiered storage and an S3 API.
-
InfluxDB: open-source time series DB for metrics, events and real-time analytics.
-
ClickHouse: column-oriented DB for metrics, events and real-time analytics, queried with SQL.
-
Neo4j: the classic graph DB
-
Memgraph: fast graph DB?
-
Ibis: run on multiple data processing backends using the same API
-
DuckDB: query TBs of data with its bigger-than-memory capabilities (tweaking might be needed)
-
Apache Spark: trusted old software, though you need a place to run it.
-
Differential dataflow: a different computation paradigm for real-time updates on data
-
Trino: distributed SQL engine (what AWS Athena is based on), though maintaining its clusters can be challenging.
-
NetworkX
-
Experimental: GraphSurge, built on top of differential dataflow
-
Apache Airflow: The good old cron with vitamins that just works
-
Astro CLI: easy Airflow dev environment on your local machine using Docker
-
Challenges: a bit on the deployment side. You can’t deploy multiple code branches to the same instance, as other tools allow you to.
-
Mage: new contender
-
Evidence: open-source markdown-to-dashboard library
-
Taipy: open-source Python framework for turning data pipelines into web apps
-
Gradio: open-source Python web app library for ML and AI. Used by multiple companies: if you’ve tested a model on Hugging Face, you’ve used Gradio.
-
Streamlit: Python web app builder. You can deploy apps on their cloud for free.
-
Hugging Face: everything AI and open source, including community, model repository, libraries, free GPUs…
-
LangChain: develop apps that use LLMs
-
LangFlow: build agents and RAG applications in a visual way
-
Gumloop: automate your tasks with LLMs, with lots of connectors (Gmail, Google Calendar, Drive…)
-
poetry: for dependency management
-
ruff: fast linter and code formatter
-
Pydantic: better dataclasses with schema validation
-
FastAPI: best API library for Python, with automated docs and schema validation
-
Typer: for CLI applications
-
OpenTelemetry: standard observability library
-
logfire: observability platform built by the Pydantic team
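
To make the single-box advice at the top of this list a bit more concrete, here is a minimal sketch of the "just a library and a file" idea behind SQLite, using Python's standard-library `sqlite3` module. The file name, the `tools` table and the `add_tool` helper are made up for illustration; nothing here comes from a specific project.

```python
import os
import sqlite3
import tempfile

# Hypothetical location: the whole database lives in this one file.
db_path = os.path.join(tempfile.mkdtemp(), "tools.db")


def add_tool(conn, name, category):
    """Insert one row into the (hypothetical) tools table."""
    conn.execute(
        "INSERT INTO tools (name, category) VALUES (?, ?)",
        (name, category),
    )


# No server to start: connecting creates the file if it does not exist.
conn = sqlite3.connect(db_path)

with conn:  # commits on success, rolls back if an exception is raised
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tools (name TEXT PRIMARY KEY, category TEXT)"
    )
    add_tool(conn, "DuckDB", "processing")
    add_tool(conn, "ClickHouse", "database")

# Plain SQL with bound parameters, straight from the application process.
rows = conn.execute(
    "SELECT name FROM tools WHERE category = ? ORDER BY name",
    ("database",),
).fetchall()
print(rows)

conn.close()
```

Backing up or moving this database is a file copy, which is a large part of why a single-box setup stays simple.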