application-research/estuary

Retrieval Event DB

jcace opened this issue · 2 comments

jcace commented

We need a database to store the stream of retrieval event data from which we compute reputation scores. We need to define (1) the schema for the DB, and (2) the underlying database technology to use.

Bedrock has defined a great schema that we can base this on: https://www.notion.so/Retrieval-Reputation-Schema-edcf4e8b89674343a45f62215c6e6ea9

Database Technologies

First option - Pando

Pando is a custom database solution designed for Filecoin reputation data. Bedrock is planning to use it to store retrieval statistics data which will be used to compute reputation.

Explore how we can integrate Autoretrieve stats into Pando. Investigate what it looks like to push data in and pull data out.

Second option - Event DB built on top of Postgres

Third Option - Timeseries DB

TimescaleDB (Postgres) https://github.com/timescale/timescaledb
InfluxDB (open-source time-series DB) https://github.com/influxdata/influxdb

jcace commented

Just discovered we already have a database called estuary-metrics. This could serve as a nice place to store all these raw metrics.

I think it might make sense to tweak the schema of estuary-metrics: remove retrieval_success_records and retrieval_failure_records, and instead combine them into a single retrieval_events table. This new table would look mostly like retrieval_success_records, plus a failed flag to capture the failure events.
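
To make the idea concrete, here is a rough sketch of what a combined retrieval_events table could look like, along with a per-provider aggregate over a time window. Only the table name and the failed flag come from the proposal above; every other column name (sp, cid, occurred_at, error_message), the retrieval_counts helper, and the sample rows are assumptions for illustration. It uses an in-memory SQLite database purely so the example is self-contained; it is not the actual estuary-metrics schema or database.

```python
import sqlite3

# Hypothetical schema: only the table name and the `failed` flag come from
# the proposal; all other column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE retrieval_events (
    id            INTEGER PRIMARY KEY,
    sp            TEXT    NOT NULL,           -- storage provider ID
    cid           TEXT    NOT NULL,           -- content that was requested
    occurred_at   INTEGER NOT NULL,           -- Unix timestamp in seconds
    failed        INTEGER NOT NULL DEFAULT 0, -- 0 = success, 1 = failure
    error_message TEXT                        -- set only for failures
);
"""


def retrieval_counts(conn, start, end):
    """Return {sp: (successes, failures)} for start <= occurred_at < end."""
    rows = conn.execute(
        """
        SELECT sp,
               SUM(CASE WHEN failed = 0 THEN 1 ELSE 0 END),
               SUM(CASE WHEN failed = 1 THEN 1 ELSE 0 END)
        FROM retrieval_events
        WHERE occurred_at >= ? AND occurred_at < ?
        GROUP BY sp
        """,
        (start, end),
    ).fetchall()
    return {sp: (ok, bad) for sp, ok, bad in rows}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO retrieval_events (sp, cid, occurred_at, failed, error_message) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("f01234", "cid-a", 100, 0, None),
        ("f01234", "cid-b", 150, 1, "timeout"),
        ("f05678", "cid-a", 160, 0, None),
        ("f01234", "cid-c", 999, 0, None),  # falls outside the window below
    ],
)

# One query yields both success and failure counts per provider.
counts = retrieval_counts(conn, 0, 200)
```

With a single table, both counts come back from one GROUP BY instead of two queries that have to be joined on the provider afterwards.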

Since we need both success/failure counts in our reputation calculation, I think this would make it quite ergonomic for us. We could query it once (aggregate by matching sp, in a given timestamp window),