Retrieval Event DB
jcace opened this issue · 2 comments
We need a database to store the stream of retrieval event data from which we compute reputation scores. We need to define (1) the schema for the DB, and (2) the underlying database technology to use.
Bedrock has defined a great schema that we can base this off: https://www.notion.so/Retrieval-Reputation-Schema-edcf4e8b89674343a45f62215c6e6ea9
Database Technologies
First option - Pando
Pando is a custom database solution designed for Filecoin reputation data. Bedrock is planning to use it to store retrieval statistics data which will be used to compute reputation.
Explore how we can integrate Autoretrieve stats into Pando. Investigate what it looks like to push data in / pull data out
Second option - Event DB built on top of Postgres
- prefer an existing stats db like https://github.com/application-research/estuary-metrics.
Third option - Timeseries DB
TimescaleDB (Postgres) https://github.com/timescale/timescaledb
InfluxDB (open-source time-series DB) https://github.com/influxdata/influxdb
Just discovered we already have a database called estuary-metrics. This could serve as a nice place to store all these raw metrics:
I think it might make sense to tweak the schema of estuary-metrics: remove retrieval_success_records and retrieval_failure_records, and instead combine them into a single retrieval_events table. This new table would look mostly like retrieval_success_records, with a failed flag to capture the failure events.
Since we need both success and failure counts in our reputation calculation, I think this would make it quite ergonomic for us. We could query it once (aggregating by SP within a given timestamp window) and get both counts.
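To make the single-query idea concrete, here is a rough sketch using an in-memory SQLite database as a stand-in for Postgres. The column names (sp, cid, failed, created_at) and the helper counts_by_sp are assumptions for illustration only; they are not taken from the actual estuary-metrics schema or the Bedrock schema.

```python
import sqlite3

# Hypothetical combined table: column names are illustrative only,
# not copied from estuary-metrics.
SCHEMA = """
CREATE TABLE retrieval_events (
    id         INTEGER PRIMARY KEY,
    sp         TEXT    NOT NULL,            -- storage provider ID
    cid        TEXT    NOT NULL,            -- content that was requested
    failed     INTEGER NOT NULL DEFAULT 0,  -- 0 = success, 1 = failure
    created_at INTEGER NOT NULL             -- Unix timestamp (seconds)
);
"""


def counts_by_sp(conn, start, end):
    """Return {sp: (successes, failures)} for events in [start, end)."""
    rows = conn.execute(
        """
        SELECT sp,
               SUM(CASE WHEN failed = 0 THEN 1 ELSE 0 END) AS successes,
               SUM(CASE WHEN failed = 1 THEN 1 ELSE 0 END) AS failures
        FROM retrieval_events
        WHERE created_at >= ? AND created_at < ?
        GROUP BY sp
        ORDER BY sp
        """,
        (start, end),
    ).fetchall()
    return {sp: (successes, failures) for sp, successes, failures in rows}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO retrieval_events (sp, cid, failed, created_at) VALUES (?, ?, ?, ?)",
    [
        ("f01000", "cid-a", 0, 100),
        ("f01000", "cid-b", 1, 150),
        ("f02000", "cid-c", 0, 120),
        ("f01000", "cid-d", 0, 999),  # falls outside the window queried below
    ],
)

result = counts_by_sp(conn, 100, 200)
print(result)
```

The same GROUP BY query should carry over to Postgres largely unchanged, with the failed flag as a boolean and created_at as a timestamp column.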
- also look at https://github.com/filecoin-project/cidtravel approach