datopian/data-api

[epic] Support for large queries from data API

Opened this issue · 1 comments

When querying the data API I want to be able to make queries and get results with 100k or 1m+ results and download them so that I can extract the data I want even if larger

Acceptance

  • Design the solution
    • Consider authorization considerations
  • Write to storage approach
    • Choose storage backend
    • Stream to storage from data API (or trigger background job)
    • Return download URL
  • Setup switch from "streaming" to write to storage

Analysis

There are 2 approaches:

  1. Stream the whole result
  2. Extract the query results to storage and give storage url to the user

One can also have hybrid e.g. do the former up to some number of results and then switch to 2.

There are several advantages of option 2:

  • You have a natural cache structure on disk so that the same query may not need to be recomputed (you can expire exported results after some time period)
  • If your download/stream is interrupted you can resume it from storage (rather than re-running the query)
  • You give the end user the option to share the file for a certain period of time with other users

The disadvantages (at least the small data sizes):

  • It's more complex / more work on backend
  • Slower (greater latency)
  • More complex for user: they have 2 steps where there was one (get query result, extract download url and download)

@rufuspollock
The current implementation Streams the results, we are testing the performance in some testing (but distributed) environments.

Tests wen't right and now it seems that any extra optimization needs to be in the Hasura and Postgres queries and views.

For further improvements, the current issue is the next step, but due to time limitations we'll not start with this for the moment.