edgi-govdata-archiving/web-monitoring-processing

DB API needs methods for iterating over paginated results

Mr0grog opened this issue · 0 comments

The DB API currently exposes methods like list_pages() and list_versions() that return paginated chunks of results. We’ve long known it’s a pain to iterate through all the chunks, especially when retry support is needed (and it usually is). See web-monitoring-task-sheets for an example. Waaaaay too much work.

API-wise, we should either modify the list_*() methods or add new ones that are generators: they would iterate over the resulting pages/versions/changes/annotations from all chunks (not just the first), rather than returning an object with metadata and links alongside the list for the current chunk.

We should be able to replace code like:

from datetime import datetime
from web_monitoring import db

def list_all_pages():
    client = db.Client.from_env()
    chunk = 1
    while chunk > 0:
        pages = client.list_pages(sort=['created_at:asc'], chunk_size=1000,
                                  chunk=chunk, url='some_pattern', active=True,
                                  start_date=datetime(...),
                                  end_date=datetime(...),
                                  include_earliest=True)
        yield from pages['data']
        # Advance to the next chunk, or stop when there isn't one.
        chunk = chunk + 1 if pages['links']['next'] else -1

With:

def list_all_pages():
    client = db.Client.from_env()
    # list_pages() is now a generator over every result in every chunk,
    # so there's no manual chunk bookkeeping.
    yield from client.list_pages(sort=['created_at:asc'], chunk_size=1000,
                                 url='some_pattern', active=True,
                                 start_date=datetime(...),
                                 end_date=datetime(...),
                                 include_earliest=True)

One important difference from the above, however, is that iteration should proceed by following the chunk['links']['next'] URL, rather than the naive approach above of adding 1 to the chunk number whenever chunk['links']['next'] is present (i.e. assuming those two methods get you the same thing). They are effectively the same right now, but there are open issues on the database about exploring other approaches to pagination, and we need to follow the actual next URL if we want to be forward-compatible with those kinds of changes.
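
For illustration, here's a minimal sketch of what the generator's internals might look like. The _request_json() helper (a GET with retries that parses the JSON body) and the '/api/v0/pages' path are assumptions for the sake of the example, not the client's actual internals:

def list_pages(self, **params):
    # Hypothetical helper: GET a URL (with retries) and parse the JSON body.
    response = self._request_json('/api/v0/pages', params=params)
    while True:
        yield from response['data']
        next_url = response['links'].get('next')
        if next_url is None:
            return
        # Follow the server-provided URL rather than computing chunk numbers,
        # so this keeps working if the API's pagination scheme changes.
        response = self._request_json(next_url)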


A couple of bonus features/thoughts, both from behaviors we already have in the task sheet generation script:

  • Support a threading.Event for canceling the iterator. This isn’t a huge deal, since you can stop reading from the iterator at any time, but we use this pattern in a lot of places (importing, task sheets, etc.) to coordinate early stopping, and supporting it directly helps reduce boilerplate a bit (see the sketch after this list). Basically, instead of:

    def list_all_pages(cancel=None):
        # `cancel` is an optional threading.Event that other code can set
        # to signal early stopping.
        client = db.Client.from_env()
        for page in client.list_pages():
            if cancel and cancel.is_set():
                return

            yield page

    It would be nice to be able to do:

    def list_all_pages(cancel=None):
        client = db.Client.from_env()
        yield from client.list_pages(cancel=cancel)
  • Support a way to get the meta field from the current chunk, or at least a count of the total results. In task sheets, we have a total parameter that, if true, asks for the total result count on the first chunk and yields that total as the first item of the iterator, before yielding the actual pages/versions/whatever. See here: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/3debe85b2776142136b641ddf5767646563f2588/generate_task_sheets.py#L26-L33

    That is admittedly a little wonky, but I don’t have an obviously better design. (A rough sketch folding both of these bonus ideas into the generator follows below.)
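
For concreteness, here's a rough sketch combining both bonus features in one generator. The include_total parameter and meta['total_results'] field are assumptions modeled on the task-sheets pattern linked above, and _request_json() is the same hypothetical helper as in the earlier sketch:

import threading

def list_pages(self, cancel: threading.Event = None, total=False, **params):
    if total:
        # Assumed query parameter asking the server to count total results
        # on the first chunk.
        params['include_total'] = True
    response = self._request_json('/api/v0/pages', params=params)
    if total:
        # Yield the total as the first item, like the task sheets script does.
        yield response['meta']['total_results']
    while True:
        for page in response['data']:
            # Bail out as soon as another thread sets the cancel event.
            if cancel and cancel.is_set():
                return
            yield page
        next_url = response['links'].get('next')
        if next_url is None:
            return
        response = self._request_json(next_url)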