stanfordio/truthbrush

Paginate the search command

lxcode opened this issue · 6 comments

Just loop through and replace limit with offset.

Agree.. really easy and important.

Turns out that it's a bit tougher than I thought, because of page drift (new qualifying posts appearing at the top of the search results) throwing off our offset numbering. I think we also need to add the max_id and min_id options to keep new posts from mucking things up. So maybe we want pagination that goes something like:

  1. Figure out a good min_id and max_id for whatever date range we want to search. Not sure how to do that, since search results don't seem to come in time order. I guess we could just use some preliminary search results. Or, we could figure out the right ids for some date restriction by using the pull_statuses() function of some very active account, since that can be constrained by date.
  2. Loop through searches, collecting and appending 40 statuses at a time from the results['statuses'] json
  3. Some stop condition, probably an empty pull.
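For what it's worth, the three steps above could be sketched roughly like this. It assumes a hypothetical `fetch_page(offset, limit, min_id, max_id)` callable that wraps the actual search request and returns the parsed JSON dict; that name and signature are made up for illustration and are not truthbrush's API. Posts are de-duplicated by `id` so overlapping pages don't produce repeats.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical page fetcher: (offset, limit, min_id, max_id) -> parsed JSON dict.
FetchPage = Callable[[int, int, str, Optional[str]], dict]


def paginate_statuses(
    fetch_page: FetchPage,
    limit: int = 40,
    min_id: str = "0",
    max_id: Optional[str] = None,
) -> List[dict]:
    """Collect statuses page by page until a pull comes back empty."""
    seen: Dict[str, dict] = {}
    offset = 0
    while True:
        page = fetch_page(offset, limit, min_id, max_id)
        statuses = page.get("statuses", [])
        if not statuses:  # stop condition: an empty pull
            break
        for status in statuses:
            # Keep the first copy of each post; later duplicates are overlap.
            seen.setdefault(status["id"], status)
        offset += limit
    return list(seen.values())
```

Holding min_id and max_id fixed across the whole loop is what should keep newly published posts from shifting the pages; the de-duplication only cleans up whatever overlap remains.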

This sort of works. I end up with some page overlap, but not much. In three pulls of (allegedly) 40 truths each, I end up with 104 unique posts. It's pretty kludged together, but I expanded the search() method like this:
def search( self, searchtype: str = None, query: str = None, limit: int = 4, resolve: bool = 4, offset: int=0, min_id: str='0', max_id: str=None ) -> Optional[dict]:
I also added an if statement to deal with the case where no max is specified, so you can leave out all the offset/min/max stuff and it will run as usual. I'm happy to do a pull request on this if it's useful. But maybe there is a way better way of doing this.

Hmm... looks like there is a max offset of 10k. So, we need to set relatively narrow searches, but it definitely works in principle.
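One way to respect that cap is to generate the offsets up front and stop before reaching it. This is only a sketch: the 10,000 figure is the one observed above, and whether the server accepts an offset of exactly 10,000 is an assumption, so the helper (a hypothetical name) stays strictly below it.

```python
MAX_OFFSET = 10_000  # cap observed in this thread; exact server behavior is assumed


def page_offsets(limit: int = 40, max_offset: int = MAX_OFFSET) -> range:
    """Offsets 0, limit, 2*limit, ... that stay strictly below max_offset."""
    return range(0, max_offset, limit)
```

With 40 results per page that allows at most 250 pages per query, so a search that matches more posts than that would need to be split into narrower min_id/max_id windows.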

@patrick-lee-warren Thanks for digging into it. Feel free to submit a PR and we can poke at it further. FWIW the end condition appears to just be a JSON object with empty lists: {"accounts":[],"statuses":[],"hashtags":[]}
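A small check for that end condition might look like the following. The helper name is hypothetical, and treating a missing key the same as an empty list is an assumption rather than something confirmed about the API.

```python
def is_empty_result(payload: dict) -> bool:
    """True when accounts, statuses, and hashtags are all empty or missing."""
    return not any(payload.get(key) for key in ("accounts", "statuses", "hashtags"))
```

A pagination loop could call this on each response and break as soon as it returns True.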

Ok, should be workable now.