DomainTools/python_api

Pagination improvements

Closed this issue · 3 comments

Hello!

There are two aspects to this ticket.

Document that the API only returns the first 500 results

This seems important to note: the API (notably iris_investigate, your most used endpoint) returns only the first 500 results by default. This is not stated in the main places you might expect it to be:

https://github.com/DomainTools/python_api
https://github.com/DomainTools/python_api/blob/master/domaintools/api.py#L277

The documentation string for iris_investigate could even be construed to indicate that all of the results are returned:

 You can loop over results of your investigation as if it was a native Python list:
            for result in api.iris_investigate(ip='199.30.228.112'):  # Enables looping over all related results

Handle pagination natively within the library

Wanting all results for a query, rather than just the first 500, seems like a common use case. I tried the most obvious approach, passing limit=5000 as an argument to iris_investigate:

with domaintools_obj.iris_investigate(search_hash=SEARCH_HASH, limit=5000) as results:
    for result in results:
       ...

However, this appears to have no effect. Having inspected the library code, I don't think this is a valid argument.

If this is the case, it would be nice if pagination were handled within the library.

Thanks,
Tom

For anyone reading this who wants to do this before the client library handles it natively, it's not too hard:

def get_paginated_dt_results(query, position=None, results=None, limit=500):
    # Default to None: a mutable default list would be shared across calls.
    if results is None:
        results = []
    with domaintools_obj.iris_investigate(search_hash=query, position=position) as dt_results:
        for result in dt_results:
            results.append(result)
        # Stop once at least `limit` results have been collected.
        if len(results) >= limit:
            return results
        # Otherwise fetch the next page, if the API reports one.
        if dt_results['has_more_results']:
            position = dt_results['position']
            return get_paginated_dt_results(query, position=position, results=results, limit=limit)
    return results

There appears to be a related issue in the example. If the number of results is an exact multiple of the page limit (500), the last page will contain "has_more_results": true, but there will be no position.
https://github.com/DomainTools/python_api/blob/main/examples/retrieving_all_results_in_paginated_return.py

>>> while response['has_more_results']:
...     response = dt_api.iris_investigate(search_hash=query_hash, position=response['position'])
...     results.extend(response['results'])
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "./domaintools/base_results.py", line 188, in __getitem__
    return self.response()[key]
KeyError: 'position'

Adding "and 'position' in response" to the while condition works around the server response issue:

from domaintools import API


dt_api = API(USER_NAME, KEY)
query = "SEARCH_HASH"
response = dt_api.iris_investigate(search_hash=query)
results = response['results']
while response['has_more_results'] and 'position' in response:
    response = dt_api.iris_investigate(search_hash=query, position=response['position'])
    results.extend(response['results'])

print(results)
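
The same loop can also be wrapped in a generator so that callers can stop early instead of collecting everything. This is only a minimal sketch: iter_all_results, fetch_page, and the fake pages are hypothetical names and made-up data, not part of the domaintools library. It assumes each page behaves like a mapping with 'results', 'has_more_results', and (usually) 'position', as in the loop above.

```python
from itertools import islice


def iter_all_results(fetch_page):
    """Yield every result across pages.

    fetch_page is a hypothetical callable: fetch_page(position) returns one
    page as a mapping with 'results', 'has_more_results' and, usually,
    'position'. The first call receives position=None.
    """
    position = None
    while True:
        page = fetch_page(position)
        yield from page['results']
        # Stop when the server reports no more pages, or when it claims there
        # are more but omits the position needed to request them.
        if not page['has_more_results'] or 'position' not in page:
            return
        position = page['position']


# Stand-in for the API so the sketch runs without credentials (made-up data).
FAKE_PAGES = {
    None: {'results': ['a.example', 'b.example'],
           'has_more_results': True, 'position': 'p2'},
    # Last page: claims more results but carries no 'position'.
    'p2': {'results': ['c.example', 'd.example'],
           'has_more_results': True},
}


def fake_fetch(position):
    return FAKE_PAGES[position]


all_results = list(iter_all_results(fake_fetch))
# islice caps the number of results; later pages are only fetched if needed.
first_three = list(islice(iter_all_results(fake_fetch), 3))
```

With the real client, fetch_page could presumably be something like lambda pos: dt_api.iris_investigate(search_hash=query, position=pos), assuming position=None is accepted on the first call as in the recursive helper earlier in this thread.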

Thank you for the comment, @malvidin. We believe we've addressed this edge case in our API response for Iris Investigate.