Wikia/sroka

developing intake-sroka

martindurant opened this issue · 6 comments

This is a very convenient interface to several data provider APIs that have been requested in the context of the Intake and Dask projects.

I have just created intake-sroka, so that specific queries to the APIs can be saved as data sources, and stored in Intake's cataloging system. You are very welcome to comment and participate, to bring such data to wider attention!

I wonder, have you thought about how to access data in a parallel or distributed way? Many query outputs might be partitionable, and Dask dataframe makes it easy to turn a set of dataframe partitions into one logical dataframe for parallel, out-of-core and/or distributed processing. We already do this, for example when reading from parquet or SQL servers.
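To illustrate what I mean, here is a minimal sketch of that pattern; `run_query` is a hypothetical stand-in for a real API call:

```python
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def run_query(query, page):
    # stand-in for a real API call returning one page of results
    return pd.DataFrame({"page": [page], "query": [query]})

parts = [run_query("impressions by day", page) for page in range(4)]
ddf = dd.from_delayed(parts)  # one logical dataframe, four partitions
print(ddf.compute())          # assembled back into a single pandas frame
```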

Thank you! We're really happy that you were able to include our library in the Intake project. We will take a look at intake-sroka.

As for parallel/distributed data access, this is a very interesting idea. For now it is not on our roadmap, but I will label this issue as an enhancement and we can return to it later (or maybe there will be other contributors - or you - who would like to add it to sroka?). For some of the APIs I think it may be problematic (due to restrictions on the number of queries and the time required between queries). Also, for some data sources (like MOAT, Rubicon) the data is not complex enough for it to really make sense; for others it definitely would.

As for Dask dataframe itself, you mean that it would be helpful to have output available also as Dask dataframes?


No, I don't think it would be necessary to do this on your end, especially if intake-sroka is to become fully functional. What Dask normally needs is the following (see the sketch after the list):

  • some sensible way to split the query into chunks, which may require, for example, a pre-query to determine a set of facets, one for each piece
  • a way to fetch each piece, usually a function that takes the query and a value of the facet from the previous task, and returns a pandas dataframe
  • a client which can handle multiple simultaneous requests in the same process or between processes. Since these are all REST HTTP calls, I expect this is already true.
  • ideally, knowledge of the fields/dtypes expected in the output without doing any read on the server, or perhaps a very minimal read (such as the first ten items).
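Putting those pieces together, a rough sketch of the wiring, where `list_facets` and `fetch_facet` are hypothetical placeholders for real API calls and only the Dask part is concrete:

```python
import pandas as pd
import dask
import dask.dataframe as dd

def list_facets(query):
    # pre-query: e.g. the set of dates covered by `query`
    return ["2019-01-01", "2019-01-02", "2019-01-03"]

def fetch_facet(query, facet):
    # one REST call per facet, each returning a pandas DataFrame
    return pd.DataFrame({"date": [facet], "value": [1.0]})

def to_dask(query):
    facets = list_facets(query)
    # `meta` declares the columns/dtypes so Dask need not read any data up front
    meta = pd.DataFrame({"date": pd.Series(dtype="object"),
                         "value": pd.Series(dtype="float64")})
    parts = [dask.delayed(fetch_facet)(query, f) for f in facets]
    return dd.from_delayed(parts, meta=meta)

ddf = to_dask("impressions by day")
print(ddf.npartitions)  # one partition per facet
```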

As you say, it may well not make sense to bother for some of the APIs - you know what kind of data to expect in each much better than I do - but for the cases where there is big data and parallel access might be beneficial, it would certainly be nice to have.

How do you go about testing your calls? Intake-sroka would likely want to copy any methods you have (since I don't actually have easy access to the real APIs).

In addition, I guess your APIs also bring up the question of auth: it may be good to do this once and pass the auth objects around between processes, rather than having to re-authenticate in every task. Or maybe it's fast, because only reading some JSON file from disk is needed. I simply don't know, so it's worth discussing.
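If authentication really is just a file read, the authenticate-once pattern could look something like this sketch; it assumes a local credentials.json, and the token contents and `fetch_with_token` are hypothetical:

```python
import json
import dask

def load_token(path):
    # the fast case mentioned above: a single JSON read from disk
    with open(path) as f:
        return json.load(f)

@dask.delayed
def fetch_with_token(token, query):
    # hypothetical: reuse the pre-loaded token for the REST call, instead
    # of re-authenticating (e.g. Google's interactive link flow) per task
    return {"query": query, "client_id": token.get("client_id")}

token = load_token("credentials.json")  # authenticate once, up front
tasks = [fetch_with_token(token, q) for q in ["q1", "q2"]]
print(dask.compute(*tasks))
```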

I'll be circling back to this shortly. The best help I could use is a way to test the data-read functions without actually connecting to the cloud or having valid credentials. Do you have a mocking solution or other testing infrastructure I can use?
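For concreteness, one common shape for such credential-free tests is to patch the network-facing function with unittest.mock; `run_query` here is a local stand-in, not sroka's real API:

```python
import pandas as pd
from unittest import mock

# stand-in for the function that actually hits the network;
# the real call would require credentials
def run_query(query):
    raise RuntimeError("needs network access and valid credentials")

def test_read_without_credentials():
    fake = pd.DataFrame({"date": ["2019-01-01"], "value": [1.0]})
    # patch the network-facing function so the test never leaves the process
    with mock.patch(f"{__name__}.run_query", return_value=fake):
        df = run_query("impressions by day")
        assert list(df.columns) == ["date", "value"]

test_read_without_credentials()
```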

For Dask: thank you for the clarification. I think that would definitely be helpful with s3 data, and maybe GAM data too, as this can be pretty complex.
For auth: in most cases it is fast. The one different case is the first authentication for Google products, as those require authenticating through a link.
For a mocking solution/testing infrastructure: we don't have a ready solution for you yet, but this is something important for us too for testing purposes, and we are working on it. As for connections, we have notebook scenarios that we always use, but as you said, some tests should be runnable without credentials. Since we are working on tests, do you have any specific use cases that you'd like included? We will add those to this repository when ready.

When I have rounded out intake-sroka more, I will be sure to be in touch and get you to test against your data and credentials, at least until there are concrete ways to test some of the services without them.