DOI-USGS/dataretrieval-python

Consider making the functions called by `get_record()` private

mnfienen opened this issue · 2 comments

As a new user working in a Jupyter notebook, I used tab completion on the imported dataretrieval module and found the many get_... functions, including get_pmcodes. I was then surprised that it returned a tuple that includes metadata.

Looking in the code, I see that get_record is a light wrapper around all of these that returns only the df. All good, but it might be nice to push users toward get_record and make all the underlying functions private. Just a refactor from get_pmcodes to _get_pmcodes, etc. I'm happy to refactor, but wanted to see if this is something intentional for reasons I'm not seeing.

Alternatively, could there be a default in all the get functions to suppress returning the metadata unless requested? This would make it easier to know which **kwargs the underlying function needs.
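To make the "metadata only on request" idea concrete, here is a minimal sketch. It does not use dataretrieval's actual code: `_fetch_codes`, `get_codes`, and the `return_metadata` flag are hypothetical names, and a plain list of dicts stands in for the DataFrame so the example has no dependencies.

```python
from typing import Any


def _fetch_codes(code: str):
    """Hypothetical stand-in for an existing get_* function.

    Returns a (data, metadata) tuple, mirroring the behavior described above.
    """
    data = [{"parameter_cd": code, "description": "example row"}]
    metadata: dict[str, Any] = {"url": "https://example.invalid/query"}
    return data, metadata


def get_codes(code: str, return_metadata: bool = False):
    """Return only the data by default; return (data, metadata) on request."""
    data, metadata = _fetch_codes(code)
    if return_metadata:
        return data, metadata
    return data


# Default call: just the data, no tuple to unpack.
rows = get_codes("00060")

# Opt in to the metadata when it is needed, e.g. for debugging.
rows_again, md = get_codes("00060", return_metadata=True)
```

One trade-off with this design is that the return type depends on a flag, and flipping the default on existing functions would break callers who already unpack the tuple.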

I originally used get_record, but other contributors convinced me to deprecate it (it retains its legacy behavior).

Metadata was a later addition. Nobody ever liked how it was implemented, but this was fundamentally a limitation of pandas and its lack of a metadata standard.
Ideally, we'd put the metadata in pandas.DataFrame.attrs, but pandas has flagged attrs as experimental, and it may change without warning...
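For reference, a minimal sketch of what carrying metadata in `DataFrame.attrs` could look like. The column names, metadata keys, and URL are made up for illustration, and this is not how dataretrieval currently returns metadata.

```python
import pandas as pd

# Hypothetical query result; the columns and values are placeholders.
df = pd.DataFrame({"parameter_cd": ["00060"], "value": [1.0]})

# attrs is a plain dict attached to the DataFrame, so metadata can travel
# with the data instead of being returned as a second tuple element.
df.attrs["url"] = "https://example.invalid/query"
df.attrs["comment"] = "hypothetical metadata entry"

# Callers who only want the data ignore attrs; others read it back directly.
query_url = df.attrs["url"]
```

Because attrs is experimental, it is safest not to assume the dict survives every pandas operation (merges, concatenation, and so on).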

My plan was to continue to return a tuple until pandas improves its metadata or xarray natively handles ragged arrays.
But we've waited several years already, so it's good to revisit this.

You might also prefer HyRiver, which is another great collection of packages. I frequently use both. dataretrieval has a much simpler credo, which is to do one thing well.

Right on - thanks for the info.

I'm pretty stoked on this project being simple and being supported by USGS. There are other packages out there as well but they are just complicated and I like the idea of staying focused on core functionality.

So - using the various specific get functions makes good sense too. Maybe go with the default idea: allow a request for the metadata, but return only the dataframe by default? I may be missing something, but it seems like the metadata is more valuable for debugging than for general use?