DOI-USGS/dataretrieval-python

Tidy or not

Closed this issue · 4 comments

We need to decide which functions return tidy data or not. Tidy data is typically more convenient for small datasets and more familar to pandas users, but it all also requires much more space in memory. My original thinking was to implement two types of functions in data_retrieval. Either function could return tidy or not, but one type would default to tidy and the other would used the defaults of the REST API. The tidy functions would wrap the lower-level, 'untidy' functions' and provide a more user-friendly interface to the library. Any data returned by the tidy functions would have a consistent structure and indexing, so that a user could combine combine datasets from different sources without much fuss.
The untidy functions, on the other hand, provided a more transparent interface to the REST API's. In that sense they are much more flexible and can access a wider range of options that may not be accessible through the tidy functions. Many, but not all, of the REST API's don't tidy by default, because it's more performant not to.
Is this approach reasonable or is there a better way?

In my opinion, there is some value in having functions for downloading the data in its original format, and separate functions for translating that format into other formats.

The NWIS data format is capable of keeping track of very complex situations, and most translations into other formats end up losing some of the information in the original. The WaterML format is like a 4-dimensional data table, with dimensions that include: the site, time of observation, the parameter measured, and data/data-qualifier flags.

Usually if you translate the original JSON or WaterML into a Pandas table, you would make each row a unique time stamp to take advantage of Pandas great time indexing, and each column would hold the data for a different site. But that leaves out many of the original dimensions! For example, one site might have data for two parameters, which another site might have data for only one parameter. You can deal with that by making three columns, and having each column represent a unique combination of site & parameter. But now where do you store the data qualifier flags? The USGS provides a data qualifier flag for every observation to indicate if it is 'Approved' or 'Provisional' or affected by ice or something. Most systems throw this data out. Finally, the time dimension can be really complicated. You can have sites that collect one parameter every five minutes, and another parameter just once a day. One site may have been collecting data for 50 years, another site for just 1 year in the 1990's. A typical way of formatting data in Pandas is to choose a regular time interval and have each row represent a time. If your shortest sampling interval is 5 minutes, then you need to resample every site, every parameter down to 5 minutes, starting at the time of the first observation and ending with the last observation. This means that most cells will have a 'no data' value in them.

My understanding of the tidy format is that it might be the most efficient in terms of space because it does not need to store 'no data' values.

I may be wrong, but I think that for WaterML data, the Tidy format would make each row a unique combination of site and timestamp, and the actual observation would go into a single 'Value' column. Also, each parameter would be stored in a different table, so the 'Value' column could be renamed after the parameter, like "Discharge" or "Stage". Since every observation is also paired with a qualifier flag, you could probably safely make a second column that contains the qualifier flag instead of making a qualifier table that mirrors each of the parameter tables.

To use this "Long" format, you would need to do a lot of filtering before you try to plot a dataset or calculate a daily average. But the table would be completely flat, so it could be stored in one of the new columnar big-data formats like Parquet.

I'm sorry for this wall of text.

In hydrofunctions, I've tried to make a layer of functions that request the data, and then another layer of functions that re-format the data into dataframes, and then an object-oriented API that the user interacts with to get work done. As my thinking has evolved on how to deal with these complex datasets, I haven't had to change the API for the lowest level of functions too much.

TLDR: your thinking may change over time, so keep it simple and don't reformat the data until you have to.

Thanks mroberge,
I see your well familiar with some of the challenges posed by NWIS.
The plan is to keep dataretrieval pretty narrowly focused on just doing data retrieval. Data transformations (like resampling) would probably be done elsewhere. At some point we'll publish more manual/doc with some usage examples, that would show how to do some common functions in pandas: resampling, merging, rollingmeans.
We will however offer some basic tidying utilities...which brings me to your point on Parquet. Dataretrieval leaves it up to you to decide how to store your data on disk. My point about poor performance of tidy data was more about the space of the dataframe in memory. For example, if you wanted to put all the world's water quality data in one tidy dataframe, you'd be left with one very very large and very very sparse pandas dataframe: one row for every site/timestamp and one column for every parameter, qualifier, etc. But as you point out, it's easy to compromise some tidyness for performance. I think that sort thing will be outside the scope of dataretrieval. We realize there is no one 'best' way to format your data, so we won't try. Rather our goal is mainly to maintain a python interface to the USGS REST APIs, so that you don't have to.
Thanks for pointing me to hydrofunctions. I'll check it out.
PS. Readers not familiar with the term 'tidy data', should see
https://vita.had.co.nz/papers/tidy-data.pdf

It would really help if you added some docstrings at the package level as well as for some more of your functions. I had to go into the code to figure out how to use it!

I was about to do a major release for Hydrofunctions when the virus hit. It has some major updates to the docs and includes tools for reading the USGS services that use RDB formatting, like the field data and peaks services.

I would be happy to port some of my code to dataretrieval, port some of your code to Hydrofunctions, and then separate our concerns. Hydrofunctions could then import dataretrieval and Ulmo to do the retrieval.

One aspect of data retrieval from the NWIS that I've tried to focus on is creating helpful error messages for users. It is not always clear to new users why NWIS returns { }, but it also can return some really helpful error messages embedded in HTML that I've tried to capture.

There are new EPA efforts underway to start to tidy WQP results in R and python (both depend on dataretrieval). Really early stage and would welcome any feedback.