openaq/openaq-data-format

Changing name of "averaging_Period" and adding field that truly indicates time resolution


Suggest changing currently defined 'averaging_Period' to 'reporting_Interval' and adding in true 'averaging_Period'

The motivation for this suggested change:
As I understand it, 'averaging_Period' currently describes the interval at which a datapoint is updated on a source's site, e.g. UB data is updated on an hourly basis on the website. But this is not the same as the temporal resolution of the measurement itself. In a lot of cases the two happen to be the same, but not necessarily.

Just as context, some examples of the adapters I worked on:

  • the Dutch source reports PM10 and PM2.5 as 24 hour rolling averages; the others are hourly averages.
  • the Australian data is also pretty detailed:
    [image]

If we don't explicitly know the intervalPeriod should we just use the reportingPeriod or leave it blank? If it's blank in most cases, is that something we'd be ok with?

Is reportingPeriod something that should be stored or derived from the data?

Maybe the thing to do is to have a) a reportingPeriod parameter and then another one indicating whether we have inferred the value from our initial analysis of the system or we know it explicitly, either from the source page itself or communication with a given agency, and then b) an averagingPeriod and associated parameter that can do the same.
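To make the proposal above concrete, here is a minimal sketch of what such a record could look like. All class and field names (`Period`, `Measurement`, `inferred`, etc.) are hypothetical and not part of the current data format; the hourly reporting period for the Dutch PM10 example is also an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Period:
    """A period plus a flag saying how we know it (hypothetical structure)."""
    value: Optional[float]  # length of the period; None when unknown
    unit: str = "hours"
    inferred: bool = True   # True: inferred by us; False: stated by the source or agency


@dataclass(frozen=True)
class Measurement:
    parameter: str
    reporting_period: Period  # how often the source publishes a new value
    averaging_period: Period  # time window the reported value is averaged over


# Dutch PM10: a 24 hour rolling average (explicitly known), assumed here
# to be published hourly (inferred).
dutch_pm10 = Measurement(
    parameter="pm10",
    reporting_period=Period(value=1, inferred=True),
    averaging_period=Period(value=24, inferred=False),
)

# A source where the averaging period is simply not known is left blank.
unknown_avg = Measurement(
    parameter="o3",
    reporting_period=Period(value=1, inferred=True),
    averaging_period=Period(value=None, inferred=True),
)
```

Keeping the flag next to each value lets a user filter on explicit vs. inferred metadata without a separate lookup, and a blank averaging period stays distinguishable from one that was copied over from the reporting period.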

I think in a lot of cases we'll need to infer both the reportingPeriod and averagingPeriod (that's essentially what we've been doing, by and large, to date), and we'll probably be pretty accurate in doing so. But at least a user can then see for themselves what we inferred vs. what is explicit information. (If we go the route that makes it clear whether we inferred data or not, we'll probably need to do it with coordinates as well?)

I would think we don't want to derive the reportingPeriod from the data in some sort of continuous manner, at least for scientific purposes: when data drops out, the derived value will look weird/misleading and make the data more difficult to use, at least for a good chunk of the use cases I can think of. But, as @jflasher pointed out (and @olafveerman did at dinner), I see how this could be used from a 'system health' standpoint to help detect periods when data drop out. Perhaps that's a separate something called averageReportingPeriod? I realize this is a bad and confusing name for it. :)
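As a rough illustration of the 'system health' idea, the sketch below keeps the stored reportingPeriod fixed and only uses the observed gaps between timestamps to flag dropouts. The function names and the sample timestamps are made up for this example; nothing here reflects how the platform actually does it.

```python
from datetime import datetime, timedelta
from statistics import median_low


def observed_intervals(timestamps):
    """Return the gaps between consecutive measurement timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def find_gaps(timestamps, expected):
    """Return (start, end) pairs where the gap exceeds the expected reporting period."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected
    ]


# Synthetic hourly series with the 03:00 and 04:00 values missing.
start = datetime(2016, 1, 1, 0, 0)
stamps = [start + timedelta(hours=h) for h in (0, 1, 2, 5, 6)]

# The typical observed interval is still one hour despite the dropout.
typical = median_low(observed_intervals(stamps))

# The dropout shows up as a single gap between 02:00 and 05:00.
gaps = find_gaps(stamps, expected=timedelta(hours=1))
```

Using a median rather than a mean keeps a few dropouts from skewing the typical interval, which is one reason a derived value like this seems better suited to monitoring than to replacing the stored reportingPeriod.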