Entity names
I'm wondering if we want to have a standard for variable names. In PRIMAP1 it's all upper case letters. For PRIMAP2 we have specified a way to add GWP information to variable names, but no convention for the variables themselves. I think all uppercase is sometimes hard to read. I think we should have a specification to simplify running code on different datasets.
Usually, variables are gases, in which case it would make sense to use the same capitalization as openscm (e.g. CO2). What other entities are there? Gas baskets like F-gases and population, right?
There will be a lot of economic variables (different GDP variants, etc.).
I agree on the variables available in openscm. But openscm doesn't have the baskets, right?
Nope, no baskets in openscm.
Is there some other standard (maybe from the IIASA universe) that we can follow? If not, we have to write one ourselves, but it would be less work if there is already something. (-:
I don't know for sure, but I think the IIASA databases have some standard, though I doubt it's described anywhere.
From pyam, I found this "standard": https://data.ene.iiasa.ac.at/database/
However, it doesn't actually define a lot of interesting variables and is very wordy. CF conventions unfortunately don't deal with any socio-economic variable names (and are very wordy for emissions).
FAOstat has entity lists, but they use codes instead of shorter names, which I think is pretty user-hostile (hey there, here you got data for `EL-3148`).
Maybe we should make our own list? If so, then what should our rules be? Use normal English capitalization rules, so that we end up with `population`, `kyotoghg`, `F-gases`, etc.?
I think that in case we use population, kyotoghg, F-gases, I would be for KyotoGHG (even though it is not great...)
And I also would vote against using codes as in FAOstat.
Would it make sense to have a list with normal English capitalization rules, but then convert it to uppercase for internal use, so that wrong capitalization does not crash a program?
The World Bank also has an entity list, but I don't know if we want to use it: e.g. `GDP (constant 2010 US$)` ends up as `NY.GDP.MKTP.KD`, which is totally non-transparent to me.
@AnnGuenther: Do you have a reason for KyotoGHG? Just because Kyoto is a name and GHG is an abbreviation and therefore English capitalization rules yield KyotoGHG, or for another reason?
No other reason, just the ones you listed.
I don't really like silently correcting e.g. capitalization. A `KeyError` when I use `Population` instead of `population` is pretty easy to interpret and the error is immediate. If we do have some normalization happening, we have to remember to do this normalization everywhere, otherwise using `Population` works at first and breaks later, which will be harder to diagnose.
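To make the trade-off concrete, here is a minimal sketch of both behaviours. The `ENTITIES` table, its units, and the two helper functions are hypothetical and only serve to illustrate the argument; they are not part of primap2.

```python
# Hypothetical entity table; names follow the capitalization discussed above.
ENTITIES = {"population": "people", "KyotoGHG": "Gg CO2 / year", "CO2": "Gg CO2 / year"}


def unit_strict(entity: str) -> str:
    """Strict lookup: a wrongly capitalized name fails immediately with KeyError."""
    return ENTITIES[entity]


# Index built once for the normalizing variant (assumes no two names differ only by case).
_UPPER_INDEX = {name.upper(): name for name in ENTITIES}


def unit_normalized(entity: str) -> str:
    """Lenient lookup: capitalization is silently corrected."""
    return ENTITIES[_UPPER_INDEX[entity.upper()]]


try:
    unit_strict("Population")
except KeyError as err:
    print(f"strict lookup failed right away: KeyError({err})")

# The lenient lookup accepts the same input...
print(unit_normalized("Population"))
# ...but any code path that skips the normalization still does not know the name.
print("Population" in ENTITIES)
```

With the lenient variant, every access to the data has to go through the normalizing helper; a single direct dictionary or index access elsewhere reintroduces the late failure described above.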
I've started adding entity names to a terminology over in climate_categories. So far, there are only emission rates of "gases" from openscm_units ("gases" is wrong here, because e.g. black carbon is not a gas, but what other word is more correct here?), but maybe you can have a look if the level of detail and general idea seems good.
It's all bundled in a PR: primap-community/climate_categories#1
The definition is here: https://github.com/pik-primap/climate_categories/pull/1/files#diff-c28a5ab1cbffcb57d64c46d658e69f373450cce100d9a8f70c72b89648a45f16
Maybe we can continue the discussion in the pull request.
(climate) forcers or drivers instead of gases?
Did you take a look at the https://github.com/openENTRANCE/nomenclature project? Not sure whether it's as wordy as the `Default|IPCC|Emissions|Inventories`.
@danielhuppmann is pretty keen on interop so he might have ideas.
Thanks for looping me in @rgieseke - had a look at the discussion so far and the referenced PR. Not sure whether I understand the objective here, but two (more concrete) references to related work.
- In openENTRANCE, we tried to formalize the (previously implicit) guidelines for variable names in the IIASA & IAMC universe, with an aim for readability. See here.
- The current variable definitions in openENTRANCE follow the IPCC SR15 scenario ensemble (which in turn is based on CD-LINKS, ADVANCE, ...). For the question at hand, see how Kyoto-GHG with a specific GWP conversion metric is named here
Hi,
thanks for chiming in!
Information for context: There are two things happening in primap2 land at the moment:
- We are looking to get all the terminologies from PRIMAP1 (not publicly available, I'm afraid) into primap2 so that we can read in all data that we need to.
- We figured that since there seems to be no easy-to-install python package which contains commonly used terminologies like the IPCC categories in computer-readable format, we should build one.
I had a look at the openENTRANCE/nomenclature project before embarking on building our own package, but as far as I could tell from the available documentation, the goal is different there. E.g., there is no hierarchy of IPCC1996 and IPCC2006 categories and I also would not be sure how it fits in your format (would category `1.A.3.b.iii` be `Emissions|IPCC2006|1|A|3|b|iii`? And N2O emissions in this category would then be `Emissions|IPCC2006|1|A|3|b|iii|N2O`?). For me, it looked like openENTRANCE/nomenclature is specifically for the openENTRANCE project and its data format with exactly six dimensions, but we needed something more general.
That said, we can look if we can re-use some of the definitions of openENTRANCE/nomenclature for primap2.
Cheers,
Mika
Thanks for the context! Don't want to overload this conversation, so my response is as concise as possible - and let's have a follow-up (spoken) discussion somewhere else if there is interest...
The goal of the openENTRANCE nomenclature:
- build a list of variables (definitions) starting from previous projects such that it can be extended in later projects
- readability is key - hence the yaml file format and very descriptive variable definitions
- the Python package is a utility to facilitate working with it - but if someone wants to use the yaml files in an R workflow or copy-paste to Excel, that is fine (and is going to happen, given our user base)
Re your question about 1.A.3.b.iii, I would implement in our yaml lingo as
```yaml
Emissions|N2O|Energy|Transportation|Road|Heavy Duty Trucks and Buses:
  definition: <bla>
  unit: kt CO2e
  ipcc_2006: 1.A.3.b.iii
  notes: <bla>
```
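As a rough sketch of how such an attribute could be used, assume the YAML definitions have already been loaded into a plain Python dict. The first entry mirrors the example above; the second entry is a hypothetical sibling added only to show that one IPCC category can map to several variables. This is not the API of the nomenclature package.

```python
from collections import defaultdict

# Variable definitions as they might look after loading the YAML file into a dict.
variables = {
    "Emissions|N2O|Energy|Transportation|Road|Heavy Duty Trucks and Buses": {
        "unit": "kt CO2e",
        "ipcc_2006": "1.A.3.b.iii",
    },
    # Hypothetical second entry for illustration only.
    "Emissions|CO2|Energy|Transportation|Road|Heavy Duty Trucks and Buses": {
        "ipcc_2006": "1.A.3.b.iii",
    },
}

# Reverse index: IPCC 2006 category code -> all variables tagged with it.
by_ipcc_2006 = defaultdict(list)
for name, attrs in variables.items():
    by_ipcc_2006[attrs["ipcc_2006"]].append(name)

for name in by_ipcc_2006["1.A.3.b.iii"]:
    # In this naming scheme the second path element is the species.
    print(name.split("|")[1], "->", name)
```

The category hierarchy itself (1 > 1.A > 1.A.3 ...) is not represented here; it would have to come from a separate terminology.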
You should also take a look at the OpenEnergyOntology (h/t @Ludee & @christian-rli) - they use a formal ontology framework to write their definitions and interrelations...
@danielhuppmann alerted me to this issue. To pick up on one point:
> The world bank also has an entity list, but I don't know if we want to use it, e.g. `GDP (constant 2010 US$)` end up as `NY.GDP.MKTP.KD`, which is totally non-transparent for me.
This is broader than the World Bank; it reflects the use of SDMX (https://sdmx.org/?page_id=5008, https://datahelpdesk.worldbank.org/knowledgebase/articles/1886701-sdmx-api-queries) which provides an information model that can cover most climate/energy/etc. use cases (at least, all that I've seen). A key like `NY.GDP.MKTP.KD` might seem opaque per se, but I'd argue that it reflects a more mature, thoroughly-considered approach to problems that we often try, unnecessarily, to solve anew.
As briefly as possible:
- Specific data dimensions are linked to abstract concepts;
- In a particular data structure definition (DSD) / data sets that are “structured by” that DSD, that concept can be represented by codes from a particular codelist (browse many: https://registry.sdmx.org/items/codelist.html)
- Each code has an id (machine-readable), and optional, multilingual name, description, and annotations.
- Applications can decide whether to use/display the id or name; whichever is more suitable.
- The code lists and concept schemes are published (incl. versioned) and referenced from DSDs; and the DSDs are referenced by data.
`NY.GDP.MKTP.KD` is a composite:
- There are 4 dimensions/concepts here, separated by `.`.
- The 2nd gives the thing measured. `GDP` is the ID (short, machine-readable) of one code; the plain-language name (in English) might be “Gross domestic product”.
- The 4th is the inflation method applied. `KD` is the ID; the English name might be "Constant 2010 US dollars".

So a different key/composite like `NY.GDP.MKTP.CD` (also visible in the WB WDI glossary) conveys that 3 concepts are the same, but the last is different; `CD` is the ID of a different code, with a different name (“Current dollars”).
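A small sketch of this decomposition, under stated assumptions: the field names (`topic`, `measure`, `detail`, `price_basis`) are illustrative labels I made up, not official SDMX or World Bank concept IDs, and the two name tables stand in for published code lists, using the names suggested above.

```python
from typing import NamedTuple


class IndicatorKey(NamedTuple):
    # Illustrative labels for the four dot-separated positions.
    topic: str
    measure: str
    detail: str
    price_basis: str


# Tiny stand-ins for code lists: id -> English name.
MEASURE_NAMES = {"GDP": "Gross domestic product"}
PRICE_BASIS_NAMES = {"KD": "Constant 2010 US dollars", "CD": "Current dollars"}


def parse_key(key: str) -> IndicatorKey:
    """Split a composite key into its four dot-separated codes."""
    parts = key.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 codes, got {len(parts)}: {key!r}")
    return IndicatorKey(*parts)


constant = parse_key("NY.GDP.MKTP.KD")
current = parse_key("NY.GDP.MKTP.CD")

# Compare the two keys concept by concept.
differing = [f for f in IndicatorKey._fields if getattr(constant, f) != getattr(current, f)]
print(differing)
print(MEASURE_NAMES[constant.measure], "/", PRICE_BASIS_NAMES[constant.price_basis])
print(MEASURE_NAMES[current.measure], "/", PRICE_BASIS_NAMES[current.price_basis])
```

An application can then choose per dimension whether to show the ID or the name, which is the flexibility argued for above.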
Publishing and referring to such code lists is, IMO, much better than trying to cram all metadata into labels on every data set.
For instance, at https://registry.sdmx.org/items/codelist.html one can see the Eurostat (`ESTAT`) code list for the "area" concept (`CL_AREA`). Notice that they provide all possible definitions of the "EU". A reference to this code list, and the use of a code from this list, is 100% unambiguous about what is represented, while allowing precision and fine distinctions.
Over at transportenergy/database#62 we're trying to take this approach, namely:
- Define the distinct concepts relevant to some or all data, using IDs, names, and descriptions.
- Create (or use existing) code lists for each, again with IDs, names, descriptions, annotations, and sometimes hierarchy.
After having done so, it's certainly possible to:
- collapse the IDs of codes for multiple concepts into a key like `NY.GDP.MKTP.KD`, or
- collapse the names into a “variable” name using some string formatting. (This is what the WB does for the WDIs; see https://databank.worldbank.org/AjaxDownload/FileDownloadHandler.ashx?filename=WB_WDI_DSD.xml&filetype=DSD).
But it's also possible to handle data in its original dimensions (one per distinct concept), or (as analysis requires) to restore those dimensions when receiving data that's labeled with a collapsed "variable name".
Apologies for a long comment!
@khaeru
Thank you for the pointers and explanations.
I am still a bit confused by the code concept there. For example, where can I find the dimensions/concepts to fully decode e.g. `NY.GDP.MKTP.KD`? I browsed the SDMX websites and also the explanations at the World Bank, and couldn't find any pointers to what the "parts" of each code mean, always just what the full code means, and lots of useful, but not directly related ontologies at SDMX.
@mikapfl sorry, I should have included that URL: https://datahelpdesk.worldbank.org/knowledgebase/articles/201175-how-does-the-world-bank-code-its-indicators
To be clear, the World Bank uses these internally, but does not publish separate SDMX code lists for the constituent parts, because they don't intend to publish data for/support general public usage of all combinations. Instead (last URL in my first comment) they provide a code list called "SERIES" that includes some of these composite codes but also others, based on other schemes.
To expand a little on my point about "collapsing": for instance, if data (e.g. for a measure like <id=EMI, name=Emissions>) has conceptual dimensions like "Species" (coded as <id=CO2; name=Carbon Dioxide>, <id=N2O>, etc.) and "Sector" (coded as <id=T, name=Transport>, <id=A, name=Agriculture>, etc.):
Measure | Species | Sector | Value |
---|---|---|---|
Emissions | CO2 | T | 1.1 |
Emissions | N2O | T | 0.2 |
Emissions | CO2 | A | 3.3 |
Emissions | N2O | A | 0.4 |
…then, one defines a new code list "VARIABLE" using a simple & transparent algorithm, e.g.:
```python
from itertools import product

for measure, species, sector in product(…):
    # Mixing IDs and names is fine, according to need, as long as we're clear what is done
    id = f"{measure.name}|{species.id}|{sector.id}"
    name = f"{measure.name} of {species.name.lower()} from {sector.name.lower()}"
    # Store the mapping to full dimensions in the description; this could be done in several ways
    description = f'{{MEASURE="{measure.id}", SPECIES="{species.id}", SECTOR="{sector.id}"}}'
    # (create and store a code)
```
…giving:
- "Emissions|CO2|T"
- "Emissions of carbon dioxide from transport"
- {MEASURE="EMI", SPECIES="CO2", SECTOR="T"}
Then publish the data with 3 distinct conceptual dimensions collapsed to 1:
Variable | Value |
---|---|
Emissions\|CO2\|T | 1.1 |
Emissions\|N2O\|T | 0.2 |
Emissions\|CO2\|A | 3.3 |
Emissions\|N2O\|A | 0.4 |
…and the VARIABLE codelist, which includes all the information needed for users to restore the actual dimensions, if they want.
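A self-contained sketch of the whole round trip described above, from separate code lists to a collapsed "Variable" column and back. The `Code` class, the `variable_codelist` layout, and the name "Nitrous oxide" for `N2O` are assumptions for illustration; SDMX itself stores this information differently.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Code:
    id: str
    name: str


measures = [Code("EMI", "Emissions")]
species_list = [Code("CO2", "Carbon Dioxide"), Code("N2O", "Nitrous oxide")]
sectors = [Code("T", "Transport"), Code("A", "Agriculture")]

# VARIABLE code list: collapsed id -> (name, mapping back to the full dimensions).
variable_codelist = {}
for measure, species, sector in product(measures, species_list, sectors):
    var_id = f"{measure.name}|{species.id}|{sector.id}"
    name = f"{measure.name} of {species.name.lower()} from {sector.name.lower()}"
    dims = {"MEASURE": measure.id, "SPECIES": species.id, "SECTOR": sector.id}
    variable_codelist[var_id] = (name, dims)

# Data as published: three conceptual dimensions collapsed to one column.
published = {
    "Emissions|CO2|T": 1.1,
    "Emissions|N2O|T": 0.2,
    "Emissions|CO2|A": 3.3,
    "Emissions|N2O|A": 0.4,
}

# A user restores the original dimensions via the code list, without parsing strings.
restored = [
    {**variable_codelist[var_id][1], "Value": value}
    for var_id, value in published.items()
]
print(variable_codelist["Emissions|CO2|T"][0])
print(restored[0])
```

Because the mapping lives in the code list rather than in the label syntax, the separator or the order of the parts in the collapsed ID can change without breaking the restoration step.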
This is what we see from the World Bank: `NY.GDP.MKTP.KD` is analogous to `Emissions|CO2|T`.
Among other reasons, I think this approach can cover the common case in energy/climate where we include multiple measures in the same data set for which different concepts/dimensions are relevant. (For instance, the "Species" concept/dimension is relevant for the "Emissions" measure, but not for "Population".) Other solutions I've seen include (a) add many columns for every dimension relevant to any one measure (overkill) and (b) split to distinct data sets/data flows, one for each measure, with the appropriate dimensions for each (SDMX does support this, but I realize it's beyond capacity for most of us at this moment).
In addition to a convention on names, it would be great to have lists of alternative names for the entities, since e.g. F-gases often have several names and notations that refer to the same gas.
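One possible shape for such a list, as a hedged sketch: a table from a canonical name to its known alternative spellings, inverted once into a lookup. The choice of canonical spellings and the alias entries are illustrative assumptions, not an agreed convention.

```python
# Illustrative alias table: canonical name -> other spellings of the same gas.
ALIASES = {
    "HFC134a": ["HFC-134a", "R-134a", "CH2FCF3"],
    "SF6": ["sulfur hexafluoride", "sulphur hexafluoride"],
}

# Invert once so that every known spelling points at exactly one canonical name.
CANONICAL = {}
for canonical, others in ALIASES.items():
    for spelling in [canonical, *others]:
        if spelling in CANONICAL:
            raise ValueError(f"ambiguous alias: {spelling!r}")
        CANONICAL[spelling] = canonical


def canonical_name(spelling: str) -> str:
    """Translate a known spelling; unknown spellings raise KeyError."""
    return CANONICAL[spelling]


print(canonical_name("R-134a"))
print(canonical_name("sulphur hexafluoride"))
```

Unknown spellings still raise a `KeyError` immediately, so this stays compatible with the preference for explicit errors over silent correction voiced earlier in the thread.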
I think this can be closed and any further discussions can happen over in https://github.com/pik-primap/climate_categories . Unless you disagree @JGuetschow, I'll close this at the end of next week.