The-Academic-Observatory/academic-observatory-workflows

Suggestion to include variable 'subtype' from Crossref


Crossref metadata contain the variable 'subtype' for records with publication type 'posted-content'. Including this variable allows e.g. distinguishing preprints from other types of posted-content in downstream analysis.

cc @cameronneylon

NB In the JSON response of the Crossref REST API, this is a top-level field, see e.g. https://api.crossref.org/works/10.5194/acp-2015-1010 line 214

However, Crossref describes it here as an attribute of posted-content called 'type': https://crossref.org/documentation/content-registration/content-type-markup-guide/posted-content-includes-preprints/#00084, and this is also how it's documented in the schema documentation: https://data.crossref.org/reports/help/schema_doc/4.4.2/schema_4_4_2.html#posted_content

So... my simple thought was that it could be added to
observatory-dags/observatory/dags/database/schema/crossref_metadata_2021-01-01.json as such:
{
"mode": "NULLABLE",
"name": "subtype",
"type": "STRING",
"description": "Enumeration, one of the type ids from https://data.crossref.org/reports/help/schema_doc/4.4.2/schema_4_4_2.html#posted_content."
},

... but that assumes the telescope workflow is using the REST API json result structure.
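
A minimal sketch of that change in Python, assuming the schema file is a JSON array of BigQuery-style field definitions. The helper name `add_schema_field` and the one-field toy schema are hypothetical and only for illustration; they are not part of the repository.

```python
import json

# Field definition as proposed above.
SUBTYPE_FIELD = {
    "mode": "NULLABLE",
    "name": "subtype",
    "type": "STRING",
    "description": (
        "Enumeration, one of the type ids from "
        "https://data.crossref.org/reports/help/schema_doc/4.4.2/schema_4_4_2.html#posted_content."
    ),
}


def add_schema_field(schema, field):
    """Return a copy of `schema` with `field` appended.

    If a field with the same name already exists, the schema is returned unchanged.
    """
    if any(existing["name"] == field["name"] for existing in schema):
        return list(schema)
    return [*schema, field]


# Toy stand-in for the real schema file (assumption: a list of field dicts).
schema = [{"mode": "NULLABLE", "name": "type", "type": "STRING"}]
updated = add_schema_field(schema, SUBTYPE_FIELD)
print(json.dumps(updated, indent=2))
```

Checking for an existing field with the same name keeps the helper safe to run more than once against the same schema.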

See pull request The-Academic-Observatory/observatory-platform#456

Hi @bmkramer,

Thanks for your pull request!
I can see we currently have the 'type' field in our schema, for which one of the options is 'posted-content'.

I haven't seen the 'subtype' field in our data so far though. The way our data pipeline is set up it should give an error when there is a field in the data that is not in our schema. The latest snapshot from 2021-05-01 was loaded into BigQuery without any issues.
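
As a rough illustration of that kind of check (a sketch only, not our actual pipeline code), the snippet below compares the top-level keys of a record against the field names in a schema. The helper `unknown_fields`, the two-field schema, and the record values are all hypothetical.

```python
def unknown_fields(record, schema):
    """Return the sorted top-level keys of `record` that are not named in `schema`."""
    known = {field["name"] for field in schema}
    return sorted(set(record) - known)


# Toy schema and record; the 'subtype' value is illustrative, not taken from a snapshot.
schema = [
    {"mode": "NULLABLE", "name": "DOI", "type": "STRING"},
    {"mode": "NULLABLE", "name": "type", "type": "STRING"},
]
record = {
    "DOI": "10.5194/acp-2015-1010",
    "type": "posted-content",
    "subtype": "preprint",
}

print(unknown_fields(record, schema))  # ['subtype']
```

An empty result for every record in a snapshot would be consistent with the load succeeding without the 'subtype' field in the schema.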

We get our data from the Crossref Metadata Plus snapshots (https://www.crossref.org/documentation/metadata-plus/metadata-plus-snapshots/). The data is in JSON format and comes from the REST API, but via the /snapshot route, so perhaps a different schema is used for the output there?

The 'subtype' field is not listed in the document here: https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md, but perhaps this document is outdated.

Do you or @cameronneylon know who would be the best person to contact, so I could ask some questions about which schema is used for the snapshots and whether we should be getting the 'subtype' field as well?

EDIT: I found the same work from your example (https://api.crossref.org/works/10.5194/acp-2015-1010) in the Metadata Plus snapshot and there is no 'subtype' field there. I suspect the schema for the snapshot route of the API is slightly different, and @cameronneylon is looking further into this.

Thanks @aroelo , and yes, we have asked Crossref about this. To be continued!

Update:
We contacted Crossref, and they informed us that this field should be included in the new snapshots soon, in either September or October.
The new snapshots should pull data directly from the REST API, so I assume that these will then be similar in format.