Mentions of Software R
Irazall opened this issue · 7 comments
Thank you for implementing and sharing this great algorithm! I use it to detect software mentions in economics.
I noticed that mentions of the software R are missing. I see some indirect mentions through R packages such as quantreg, but these are not linked to R, so I have to create the link manually. Am I missing something, or is this feature not fully implemented?
Great! We should have mentions for the R language itself, named R packages, and either named or unnamed scripts where the text says that they were written in R.
R is mentioned in a few contexts in the schema (which I'll use this question to document better). https://github.com/softcite/software-mentions/blob/master/doc/annotation_schema.md
What dataset are you using? Let's look at the PLoS dump at https://science-miner.s3.us-west-2.amazonaws.com/datasets/allOfPLOS-software-annotations_2023-05-21.zip
Look for wikidataId="Q206904", which identifies the R-language: https://www.wikidata.org/wiki/Q206904
I see it showing up with "software-type": "environment" (for R packages) and in language associated with a "software-type": "implicit" (meaning an unnamed script).
@kermitt2 can you help us look at these examples below, from the PLoS dump?
{
"type": "software",
"software-type": "environment",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.3013,
"software-name": {
"rawForm": "R statistics",
"normalizedForm": "R statistics",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.3013,
"offsetStart": 71,
"offsetEnd": 83
},
"context": "We used standard non-parametric statistical tests calculated using the R statistics software. ",
...
}
{
"type": "software",
"software-type": "environment",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.8246,
"software-name": {
"rawForm": "R statistical software",
"normalizedForm": "R statistical software",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.8246,
"offsetStart": 88,
"offsetEnd": 110
},
"context": "The entire inference and simulation process was performed using a tailored algorithm in R statistical software [67]."
...
}
{
"type": "software",
"software-type": "implicit",
"wikidataId": "Q187432",
"wikipediaExternalRef": 21490336,
"lang": "en",
"confidence": 0.5074,
"software-name": {
"rawForm": "scripts",
"normalizedForm": "scripts",
"wikidataId": "Q187432",
"wikipediaExternalRef": 21490336,
"lang": "en",
"confidence": 0.5074,
"offsetStart": 14,
"offsetEnd": 21
},
"language": {
"rawForm": "R",
"normalizedForm": "R",
"wikidataId": "Q206904",
"offsetStart": 12,
"offsetEnd": 13
},
"context": "We modified R scripts available from the supplemental material for Drummond et al. 2006 for partial correlation (factoring out only expression) and principal component regression analysis [39]. "
...
}
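To make the two cases above concrete, here is a minimal Python sketch that reports whether a single mention refers to R as the mention itself or only through its language field. The helper name r_role and the trimmed example records are illustrative assumptions, not part of the tool; only the field names and Wikidata IDs are taken from the extractions quoted above. Iterating over the dump files is left out, since their layout is not described here.

```python
R_WIKIDATA_ID = "Q206904"  # Wikidata entity for R, as quoted in this thread


def r_role(mention):
    """Return how a mention refers to R: 'software-name', 'language', or None.

    Hypothetical helper; it only inspects the fields shown in the examples above.
    """
    if mention.get("wikidataId") == R_WIKIDATA_ID:
        # The mention itself is R, e.g. "R statistical software".
        return "software-name"
    language = mention.get("language") or {}
    if language.get("wikidataId") == R_WIKIDATA_ID:
        # R only appears as the language of another mention, e.g. "R scripts".
        return "language"
    return None


# Trimmed copies of the records quoted in this thread.
examples = [
    {"software-type": "environment", "wikidataId": "Q206904",
     "software-name": {"rawForm": "R statistics"}},
    {"software-type": "implicit", "wikidataId": "Q187432",
     "software-name": {"rawForm": "scripts"},
     "language": {"rawForm": "R", "wikidataId": "Q206904"}},
    {"software-type": "software", "wikidataId": "Q60755534",
     "software-name": {"rawForm": "tidyverse"}},
]

for mention in examples:
    print(mention["software-name"]["rawForm"], "->", r_role(mention))
```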
This one below seems a bit odd to me. Do you interpret this as the R package Rgetmstatistic, which is running in the environment R-language (Q206904)? But it has software-name.normalizedForm = "R package" rather than Rgetmstatistic, which is showing up as the publisher. Likely this one is an error?
Can you help find a canonical extraction of an R package and show the schema?
{
"type": "software",
"software-type": "environment",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.6317,
"software-name": {
"rawForm": "R package",
"normalizedForm": "R package",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.6317,
"offsetStart": 17,
"offsetEnd": 26
},
"publisher": {
"rawForm": "Rgetmstatistic",
"normalizedForm": "Rgetmstatistic",
"offsetStart": 28,
"offsetEnd": 42
},
"context": "hod. Additionally, an R package (Rgetmstatistic) for getmstatistic has been develo",
...
}
I grep'd for tidyverse and found this mention. My read is that we don't specify anything about tidyverse being related to the R-language because it isn't mentioned in the text, right? So a user would need an external source for that knowledge (package list, or tracking via Wikidata from Q60755534 to Q206904?)
{
"type": "software",
"software-type": "software",
"wikidataId": "Q60755534",
"wikipediaExternalRef": 59164935,
"lang": "en",
"confidence": 0.7712,
"software-name": {
"rawForm": "tidyverse",
"normalizedForm": "tidyverse",
"wikidataId": "Q60755534",
"wikipediaExternalRef": 59164935,
"lang": "en",
"confidence": 0.7712,
"offsetStart": 22,
"offsetEnd": 31
},
"context": "[53] with the package tidyverse [54]. Shannon\u2019s H and Simpson\u2019s D were calculated using the Biodiversity R package [55]. Plots were produced with the R package ggplot2 [56]. Regression lines in Figs 6 and 7 were smoothed using Locally Weighted Scatterplot Smoothing (LOESS) using provided by ggplot2 option \u2018goem_smooth(method = \u2018loess\u2019)."
}
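One way to supply that external knowledge is a plain lookup against a package list. The sketch below is only an illustration: KNOWN_R_PACKAGES is a tiny hand-written stand-in (an assumption) for a real package index, and environment_for is a hypothetical helper name. It compares the normalizedForm exactly, since R package names are case-sensitive.

```python
R_WIKIDATA_ID = "Q206904"  # Wikidata entity for R, as quoted in this thread

# Illustrative stand-in for a real package list (assumption, not a CRAN export).
KNOWN_R_PACKAGES = frozenset({"tidyverse", "ggplot2", "quantreg", "truncSP"})


def environment_for(mention, known_packages=KNOWN_R_PACKAGES):
    """Return the Wikidata ID of R if the mention names a known R package."""
    name = mention.get("software-name", {}).get("normalizedForm")
    return R_WIKIDATA_ID if name in known_packages else None


tidyverse = {
    "software-type": "software",
    "wikidataId": "Q60755534",
    "software-name": {"rawForm": "tidyverse", "normalizedForm": "tidyverse"},
}
print(environment_for(tidyverse))
```

A lookup like this links tidyverse to R even though the sentence never says so, at the cost of false positives when a package name is also an ordinary word or the name of unrelated software.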
Thank you, James!
These examples are quite helpful, especially since the JSON part "language" is not mentioned in the schema. Also, I guess the distinction between language and software is not that clean. R is both software and a language. I wouldn't define a package as software, especially since one uses packages in Stata as well, but in general you would mention Stata as the software and not the package.
Unfortunately, I am not able to share a file from my data since we scraped the data ourselves, but below is an extraction from this article (I removed the pages part). Some attributes indicating that these are packages in R would be nice!
{
"application":"software-mentions",
"version":"0.7.3-SNAPSHOT",
"date":"2023-09-11T09:42+0000",
"md5":"862FF5201835CF3D498D742B850D7C22",
"mentions":[
{
"type":"software",
"software-type":"component",
"software-name":{
"rawForm":"truncSP",
"normalizedForm":"truncSP",
"offsetStart":40,
"offsetEnd":47,
"boundingBoxes":[
{
"p":9,
"x":253.468,
"y":525.231,
"w":30.2708,
"h":7.3804
}
]
},
"context":"These are calculated with the R package truncSP (Karlsson and Lindmark 2014).",
"mentionContextAttributes":{
"used":{
"value":true,
"score":0.9997476935386658
},
"created":{
"value":false,
"score":3.337860107421875e-06
},
"shared":{
"value":false,
"score":1.1920928955078125e-07
}
},
"documentContextAttributes":{
"used":{
"value":true,
"score":0.9997476935386658
},
"created":{
"value":false,
"score":3.337860107421875e-06
},
"shared":{
"value":false,
"score":1.1920928955078125e-07
}
},
"references":[
{
"label":"(Karlsson and Lindmark 2014)",
"normalizedForm":"Karlsson and Lindmark 2014",
"refKey":33,
"offsetStart":20899,
"offsetEnd":20928,
"boundingBoxes":[
{
"p":9,
"x":285.556,
"y":525.231,
"w":89.98570000000001,
"h":7.380400000000009
},
{
"p":9,
"x":78.6896,
"y":535.209,
"w":18.743250000000003,
"h":7.380400000000009
}
]
}
]
},
{
"type":"software",
"software-type":"component",
"software-name":{
"rawForm":"quantreg",
"normalizedForm":"quantreg",
"offsetStart":43,
"offsetEnd":51,
"boundingBoxes":[
{
"p":11,
"x":262.029,
"y":515.312,
"w":31.781599999999997,
"h":7.3804
}
]
},
"context":"8. These are calculated with the R package quantreg.",
"mentionContextAttributes":{
"used":{
"value":true,
"score":0.9996234178543091
},
"created":{
"value":false,
"score":2.1457672119140625e-06
},
"shared":{
"value":false,
"score":1.1920928955078125e-07
}
},
"documentContextAttributes":{
"used":{
"value":true,
"score":0.9996234178543091
},
"created":{
"value":false,
"score":2.1457672119140625e-06
},
"shared":{
"value":false,
"score":1.1920928955078125e-07
}
}
}
],
"references":[
{
"refKey":33,
"tei":"<biblStruct xml:id=\"b33\">\n\t<analytic>\n\t\t<title level=\"a\" type=\"main\"><b>truncSP</b>: An<i>R</i>Package for Estimation of Semi-Parametric Truncated Linear Regression Models</title>\n\t\t<author>\n\t\t\t<persName><forename type=\"first\">Maria</forename><surname>Karlsson</surname></persName>\n\t\t</author>\n\t\t<author>\n\t\t\t<persName><forename type=\"first\">Anita</forename><surname>Lindmark</surname></persName>\n\t\t</author>\n\t\t<idno type=\"DOI\">10.18637/jss.v057.i14</idno>\n\t</analytic>\n\t<monogr>\n\t\t<title level=\"j\">Journal of Statistical Software</title>\n\t\t<title level=\"j\" type=\"abbrev\">J. Stat. Soft.</title>\n\t\t<idno type=\"ISSNe\">1548-7660</idno>\n\t\t<imprint>\n\t\t\t<biblScope unit=\"volume\">57</biblScope>\n\t\t\t<biblScope unit=\"issue\">14</biblScope>\n\t\t\t<biblScope unit=\"page\" from=\"1\" to=\"19\" />\n\t\t\t<date type=\"published\" when=\"2014\">2014</date>\n\t\t\t<publisher>Foundation for Open Access Statistic</publisher>\n\t\t</imprint>\n\t</monogr>\n</biblStruct>\n"
}
],
"runtime":55925,
"id":"1fe9a0f6b03de6739591627056b8f6baa17fbd79",
"metadata":{
"id":"1fe9a0f6b03de6739591627056b8f6baa17fbd79"
},
"original_file_path":"qje_131_4_4.pdf",
"file_name":"qje_131_4_4.pdf"
}
Yes, I see your point. We need to document language and software-type = "component" and make it clear what should be there.
Perhaps some regex post-processing could help you? Look at the context and identify "R package" (then create the language section?). Or do a lookup against CRAN?
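A rough Python sketch of the regex idea follows. It is a workaround under stated assumptions, not something the recognizer does: the cue pattern, the helper names, and the "inferred-language" key are all made up for illustration, and the key is deliberately different from the tool's own language field so that inferred values cannot be confused with extracted ones. The pattern matches the raw name in the context string rather than relying on offsets.

```python
import re

# Cue such as "R package" or "R library" (assumed wording, extend as needed).
R_CUE = r"R[- ](?:package|library)"


def named_as_r_package(mention):
    """True if the context says 'R package <name>' or '<name> R package'."""
    name = re.escape(mention["software-name"]["rawForm"])
    pattern = rf"\b{R_CUE}\s+\(?{name}|{name}\)?\s+{R_CUE}\b"
    return re.search(pattern, mention.get("context", "")) is not None


def with_inferred_language(mention):
    """Return a copy carrying a hypothetical 'inferred-language' entry for R."""
    if "language" in mention or not named_as_r_package(mention):
        return mention
    inferred = dict(mention)
    inferred["inferred-language"] = {"normalizedForm": "R", "wikidataId": "Q206904"}
    return inferred


truncsp = {
    "software-type": "component",
    "software-name": {"rawForm": "truncSP", "normalizedForm": "truncSP"},
    "context": "These are calculated with the R package truncSP (Karlsson and Lindmark 2014).",
}
print(with_inferred_language(truncsp).get("inferred-language"))
```

The check only fires when the cue sits directly next to the package name, so a sentence like "with the package tidyverse" is left untouched and would still need the CRAN lookup.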
This section should explain what we mean by software (these are the annotation guidelines for the underlying gold standard set).
https://github.com/softcite/software-mentions/blob/master/doc/annotation_schema.md
Hi @Irazall! Which version of the Softcite mention recognizer are you using?
In principle, "R" should be recognized as software "environment" type and the package as software "component" type (running in the R software environment).
For example, when I try your sentences in isolation on the online demo https://cloud.science-miner.com/software/, I obtain:
{
"application": "software-mentions",
"version": "0.8.0-SNAPSHOT",
"date": "2023-09-16T21:35+0000",
"mentions": [
{
"type": "software",
"software-type": "environment",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.5056,
"software-name": {
"rawForm": "R",
"normalizedForm": "R",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.5056,
"offsetStart": 30,
"offsetEnd": 31
},
"context": "These are calculated with the R package truncSP (Karlsson and Lindmark 2014).",
"mentionContextAttributes": {
"used": {
"value": true,
"score": 0.9997476935386658
},
"created": {
"value": false,
"score": 0.000003337860107421875
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
},
"documentContextAttributes": {
"used": {
"value": true,
"score": 0.9997476935386658
},
"created": {
"value": false,
"score": 0.000003337860107421875
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
}
},
{
"type": "software",
"software-type": "component",
"software-name": {
"rawForm": "truncSP",
"normalizedForm": "truncSP",
"offsetStart": 40,
"offsetEnd": 47
},
"context": "These are calculated with the R package truncSP (Karlsson and Lindmark 2014).",
"mentionContextAttributes": {
"used": {
"value": true,
"score": 0.9997476935386658
},
"created": {
"value": false,
"score": 0.000003337860107421875
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
},
"documentContextAttributes": {
"used": {
"value": true,
"score": 0.9997476935386658
},
"created": {
"value": false,
"score": 0.000003337860107421875
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
}
}
],
"runtime": 1662
}
{
"application": "software-mentions",
"version": "0.8.0-SNAPSHOT",
"date": "2023-09-16T21:36+0000",
"mentions": [
{
"type": "software",
"software-type": "environment",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.5056,
"software-name": {
"rawForm": "R",
"normalizedForm": "R",
"wikidataId": "Q206904",
"wikipediaExternalRef": 376707,
"lang": "en",
"confidence": 0.5056,
"offsetStart": 33,
"offsetEnd": 34
},
"context": "8. These are calculated with the R package quantreg.",
"mentionContextAttributes": {
"used": {
"value": true,
"score": 0.9996234178543091
},
"created": {
"value": false,
"score": 0.0000021457672119140625
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
},
"documentContextAttributes": {
"used": {
"value": true,
"score": 0.9996234178543091
},
"created": {
"value": false,
"score": 0.0000021457672119140625
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
}
},
{
"type": "software",
"software-type": "component",
"software-name": {
"rawForm": "quantreg",
"normalizedForm": "quantreg",
"offsetStart": 43,
"offsetEnd": 51
},
"context": "8. These are calculated with the R package quantreg.",
"mentionContextAttributes": {
"used": {
"value": true,
"score": 0.9996234178543091
},
"created": {
"value": false,
"score": 0.0000021457672119140625
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
},
"documentContextAttributes": {
"used": {
"value": true,
"score": 0.9996234178543091
},
"created": {
"value": false,
"score": 0.0000021457672119140625
},
"shared": {
"value": false,
"score": 1.1920928955078125e-7
}
}
}
],
"runtime": 912
}
I could not use the JSTOR version of the article, but I processed this one: https://www.econstor.eu/bitstream/10419/109714/1/820510270.pdf
(truncSP is not mentioned in this PDF version, only quantreg)
@Irazall @jameshowison This illustrates how "R" and the associated package should appear in the results if everything is working well. "R" is always used as "software environment" (running the R package) and not just as language attribute.
In the latest recognizer version (on docker grobid/software-mentions:0.8.0-SNAPSHOT), results are not bad as you can see above, but there's room for improvement in the recognition of R packages (we always expect 2 software mentions, R environment + R package).
I observed that the extractions for "R" as environment are sometimes overlooked by the ML model, because the current version of the annotated corpus contains pretty old documents (often before 2010) and R was not used that much at the time. So "R" packages in general are currently under-represented and the recognition is not as robust and good as we would expect.
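The "R environment + R package" expectation can be checked on existing output with a small Python sketch like the one below. The function name is hypothetical, and the check is deliberately crude: it groups mentions by their context string and flags every "component" that has no R "environment" mention in the same context, so components belonging to other environments would be flagged as well.

```python
from collections import defaultdict

R_WIKIDATA_ID = "Q206904"  # Wikidata entity for R, as quoted in this thread


def components_missing_r_environment(mentions):
    """List component names whose context has no R environment mention."""
    by_context = defaultdict(list)
    for mention in mentions:
        by_context[mention.get("context", "")].append(mention)

    missing = []
    for group in by_context.values():
        has_r_environment = any(
            m.get("software-type") == "environment"
            and m.get("wikidataId") == R_WIKIDATA_ID
            for m in group
        )
        if not has_r_environment:
            missing.extend(
                m["software-name"]["rawForm"]
                for m in group
                if m.get("software-type") == "component"
            )
    return missing


context = "8. These are calculated with the R package quantreg."
# Trimmed versions of the two outputs shown in this thread.
only_component = [
    {"software-type": "component", "software-name": {"rawForm": "quantreg"},
     "context": context},
]
with_environment = only_component + [
    {"software-type": "environment", "wikidataId": "Q206904",
     "software-name": {"rawForm": "R"}, "context": context},
]
print(components_missing_r_environment(only_component))
print(components_missing_r_environment(with_environment))
```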
Perhaps some regex post processing could help you? Look at the context and identify "R package" (then create the language section?). Or do a lookup against CRAN?
I don't like regex; it's like hacking rather than really addressing the problem! And it's hard to maintain over time. The next version of the Softcite corpus will contain more recent additional documents with more occurrences of R packages, so hopefully the R environment and packages will be recognized more reliably and systematically.
Thank you all for your valuable insights!
I use version 0.7.3. Maybe this explains the differences.
In the meantime, I am indeed helping myself with some regex on the context part of the JSON output. As this is actually only one figure in a bigger landscape paper about economics, this should be fine, but I wanted to point you in the direction of R, as this language (or environment ;) ) keeps getting bigger and thus needs more attention.