Anna Review
Closed this issue · 0 comments
It is a great addition to rOpenSci and general movements towards both linked data and better curation, visibility and citability of software. Overall, the functions are smooth and easy to use. I think the most difficult part of the package is getting your head round the concepts. There is good
codemeta
andJSON-LD
documentation to whichcodemetar
documentation links to (I love Manu's videos!). I also realise that the purpose of package documentation is mainly to demonstrate use and not necessarily educate on the background. However I feel that a small amount of extra explanation and jargon busting could really help users get their head round what's going on and why it's such an important and awesome initiative!
Yay! Thanks so much. Yes, have tried to do this a bit more now, see below.
As mentioned above, here are some suggestions on how I feel the readme could be more informative and self contained:
I would have loved a short background section in the intro that could include a few key definitions (which could then be used consistently throughout) and a touch of historical context: eg. (sorry this are probably rubbish definitions but hopefully you get the gist!)
- Linked data: data that has a context which links the fields used in the data to an online agreed standard.
- context: mapping between a data source fields and a schema. Usually schema.org but domain specific ones also (eg codemeta)
Briefly explain the difference between the data types (ie json, codemeta json-ld, codemeta r list) so that users can be cognisant of them when using package.
Describe how project is the convergence of a number of initiatives:
- Schema.org: the main initiative to link data on the web through a catalogue of standard metadata fields
- codemeta: a later inititative to formalise the metadata fields included in typical software metadata records and introduce important fields that did not have clear equivalents. The codemeta crosswalk provides an explicit map between the metadata fields used by a broad range of software repositories, registries and archives
- JSON-LD: the data type enabling crosswalk through embedding contexts into the data itself.
- codemetar: mapping fields used to describe packages in r to the standard fields agreed in schema & codemeta <- consesus schema
Excellent suggestions!! Added all of this to both the (newly create) top-level documentation (#25) and the intro vignette.
Let's just say this package could act as a gateway drug to JSON-LD, it certainly has for me!
❤️ ❤️ ❤️
function documentation
codemeta_validate
- I think the description might be incomplete? (ie ...verifying that the result.... matches or conforms to sth?).
codemeta
argument description only mentions path/filename. The function, however, is geared up to accept aJSON-LD
, functionality demonstrated in the crosswalking vignette. So should probably be listed in the help file also.
Added to docs.
crosswalk
- Description in help file could be a touch more informative, eg: Crosswalk between different metadata fields used by different repositories, registries and archives. For more details see here.
- Also suggest that where the crosswalk table is mentioned, a link to it be supplied. e.g.
from
-> the corresponding column name from the crosswalk table.
Added, thanks!
- My understanding is that what is refered to here as a JSON list is the same as what is refered to in
create_codemeta
documentation as codemeta object (list)? I think standardisation of the terminology refering to different datatypes throughout documentation would be best.
Yeah, agree this is confusing. Technically they are both list
-class objects, but here it doesn't have to be a list representing 'codemeta' json, it could be other json.
This kind of relates to the bigger issue of how to refer to these things: JSON data in R can be in an external file (path or url), a string, or an R-list format. We try and make most of the functions agnostic to this, but it's still confusing.
Overall function documentation comments:
I wonder if it could be useful to make a distinction in the function names functions that work with codemeta JSON and codemeta r list objects. Let's say we associated
codemeta
with the JSON-LD format andcodemetar
with r lists. Using these associations in function names could make it a bit more explicit to the user. E.g. Functionswrite_codemeta
andvalidate_codemeta
would remain the same because they either output or work on JSON-LD formats butcreate_codemeta
could becomecreate_codemetar
to indicate that the output is an r list? I only mention it because it confused me a little at times but appreciate this is no biggie.
Yup, we've tried to clarify the write
/ create
distinction better in the function descriptions now.
Vignettes
Good scope and demonstration of utility of package. A few suggestion that as a novice to the concepts would have helped me a little.
In the index of vignettes, I think the sequence would work better as:
- intro
- Translating between schema using JSON-LD
- Validation in JSON-LD
- Parsing CodeMeta Data
and include a really short description of what each vignette contains (like a sentence)
Done! Note: the only way I could find to order vignettes was to prefix letters A, B, C, D to the names. Numeric codes, 01, 02, 03, throw a WARNING
in check. Any other solution would be greatly appreciated.
Intro
Really like ability to add through the DESCRIPTION file. Where you mention
See the DESCRIPTION file of the codemetar package for an example.
it could include an actual link to the DESCRIPTION file.
Done! (In README as well). *Side note: this section now also describes the CRAN-approved / compliant format for adding arbitrary schema.org terms into DESCRIPTION files).
Translating between schema using JSON-LD
I found this paragraph confusing and have suggested how it might make the concept clearer. Also I think URI needs a short definition.
Unrecognized properties are dropped, since there is no consensus context into which we can expand them. Second, the expanded terms are then compacted down into the new context (Zenodo in this case.) This time, any terms that are not part of the
codemetaZenodo context are kept,but not compacted,since they still have meaningful contexts (that is, full URIs) that can be associated with them, but not compacted:
Link added and paragraph revised, thanks!
Validation in JSON-LD
Motivating example slightly confusing because while this sentence mentions an error being thrown up, all that is returned in r is a
NULL
.However, there’s other data that is missing in our example that could potentially cause problems for our application. For instance, our first author lists no affiliation, so the following code throws an error:
Then when framing, the value to associate with the field is data missing is also
NULL
. I appreciate that the real value in the process is that the JSON-LD now contains and explicit field that contains a@null
value but it might be worth spelling it out because the actual output in r pre & post framing are the same, ieNULL
.
Yeah, clearly this isn't a good motivating example since R is basically already inferring the NULL correctly, just like you say. I think the other cases are a bit clearer so I've just removed this case.
A few super minor typos which I've corrected and happy to send through a pull request?
That would be awesome, thanks!
functions
I found out (well should have known) on a plane when I planned to work on this review that the functions require an internet connection. It actually got me wondering whether
internet-connection
might be a good thing to list more generally as part of software requirements?
Yeah, though this really applies primarily to the crosswalk function (where it is now mentioned) and then the json-ld functions from the vignettes; which aren't technically package functions. The JSON-LD functions only need the internet if/when the context
file is given as a remote URL; technically one can just embed the literal context file in the context
element, and then there's no need to resolve anything.
crosswalk
While it is relatively straight forward to get the crosswalk .csv, I feel it'd be good to be able to access information through the package. Here are some suggestions based on what I would find personally useful:
- At the very least to have a function that just fetches the
.csv
.- Moving beyond that it could be useful to have a few helper functions to quickly interrogate it?
+ - I'd find it quite useful to quickly get the options for argumentsto
andfrom
incrosswalk
. > Could be cool to have another function egcrosswalks
that prints the available crosswalk column options, eg:
library(readr)
crosswalks <- function(){
df <-
readr::read_csv(
"https://github.com/codemeta/codemeta/raw/master/crosswalk.csv",
col_types = cols(.default = "c"))
names(df)[!names(df) %in% c("Parent Type", "Property", "Type", "Description")]
}
crosswalks()
#> [1] "codemeta-V1"
#> [2] "DataCite"
#> [3] "OntoSoft"
#> [4] "Zenodo"
#> [5] "GitHub"
#> [6] "Figshare"
#> [7] "Software Ontology"
#> [8] "Software Discovery Index"
#> [9] "Dublin Core"
#> [10] "R Package Description"
#> [11] "Debian Package"
#> [12] "Python Distutils (PyPI)"
#> [13] "Trove Software Map"
#> [14] "Perl Module Description (CPAN::Meta)"
#> [15] "NodeJS"
#> [16] "Java (Maven)"
#> [17] "Octave"
#> [18] "Ruby Gem"
#> [19] "ASCL"
#> [20] "DOAP"
#> [21] "Wikidata"
- I also found the non-exported function
crosswalk_table
quite useful (some commented out code in there). Other's might too?
Nice. But your suggestion below is even better, so I've actually just renamed the function you give below as crosswalk_table
, since it serves both that same role (if to=NULL
), and the purpose you illustrate!
- But I feel the most useful would be to be able to narrow down field mappings between particular repositories of interest. So building on the
crosswalk_table
function, I would probably find the following functions quite useful:
library(readr)
crosswalk_map <- function(from, to,
full_crosswalk =
"https://github.com/codemeta/codemeta/raw/master/crosswalk.csv",
trim = FALSE){
df <-
readr::read_csv(full_crosswalk,
col_types = cols(.default = "c"))
df <- df[c("Property", from, to)]
if(trim) df <- df[!is.na(df[,from]),] # trim to `from` argument fields
df
}
crosswalk_map(from = "GitHub", to = c("Zenodo", "Figshare"), trim = T)
#> # A tibble: 11 x 4
#> Property GitHub Zenodo Figshare
#> <chr> <chr> <chr> <chr>
#> 1 codeRepository html_url relatedLink relatedLink
#> 2 programmingLanguage languages_url <NA> <NA>
#> 3 downloadUrl archive_url <NA> <NA>
#> 4 author login creators <NA>
#> 5 dateCreated created_at <NA> <NA>
#> 6 dateModified updated_at <NA> <NA>
#> 7 license license license License
#> 8 description description description/notes Description
#> 9 identifier id id <NA>
#> 10 name full_name title Title
#> 11 issueTracker issues_url <NA> <NA>
Great suggestion, this is now the (exported) function, crosswalk_table()
.
- BTW, it seems that
example_json_r_list %>% crosswalk("R Package Description")
is the only way to get from aJSON-LD r list
to aJSON-LD
in r, astoJSON
doesn't work. While its great that there is a way to do it, it almost feels a bit of a hack. At the very least I feel it should be included as an example in the vignette, so users are aware of it but I'm wondering if an explicit function for that might also be useful?
Hmm, toJSON
should work: create_codemeta() %>% toJSON()
seems happy for me. Can you give an example? Moving between list representation, string representation & external file representation is a bit confusing, but should be handled okay by the standard JSON & JSON-LD tools and not be specific to any codemeta
stuff...
write_codemeta
- When writing the file into a package (ie when
DESCRIPTION
is detected), adding"codemeta.json"
to.Rbuildignore
assumes that the user has not changed thepath
. While it is advised in the help file to leave as default, as the user can change it, there could theoretically be a situation where the user has called it something else but the function has written"codemeta.json"
to.Rbuildignore
. Just wondering whether that would cause any problems downstream?
Interesting question. The function simply uses the devtools
routine to add the file to the .Rbuildignore
list, which is already used by many other devtools
functions, so I figured it was better to be consistent with that function than attempt a custom solution. If this really is an issue I imagine it would first be encountered and patched upstream and we'd get the benefits that way.
- Also when supplying a
JSON-LD r list
instead of a path topkg
, function works but throws a warning:the condition has length > 1 and only the first element will be used
Thanks, should be fixed now.
Compliance with rOpenSci Packaging Guide
Contributing
- No contributing.md or anything about contributing in the README. However does include a good code of conduct.
Added
You can see more about my review here