ropensci/codemetar

Anna Review

Closed this issue · 0 comments

It is a great addition to rOpenSci and general movements towards both linked data and better curation, visibility and citability of software. Overall, the functions are smooth and easy to use. I think the most difficult part of the package is getting your head round the concepts. There is good codemeta and JSON-LD documentation to which codemetar documentation links to (I love Manu's videos!). I also realise that the purpose of package documentation is mainly to demonstrate use and not necessarily educate on the background. However I feel that a small amount of extra explanation and jargon busting could really help users get their head round what's going on and why it's such an important and awesome initiative!

Yay! Thanks so much. Yes, have tried to do this a bit more now, see below.

As mentioned above, here are some suggestions on how I feel the readme could be more informative and self contained:
I would have loved a short background section in the intro that could include a few key definitions (which could then be used consistently throughout) and a touch of historical context: eg. (sorry this are probably rubbish definitions but hopefully you get the gist!)

  • Linked data: data that has a context which links the fields used in the data to an online agreed standard.
  • context: mapping between a data source fields and a schema. Usually schema.org but domain specific ones also (eg codemeta)

Briefly explain the difference between the data types (ie json, codemeta json-ld, codemeta r list) so that users can be cognisant of them when using package.

Describe how project is the convergence of a number of initiatives:
  • Schema.org: the main initiative to link data on the web through a catalogue of standard metadata fields
  • codemeta: a later inititative to formalise the metadata fields included in typical software metadata records and introduce important fields that did not have clear equivalents. The codemeta crosswalk provides an explicit map between the metadata fields used by a broad range of software repositories, registries and archives
  • JSON-LD: the data type enabling crosswalk through embedding contexts into the data itself.
  • codemetar: mapping fields used to describe packages in r to the standard fields agreed in schema & codemeta <- consesus schema

Excellent suggestions!! Added all of this to both the (newly create) top-level documentation (#25) and the intro vignette.

Let's just say this package could act as a gateway drug to JSON-LD, it certainly has for me!

❤️ ❤️ ❤️

function documentation

codemeta_validate
  • I think the description might be incomplete? (ie ...verifying that the result.... matches or conforms to sth?).
  • codemeta argument description only mentions path/filename. The function, however, is geared up to accept a JSON-LD, functionality demonstrated in the crosswalking vignette. So should probably be listed in the help file also.

Added to docs.

crosswalk
  • Description in help file could be a touch more informative, eg: Crosswalk between different metadata fields used by different repositories, registries and archives. For more details see here.
  • Also suggest that where the crosswalk table is mentioned, a link to it be supplied. e.g. from -> the corresponding column name from the crosswalk table.

Added, thanks!

  • My understanding is that what is refered to here as a JSON list is the same as what is refered to in create_codemeta documentation as codemeta object (list)? I think standardisation of the terminology refering to different datatypes throughout documentation would be best.

Yeah, agree this is confusing. Technically they are both list-class objects, but here it doesn't have to be a list representing 'codemeta' json, it could be other json.

This kind of relates to the bigger issue of how to refer to these things: JSON data in R can be in an external file (path or url), a string, or an R-list format. We try and make most of the functions agnostic to this, but it's still confusing.

Overall function documentation comments:

I wonder if it could be useful to make a distinction in the function names functions that work with codemeta JSON and codemeta r list objects. Let's say we associated codemeta with the JSON-LD format and codemetar with r lists. Using these associations in function names could make it a bit more explicit to the user. E.g. Functions write_codemeta and validate_codemeta would remain the same because they either output or work on JSON-LD formats but create_codemeta could become create_codemetar to indicate that the output is an r list? I only mention it because it confused me a little at times but appreciate this is no biggie.

Yup, we've tried to clarify the write / create distinction better in the function descriptions now.

Vignettes

Good scope and demonstration of utility of package. A few suggestion that as a novice to the concepts would have helped me a little.

In the index of vignettes, I think the sequence would work better as:

  1. intro
  2. Translating between schema using JSON-LD
  3. Validation in JSON-LD
  4. Parsing CodeMeta Data

and include a really short description of what each vignette contains (like a sentence)

Done! Note: the only way I could find to order vignettes was to prefix letters A, B, C, D to the names. Numeric codes, 01, 02, 03, throw a WARNING in check. Any other solution would be greatly appreciated.

Intro

Really like ability to add through the DESCRIPTION file. Where you mention

See the DESCRIPTION file of the codemetar package for an example.
it could include an actual link to the DESCRIPTION file.

Done! (In README as well). *Side note: this section now also describes the CRAN-approved / compliant format for adding arbitrary schema.org terms into DESCRIPTION files).

Translating between schema using JSON-LD

I found this paragraph confusing and have suggested how it might make the concept clearer. Also I think URI needs a short definition.

Unrecognized properties are dropped, since there is no consensus context into which we can expand them. Second, the expanded terms are then compacted down into the new context (Zenodo in this case.) This time, any terms that are not part of the codemeta Zenodo context are kept, but not compacted, since they still have meaningful contexts (that is, full URIs) that can be associated with them, but not compacted:

Link added and paragraph revised, thanks!

Validation in JSON-LD

Motivating example slightly confusing because while this sentence mentions an error being thrown up, all that is returned in r is a NULL.

However, there’s other data that is missing in our example that could potentially cause problems for our application. For instance, our first author lists no affiliation, so the following code throws an error:

Then when framing, the value to associate with the field is data missing is also NULL. I appreciate that the real value in the process is that the JSON-LD now contains and explicit field that contains a @null value but it might be worth spelling it out because the actual output in r pre & post framing are the same, ie NULL.

Yeah, clearly this isn't a good motivating example since R is basically already inferring the NULL correctly, just like you say. I think the other cases are a bit clearer so I've just removed this case.

A few super minor typos which I've corrected and happy to send through a pull request?

That would be awesome, thanks!

functions

I found out (well should have known) on a plane when I planned to work on this review that the functions require an internet connection. It actually got me wondering whether internet-connection might be a good thing to list more generally as part of software requirements?

Yeah, though this really applies primarily to the crosswalk function (where it is now mentioned) and then the json-ld functions from the vignettes; which aren't technically package functions. The JSON-LD functions only need the internet if/when the context file is given as a remote URL; technically one can just embed the literal context file in the context element, and then there's no need to resolve anything.

crosswalk

While it is relatively straight forward to get the crosswalk .csv, I feel it'd be good to be able to access information through the package. Here are some suggestions based on what I would find personally useful:

  • At the very least to have a function that just fetches the .csv.
  • Moving beyond that it could be useful to have a few helper functions to quickly interrogate it?
    + - I'd find it quite useful to quickly get the options for arguments to and from in crosswalk. > Could be cool to have another function eg crosswalks that prints the available crosswalk column options, eg:
library(readr)
crosswalks <- function(){
    df <-
        readr::read_csv(
            "https://github.com/codemeta/codemeta/raw/master/crosswalk.csv",
            col_types = cols(.default = "c"))
    names(df)[!names(df) %in% c("Parent Type", "Property", "Type", "Description")]
}

crosswalks()
#>  [1] "codemeta-V1"                         
#>  [2] "DataCite"                            
#>  [3] "OntoSoft"                            
#>  [4] "Zenodo"                              
#>  [5] "GitHub"                              
#>  [6] "Figshare"                            
#>  [7] "Software Ontology"                   
#>  [8] "Software Discovery Index"            
#>  [9] "Dublin Core"                         
#> [10] "R Package Description"               
#> [11] "Debian Package"                      
#> [12] "Python Distutils (PyPI)"             
#> [13] "Trove Software Map"                  
#> [14] "Perl Module Description (CPAN::Meta)"
#> [15] "NodeJS"                              
#> [16] "Java (Maven)"                        
#> [17] "Octave"                              
#> [18] "Ruby Gem"                            
#> [19] "ASCL"                                
#> [20] "DOAP"                                
#> [21] "Wikidata"
  • I also found the non-exported function crosswalk_table quite useful (some commented out code in there). Other's might too?

Nice. But your suggestion below is even better, so I've actually just renamed the function you give below as crosswalk_table, since it serves both that same role (if to=NULL), and the purpose you illustrate!

  • But I feel the most useful would be to be able to narrow down field mappings between particular repositories of interest. So building on the crosswalk_table function, I would probably find the following functions quite useful:
library(readr)
crosswalk_map <- function(from, to, 
                            full_crosswalk =
  "https://github.com/codemeta/codemeta/raw/master/crosswalk.csv",
  trim = FALSE){
  df <-
    readr::read_csv(full_crosswalk,
             col_types = cols(.default = "c"))
  df <- df[c("Property", from, to)]
  if(trim) df <- df[!is.na(df[,from]),] # trim to `from` argument fields
  df
}

crosswalk_map(from = "GitHub", to = c("Zenodo", "Figshare"), trim = T)
#> # A tibble: 11 x 4
#>               Property        GitHub            Zenodo    Figshare
#>                  <chr>         <chr>             <chr>       <chr>
#>  1      codeRepository      html_url       relatedLink relatedLink
#>  2 programmingLanguage languages_url              <NA>        <NA>
#>  3         downloadUrl   archive_url              <NA>        <NA>
#>  4              author         login          creators        <NA>
#>  5         dateCreated    created_at              <NA>        <NA>
#>  6        dateModified    updated_at              <NA>        <NA>
#>  7             license       license           license     License
#>  8         description   description description/notes Description
#>  9          identifier            id                id        <NA>
#> 10                name     full_name             title       Title
#> 11        issueTracker    issues_url              <NA>        <NA>

Great suggestion, this is now the (exported) function, crosswalk_table().

  • BTW, it seems that example_json_r_list %>% crosswalk("R Package Description") is the only way to get from a JSON-LD r list to a JSON-LD in r, as toJSON doesn't work. While its great that there is a way to do it, it almost feels a bit of a hack. At the very least I feel it should be included as an example in the vignette, so users are aware of it but I'm wondering if an explicit function for that might also be useful?

Hmm, toJSON should work: create_codemeta() %>% toJSON() seems happy for me. Can you give an example? Moving between list representation, string representation & external file representation is a bit confusing, but should be handled okay by the standard JSON & JSON-LD tools and not be specific to any codemeta stuff...

write_codemeta
  • When writing the file into a package (ie when DESCRIPTION is detected), adding "codemeta.json" to .Rbuildignore assumes that the user has not changed the path. While it is advised in the help file to leave as default, as the user can change it, there could theoretically be a situation where the user has called it something else but the function has written "codemeta.json" to .Rbuildignore. Just wondering whether that would cause any problems downstream?

Interesting question. The function simply uses the devtools routine to add the file to the .Rbuildignore list, which is already used by many other devtools functions, so I figured it was better to be consistent with that function than attempt a custom solution. If this really is an issue I imagine it would first be encountered and patched upstream and we'd get the benefits that way.

  • Also when supplying a JSON-LD r list instead of a path to pkg, function works but throws a warning: the condition has length > 1 and only the first element will be used

Thanks, should be fixed now.

Compliance with rOpenSci Packaging Guide

Contributing
  • No contributing.md or anything about contributing in the README. However does include a good code of conduct.

Added

You can see more about my review here