ropensci-archive/cleanEHR

JOSS review


Review Comments

This package allows users to extract and clean a unique form of clinical data from the CC-HIC datasets. Since the original dataset contains multiple complicated forms of data, starting everything from scratch could be a very complicated process. I can imagine this package being very helpful for many useRs in that system.

  • I personally think this package agrees with rOpenSci's fit policy and would be nice to have listed. However, I also want to note that maybe rOpenHealth fits this topic better?

We couldn't find much information about rOpenHealth online; is it a sub-branch of rOpenSci? We would like to know more about the organization. We are happy to be listed wherever is appropriate, provided no extra work is incurred.

I do have a few comments after trying out this package.

  • As @noamross mentioned, regarding the package name, I would recommend changing the name to cchicr or something similar instead of cleanEHR, due to the fact that this is only one particular form of electronic health record.

First of all, we do not aim to deal only with the CCHIC data set; our ambition is to use CCHIC to introduce a data processing standard for highly heterogeneous health care data. cleanEHR will serve as the key to reproducing our work on other highly heterogeneous databases. CCHIC is one theme of the National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC), which covers areas well beyond intensive care and will deal with data that differ in both structure and content. cleanEHR will look into how to accommodate other data sets, and even other types of data, in the future.

  • rOpenSci packaging guide recommends using xml2 instead of XML for parsing XML files. It seems to me that most of the XML-related code in this package only involves parsing. As a result, I would recommend using xml2 instead.

Too complicated.

  • This package frequently uses a very OO-ish programming style to define objects and write functions. For example, after a ccTable has been created, you can use cctable$export.csv("file.csv") to write its contents to a csv file (https://github.com/CC-HIC/cleanEHR/blob/master/R/ccTable.R#L206-L213). To me, due to my lack of understanding of the full scenario, it seems sort of like forcibly writing Python in R and is a little redundant. (I'm not saying it's a bad practice. My concern is mainly about the lack of documentation for these methods, as I said below.)
  • Also, it would be nice if we could see more documentation for these self-defined methods, or at least a table of available methods with brief explanations. In this case, those methods are effectively "exported functions" as well, even though they weren't exported officially. That's why I'm putting a ? mark on the checklist.
  • I would like to see more documentation about pipeline. I feel like the current documentation in R/pipeline.R and inst/pipeline isn't quite enough. It would be nice to see a README in inst/pipeline.

The pipeline has been removed from the package.

  • In the package vignette tour.Rmd, we see tb <- create.cctable(ccd, yaml.load(conf), freq=1). However, freq is actually the 2nd argument in that function. It might mislead some users in the future.

done

  • I feel like the author can put more instructions in the package vignette.

We have redone the vignette.

  • Final point, I'm not sure if it's a good idea but I feel like building a shiny app to deal with these tasks will be very helpful, especially to help users to understand what kinds of functionalities are available in this package.

Yes, we are building it.

Aammd

This is an intricate and impressive package! It involves data.table, yaml, S4 classes and XML; it is the All Dressed chip of data-accessing packages! Unfortunately, I am not much of an expert in data.table, nor in the creation and documentation of S4 classes. I have mostly reviewed for usability, documentation and some coding best practices.
Package Review

  • Perhaps the biggest opportunity for improvement lies in the package vignette. Here are some thoughts that came to mind as I was reading through it.

  • A clearer description of exactly what kind of data this is, and what the data is useful for, would be very helpful. I get the sense that this package is aimed at users who are already familiar with the dataset. Nevertheless, in my experience a reminder is still useful.

  • A clearer description of exactly how data is being transformed. This could be achieved simply, e.g. by using knitr::kable(head(dataset_name)) to show just the first few lines. This would help to add context to what is meant by "long" and "wide" format (which mean different things to different people).

  • The YAML code block in particular should be better documented. Which options are available? What does each part of this list do? A handy reference or "cheatsheet" will be a huge favour to all users.

  • I think the second-biggest opportunity is in the naming of the functions, methods and perhaps even the classes involved here.

  • The authors might consider using a prefix for every function in the package. Many successful packages use this strategy, which takes advantage of tab completion in text editors. Here, perhaps functions which use a certain class of object could be prefixed with that class name, ccd_episode_plot instead of episode.graph etc.

  • Some names and terminology have the potential to be confusing to users. For example sql.demographic.table has nothing to do with sql per se, but rather arranges tables such that SQL import would be possible.

  • Some names are probably only confusing for the programmers: there is a mix of cases used in function names (sql.demographic.table(ccd), extractIndexTable, unique_spell).
    I noticed that UpperCamelCase is apparently the convention for S4 classes: link

At the risk of seeming like a meat-based goodpractice, I wanted to echo two comments made above:

  • it is better to use TRUE and FALSE, in place of T and F.

  • it is also safer to use vapply in place of sapply or unlist(lapply).

  • I wonder if some of the functions could be split apart. For example, episode.graph creates both a plot and a rearrangement of the data. Perhaps the plotting code could be moved to a plot() method for the ccRecord class?

  • I note that all of Rcpp is listed as an Import, but only the function evalCpp was imported directly. Is loading the entire package necessary?

  • The Vignette contains an example of cleaning data, based on instructions in a YAML code block. create.cctable(ccd, yaml.load(conf), freq=1). I find using the yaml.load function here to be a bit awkward, and difficult for users who are unfamiliar with yaml. It might be easiest to encourage users to write a short file, conf.yaml, and to direct create.cctable to this filename, rather than to an R object.

  • I feel like pander and ggplot should be listed in Suggests, rather than Imports. Then users could use the basic functionality of this package without necessarily installing these packages. This might also entail separating where these functions are used from the data-cleaning that sometimes precedes them.

  • Some documentation, e.g. in ccRecord, refers to S3 classes when in fact these are S4 classes.

  • There seem to be some vestigial R scripts and other files in inst/sql and inst/report, could these be deleted, or perhaps moved to a development branch?

Warnings and automatic checks

  • I ran all the code in the Vignette, and found that sql.demographic.table(ccd) produced the following error for me (first sentence only): Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference.

  • I ran devtools::spell_check(), and found the following errors (among other trivial "errors" like variable names etc.):

spelling errors:

WORD FOUND IN
anonymisation description:5
charcter data.quality.report.Rd:15
chuncks update_database.Rd:15
conf ccTable-class.Rd:39,45
datam extractIndexTable.Rd:12
datat extractIndexTable.Rd:11
demg demg.distribution.Rd:5,16
demograhic lenstay.Rd:10
dfilter getfilter.Rd:5,15
doesn selectTable.Rd:16
lenght lenstay.Rd:5,20
organised new.episode.Rd:21
structre ccTable-class.Rd:10
Subseting sub-sub-ccRecord-method.Rd:6,16
yaml ccTable-class.Rd:54
YAML ccTable-class.Rd:19, create.cctable.Rd:14, create2dclean.Rd:12,23

R CMD CHECK returned this warning:

Undocumented data sets:
‘icnarc_table’
All user-level objects in a package should have documentation entries.
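
One way to address this would be a small roxygen2 block documenting the dataset (a sketch only; the title, description and format text below are placeholders the authors will want to replace):

```r
# In e.g. R/data.R -- roxygen2 turns this block into man/icnarc_table.Rd.
#' ICNARC reference table (placeholder title).
#'
#' Placeholder description of what icnarc_table contains and how it is used.
#'
#' @format A data object shipped with the package.
"icnarc_table"
```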

  • filename to create_cctable - no need to change because it can already load from a file path.
  • move episode.graph to plot() for ccRecord
  • re-name class functions
  • class function documentation - export it as a function
  • ccTable explanation - vignette
  • yaml conf cheatsheet
  • gp()
  • spell_check

JOSS Review

Dear reviewers,

Thank you so much for your very helpful comments, and we are very sorry to have taken such a long time to address them. We have made several restructurings of the code, inspired by your comments.

  • Pipeline removal: we would like to put our focus more on data cleaning and wrangling. CCHIC is just an example to demonstrate how we clean high-resolution, heterogeneous, longitudinal data. We have an ambition of making cleanEHR an extendable EHR data cleaning and wrangling platform. We feel the pipeline is no longer part of our main theme, hence it has been removed.
  • Vignette: we've made two separate vignettes, one describing what the CCHIC data is, and another explaining the data structures and demonstrating the data cleaning process step by step, including how to write the YAML configuration.
  • Further documentation: previously we only used docstrings to document the RefClass functions, which is not sufficient, as they are the main components of this package. We have added extra documentation for the member functions.
  • Naming: more consistent naming of functions and classes.

To Reviewer 1 @haozhu223

I personally think this package agrees with rOpenSci's fit policy and would be nice to have listed. However, I also want to note that maybe rOpenHealth fits this topic better?

We are not clear about the nature of rOpenHealth, since we didn't find much information about it online. Is it a subsidiary of rOpenSci? Can we be listed in both organisations? We would love to explore the possibilities here.

I do have a few comments after trying out this package. As @noamross mentioned, regarding the package name, I would recommend changing the name to cchicr or something similar instead of cleanEHR, due to the fact that this is only one particular form of electronic health record.

The critical care data set (CC-HIC) is one of the themes of the National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC). It has other themes which cover a wide range of health care topics: https://www.nihr.ac.uk/about-us/how-we-are-managed/our-structure/infrastructure/health-informatics-collaborative.htm
cleanEHR was designed as an extendable data cleaning tool. CCHIC is the first exemplar for it, and we plan to extend cleanEHR's functionality to other themes and to link with other data sources.

rOpenSci packaging guide recommends using xml2 instead of XML for parsing XML files. It seems to me that most of the XML-related code in this package only involves parsing. As a result, I would recommend using xml2 instead.

Thank you for the suggestion; you have a point. We do believe xml2 would be a better fit for the XML parser. However, the XML parser is the most complicated part of the code. The XML we receive from hospitals does not strictly follow the standard, so there are many corner cases that we have fixed over years of development. The XML parser is rather like a plugin which serves only the CCHIC dataset. Given the complexity of rewriting it with xml2, would it be OK if we leave the XML part as it is for now and redo it when the new XML standard is introduced in the coming years?
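
For reference, basic parsing would translate roughly as follows with xml2 (a minimal sketch using a made-up episode.xml layout; the real CCHIC schema and its corner cases are considerably more involved):

```r
library(xml2)

# Hypothetical layout: <episode><item id="..."><value>...</value></item>...</episode>
doc   <- read_xml("episode.xml")
items <- xml_find_all(doc, ".//item")
data.frame(
  id    = xml_attr(items, "id"),
  value = xml_text(xml_find_first(items, "./value")),
  stringsAsFactors = FALSE
)
```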

This package frequently uses a very OO-ish programming style to define objects and write functions. For example, after a ccTable has been created, you can use cctable$export.csv("file.csv") to write its contents to a csv file (https://github.com/CC-HIC/cleanEHR/blob/master/R/ccTable.R#L206-L213). To me, due to my lack of understanding of the full scenario, it seems sort of like forcibly writing Python in R and is a little redundant. (I'm not saying it's a bad practice. My concern is mainly about the lack of documentation for these methods, as I said below.)

Thanks for pointing it out. ccTable is a Reference Class (RefClass) which serves as the data cleaning and manipulation platform. Since the tables it holds are usually very large, we chose a RefClass to avoid multiple copies of the data. As we have adopted the OO approach, we would like to keep all ccTable-related functions, e.g. export.csv, consistent with it. More documentation has been added; we hope it is clearer this time.
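
To illustrate the design with a toy class (a stand-in only, not the real ccTable definition): the Reference Class holds the table once, and its member functions operate on it in place, so a large table is not copied on every call.

```r
# Toy stand-in for the ccTable idea: a Reference Class holding one table and
# exposing member functions that act on it by reference.
ToyTable <- setRefClass("ToyTable",
  fields  = list(tbl = "data.frame"),
  methods = list(
    export.csv = function(file) {
      "Write the current table to a CSV file."
      write.csv(tbl, file, row.names = FALSE)
    }
  )
)

tt <- ToyTable$new(tbl = data.frame(site = "A", hr = c(80, 85, 90)))
tt$export.csv("toy.csv")
```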

Also, it would be nice if we could see more documentation for these self-defined methods or at least a table of available methods with brief explanations.

Thanks, we have added more documentation for the ccTable member methods, which are designated as ccTable_methods.

I would like to see more documentation about pipeline. I feel like the current documentation in R/pipeline.R and inst/pipeline isn't quite enough. It would be nice to see a README in inst/pipeline.

As we mentioned above, the pipeline is removed from the package.

In the package vignette tour.Rmd, we see tb <- create.cctable(ccd, yaml.load(conf), freq=1). However, freq is actually the 2nd argument in that function. It might mislead some users in the future.

Amended, thanks.
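
As a general note, naming the arguments keeps the call unambiguous no matter where freq sits in the signature. A sketch of the idea (the formal names rec and conf below are only illustrative assumptions; the function's help page has the exact ones):

```r
# Named arguments match by name, not position, so the position of freq in the
# function definition no longer matters to the reader.
tb <- create.cctable(rec = ccd, conf = yaml.load(conf), freq = 1)
```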

I feel like the author can put more instructions in the package vignette.

A new vignette has been made, thanks.

Final point, I'm not sure if it's a good idea but I feel like building a shiny app to deal with these tasks will be very helpful, especially to help users to understand what kinds of functionalities are available in this package.

That is a really good point! We plan to build a website using shiny for displaying the data and showcasing cleanEHR functionality. It will be done as a separate project.

To Reviewer 2

The authors might consider using a prefix for every function in the package. Many successful packages use this strategy, which takes advantage of tab completion in text editors. Here, perhaps functions which use a certain class of object could be prefixed with that class name, ccd_episode_plot instead of episode.graph etc.

Thank you for your suggestion. We recognise that naming is a big problem: as the package started with a lot of experimental elements, naming was not given much attention. We have amended the package to adopt a prefix naming style where applicable. In particular, the episode.graph() function has been moved to the plot() method for ccEpisode.

Some names and terminology have the potential to be confusing to users. For example sql.demographic.table has nothing to do with sql per se, but rather arranges tables such that SQL import would be possible.

sql.demographic.table has been renamed as ccd_demographic_table.

Some names are probably only confusing for the programmers: there is a mix of cases used in function names (sql.demographic.table(ccd), extractIndexTable, unique_spell).

Amended accordingly.

it is better to use TRUE and FALSE, in place of T and F.

Modified, thanks.
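
For anyone reading along, the reason this matters is that T and F are ordinary variables, while TRUE and FALSE are reserved words:

```r
# T can be silently reassigned, so code relying on it can change behaviour
# without warning; TRUE cannot be masked.
T <- 0
isTRUE(T)      # FALSE -- a check written against T is now broken
isTRUE(TRUE)   # TRUE
# TRUE <- 0    # error: assignment to a reserved word is not allowed
```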

it is also safer to use vapply in place of sapply or unlist(lapply).

We didn't manage to switch all the cases, but we have changed a couple of them.
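
The attraction of vapply is that it declares the expected type and length of each result, so a malformed element fails loudly instead of silently changing the shape of the output. A small illustration:

```r
x <- list(a = 1:3, b = 4:10)

sapply(x, length)                          # shape of the result is guessed
vapply(x, length, FUN.VALUE = integer(1))  # each result checked: one integer
```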

I wonder if some of the functions could be split apart. For example, episode.graph creates both a plot and a rearrangement of the data. Perhaps the plotting code could be moved to a plot() method for the ccRecord class?

This is a great idea, and we have now moved it to the plot() method for the ccEpisode class.
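
The registration looks roughly like the toy example below (the class here is a stand-in, not the real ccEpisode; the real method plots the episode's clinical time series):

```r
# Toy S4 class and a plot() method for it, mirroring the ccEpisode change.
setClass("ToyEpisode", slots = c(time = "numeric", value = "numeric"))

setMethod("plot", signature(x = "ToyEpisode", y = "missing"),
  function(x, y, ...) {
    # Only plotting lives here; any data rearrangement stays in its own function.
    plot(x@time, x@value, type = "l", xlab = "time", ylab = "value", ...)
  })

ep <- new("ToyEpisode", time = c(1, 2, 3, 4), value = c(80, 82, 85, 90))
plot(ep)
```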

I note that all of Rcpp is listed as an Import, but only the function evalCpp was imported directly. Is loading the entire package necessary?

Amended accordingly. Thanks.
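
For the record, the change amounts to importing only what is used: DESCRIPTION keeps Rcpp under Imports (and LinkingTo), while the NAMESPACE imports just evalCpp, e.g. via a package-level roxygen2 block like the sketch below (assuming the shared library carries the package name):

```r
# In e.g. R/cleanEHR-package.R -- roxygen2 expands these tags into the
# NAMESPACE directives useDynLib(cleanEHR) and importFrom(Rcpp, evalCpp).
#' @useDynLib cleanEHR
#' @importFrom Rcpp evalCpp
NULL
```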

The Vignette contains an example of cleaning data, based on instructions in a YAML code block. create.cctable(ccd, yaml.load(conf), freq=1). I find using the yaml.load function here to be a bit awkward, and difficult for users who are unfamiliar with yaml. It might be easiest to encourage users to write a short file, conf.yaml, and to direct create.cctable to this filename, rather than to an R object.

In fact, the function can accept either a list or the path of a YAML configuration file as input. You are right that it might be confusing to use yaml.load in the vignette; in the new version of the vignette we demonstrate it without using yaml.load.
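
In other words, both of the calls below are meant to work; the second avoids yaml.load entirely (the file name conf.yaml is just an example):

```r
# Passing a parsed list, as the old vignette did ...
tb <- create.cctable(ccd, yaml::yaml.load_file("conf.yaml"), freq = 1)
# ... or simply the path to the YAML configuration file.
tb <- create.cctable(ccd, "conf.yaml", freq = 1)
```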

I feel like pander and ggplot should be listed in Suggests, rather than Imports. Then users could use the basic functionality of this package without necessarily installing these packages. This might also entail separating where these functions are used from the data-cleaning that sometimes precedes them.

We actually feel that data quality reporting and visualisation, where pander and ggplot are used, are crucial to the data cleaning process. For this reason, we feel it is necessary to have these packages loaded from the start. Please let us know if you feel otherwise; we are happy to discuss this.
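
Should we revisit this, the standard pattern for a Suggests dependency is a run-time guard along these lines (a sketch only; the function body is illustrative):

```r
data.quality.report <- function(ccd) {
  # Fail early, with a clear message, if an optional package is not installed.
  for (pkg in c("pander", "ggplot2")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required for the data quality report; please install it.")
    }
  }
  # ... build the report with pander::pander() and ggplot2::ggplot() ...
}
```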

Some documentation, e.g. in ccRecord, refers to S3 classes when in fact these are S4 classes.

That is a mistake. Amended!

There seem to be some vestigial R scripts and other files in inst/sql and inst/report, could these be deleted, or perhaps moved to a development branch?

Unnecessary files removed.

I ran all the code in the Vignette, and found that sql.demographic.table(ccd) produced the following error for me (first sentence only): Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference.

This is probably a warning from data.table rather than an error; it appears with some R/data.table versions, but the code still runs through.
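
One scenario that can trigger it (depending on the data.table version) is a table restored from disk, which loses data.table's internal self-reference; taking a copy, or calling setDT() again, makes := quiet:

```r
library(data.table)

# A table serialised and read back can lose its internal self-reference,
# which is what the message refers to.
tmp <- tempfile(fileext = ".rds")
saveRDS(data.table(x = 1:3), tmp)
dt <- readRDS(tmp)

dt <- copy(dt)          # or: setDT(dt)
dt[, flag := TRUE]      # := now adds the column by reference without the message
```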

I ran devtools::spell_check(), and found the following errors (among other trivial "errors" like variable names etc.):

Amended.