pharmaverse/metatools

Issues with build_from_derived() for multiple datasets

Closed this issue · 9 comments

@statasaurus - we're trying to get metacore/metatools/xportr working for some Roche templates.

For build_from_derived(), the documentation does not make clear that when you provide multiple datasets via ds_list, it joins them by the sort key variables from key_seq. This doesn't always work well in practice, as ADaMs are often sorted by derived variables like ADTM, which are not yet derived at this stage and would never exist in ADSL anyway. Given that at this stage it is really only ever a case of merging ADSL with an SDTM dataset, could we not just use STUDYID and USUBJID, or allow the user to pass the join variables?

Also, it would be good to have an example in the documentation of how to use this for, say, an ADAE where you need to take variables from both ADSL and AE. I've been playing around with different usages of the dataset_name argument but I can't figure it out. Any advice appreciated!

@rossfarrugia I can totally fix the documentation. I have also fixed this issue (at least as I understand it) in dev, so you might need to pull from there and try again. I made it so it will only use the key_seq variables that are present.

cool, that approach makes sense. i'll try your dev version.

Assuming above means "that are present in both datasets", but will check.

Yes exactly that 😸
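For readers following along, the behaviour agreed on above (join only on key variables present in both datasets) can be sketched in base R. This is an illustration under stated assumptions, not the metatools implementation: the toy data, the key_vars vector (standing in for the key_seq variables), and the ASTDTM/AESEQ names are all hypothetical.

```r
# Toy datasets; every name and value here is made up for illustration.
adsl <- data.frame(
  STUDYID = "S1",
  USUBJID = c("01", "02"),
  AGE     = c(34L, 51L)
)
ae <- data.frame(
  STUDYID = "S1",
  USUBJID = c("01", "01", "02"),
  AESEQ   = c(1L, 2L, 1L),
  AEDECOD = c("HEADACHE", "NAUSEA", "RASH")
)

# Hypothetical sort keys for ADAE. ASTDTM is not derived yet, and
# AESEQ never exists in ADSL, so neither can be used for the join.
key_vars <- c("STUDYID", "USUBJID", "ASTDTM", "AESEQ")

# Keep only the keys that exist in every input dataset.
join_vars <- Reduce(intersect, list(key_vars, names(adsl), names(ae)))

# Left join AE onto ADSL using the surviving keys.
adae_start <- merge(ae, adsl, by = join_vars, all.x = TRUE)
```

With these toy inputs the join falls back to STUDYID and USUBJID, which matches the suggestion made earlier in the thread.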

@statasaurus it runs now and the join works, but several ADSL vars get dropped and I can't understand why, as they're all in my metacore object that I've read in from the specs.

For example I have:

metacore$var_spec

# A tibble: 149 × 6
   variable length label                            type    format common
   <chr>     <int> <chr>                            <chr>   <chr>  <lgl> 
 1 STUDYID       8 Study Identifier                 text    NA     FALSE 
 2 USUBJID      50 Unique Subject Identifier        text    NA     FALSE 
 3 SUBJID       50 Subject Identifier for the Study text    NA     FALSE 
 4 SITEID       20 Study Site Identifier            text    NA     FALSE 
 5 AGE           8 Age                              integer NA     FALSE 

But when I run the following, I lose, for example, the AGE variable from ADSL:

adae_preds <- build_from_derived(metacore,
                                 ds_list = list("adsl" = adsl, "ae" = ae),
                                 predecessor_only = FALSE,
                                 keep = TRUE)

What does the derivation look like?

> metacore$derivations
# A tibble: 163 × 2
   derivation_id derivation         
   <chr>         <chr>              
 1 ADAE.STUDYID  "ADSL.STUDYID"     
 2 ADAE.USUBJID  "ADSL.USUBJID"     
 3 ADAE.SUBJID   "ADSL.SUBJID"      
 4 ADAE.SITEID   "ADSL.SITEID"      
 5 ADAE.REGION1  "ADSL.REGION1 \r\n"
 6 ADAE.COUNTRY  "ADSL.COUNTRY \r\n"
 7 ADAE.ETHNIC   "ADSL.ETHNIC"      
 8 ADAE.AGE      "ADSL.AGE \r\n"    
 9 ADAE.AGEU     "ADSL.AGEU \r\n"   
10 ADAE.AAGE     "ADSL.AAGE \r\n"

I think that's it! All the ones getting dropped have the \r\n. I found it's because the Excel file we read in has a new blank line below. We'll address this in our reader function, but it might be worth an assertive check for the future, as I'm sure other companies could have the same issue.
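A reader-side check along these lines could look like the sketch below. It uses only base R and a hypothetical two-row stand-in for the derivations table; it is not the check metatools itself performs.

```r
# Hypothetical stand-in for metacore$derivations (two rows only).
derivations <- data.frame(
  derivation_id = c("ADAE.SITEID", "ADAE.AGE"),
  derivation    = c("ADSL.SITEID", "ADSL.AGE \r\n")
)

# Flag entries whose text changes when surrounding whitespace is removed.
untrimmed <- derivations$derivation != trimws(derivations$derivation)

if (any(untrimmed)) {
  message(
    "Derivations with stray whitespace: ",
    paste(derivations$derivation_id[untrimmed], collapse = ", ")
  )
}

# Clean the column so "ADSL.AGE \r\n" becomes "ADSL.AGE".
derivations$derivation <- trimws(derivations$derivation)
```

trimws() strips spaces, tabs, carriage returns and newlines from both ends by default, which covers the trailing \r\n seen in the output above.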

Another consideration here is that sometimes company specs read like:

AE.AEDECOD
MedDRA Version xx.x

This wouldn't work for this function and the variable would get dropped, so maybe worth considering just checking the first line here so that such cases still work.
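Until something like that exists, the same effect can be approximated in a company's own spec reader before the metacore object is built. The helper below, first_line(), is a hypothetical name and a minimal base R sketch; it is not part of metacore or metatools.

```r
# Hypothetical helper: keep only the first line of each derivation
# string and trim surrounding whitespace.
first_line <- function(x) {
  parts <- strsplit(x, "\r?\n")
  vapply(parts, function(p) trimws(p[1]), character(1))
}

# Example spec entries (made up), including a multi-line one.
spec_derivations <- c(
  "AE.AEDECOD\r\nMedDRA Version xx.x",
  "ADSL.AGE \r\n",
  "ADSL.SITEID"
)

cleaned <- first_line(spec_derivations)
```

Note that this discards everything after the first line, so it only suits specs where the predecessor reference always comes first.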

So I am not going to handle the second case you mentioned. I think it will be too difficult to parse because the derivation column is a bit of a free-for-all. But I have added the removal of whitespace, so the issue you ran across shouldn't happen again even if you don't change your readers.

This is now on CRAN