This package is under the very early stages of initial development.
- SQLite database creation and a tutorial (below) on basic usage
- Some built in processing functions for ISARIC data
- There is also a pyISARICBasics package that provides similar functionality for researchers who wish to use Python: (https://github.com/KyleYoung1997/pyISARICBasics)
During this tutorial, we'll use the convention of appending packagename::
to
every function that comes from a package outside of base R.
This makes it easy to keep track
of which R package each function comes from. So, for example, instead of
writing install_github()
(a function from the devtools
package), we will
write devtools::install_github()
.
For this tutorial, it will be handy to have file paths prepared.
First, create an empty list called DIRS
, which will store all of the file
paths.
DIRS <- list()
Now, set the location of the ISARIC data CSV files:
DIRS$data <- "C:/my/path/07APR2021/csvs"
We'll be making a SQLite database, and we need somewhere to store it, as well as a name
for the database file. The following will set the database location to the
same directory as the CSVs, and give it the name "db.sqlite"
:
DIRS$db_filename <- "db.sqlite"
DIRS$db <- DIRS$data
Start by installing the devtools
package:
install.packages("devtools")
You can now install the current version of ISARICBasics with:
devtools::install_github("ISARICDataPlatform/ISARICBasics")
Then, load the ISARICBasics
package:
library(ISARICBasics)
The database requires no additional installation, and is built from
the CSV files provided by ISARIC. Assuming you have created the DIRS
list, as above,
you can build the SQLite database with:
ISARICBasics::build_sqlite(
csv_folder=DIRS$data,
sql_folder=DIRS$db,
sql_filename=DIRS$db_filename,
overwrite=FALSE)
The setting overwrite=FALSE
makes sure that, if the SQLite database has already
been built, no existing tables will be overwritten. Setting overwrite=TRUE
will cause every table in the database to be overwritten. However, note that the
ISARIC CSV files will never be changed or overwritten.
If you want more information on the build_sqlite
function, you can open the
help file by executing the code:
?ISARICBasics::build_sqlite
One advantage of having the SQLite database, is that browsing large tables is
easier and faster than it would be through, e.g., Excel. To browse the database,
you can install DB Browser for SQLite. Once
DB Browser is installed, you can open the SQLite database in DB Browser by
clicking on the database file. If you have followed this tutorial until now,
the SQLite database file will be located in the same folder as the ISARIC CSV
files, and will have the name db.sqlite
. DB Browser is quite powerful. Here
is a beginner friendly tutorial.
Now the SQLite database is built, it does not have to be built again. Each time you use it with R, you can connect to it with:
con <- DBI::dbConnect(RSQLite::SQLite(), DIRS$db)
The object con
represents a connection to the whole database. We can
find out what tables are available:
DBI::dbListTables(con)
#> [1] "DM" "DS" "IN" "LB" "SA" "VS"
To connect to just the LB table, we use con
as follows:
lb <- dplyr::tbl(con, "LB")
Now, lb
can be treated much like a data.frame
. We can easily print the
first 10 rows of the LB table, and some other information:
print(lb)
#> # Source: table<LB> [?? x 19]
#> # Database: sqlite 3.35.4
#> # [C:\my\path\07APR2021\csvs\db.sqlite]
#> STUDYID DOMAIN USUBJID LBSEQ LBTESTCD LBTEST LBCAT
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 CVMEWUS LB CVMEWUS_~ 1 HGB Hemoglob~ NA
#> 2 CVMEWUS LB CVMEWUS_~ 2 MCH Ery. Mea~ NA
#> 3 CVMEWUS LB CVMEWUS_~ 3 PLAT Platelets NA
#> 4 CVMEWUS LB CVMEWUS_~ 4 RBC Erythroc~ NA
#> 5 CVMEWUS LB CVMEWUS_~ 5 RDW Erythroc~ NA
#> 6 CVMEWUS LB CVMEWUS_~ 6 WBC Leukocyt~ NA
#> 7 CVMEWUS LB CVMEWUS_~ 7 IRON Iron NA
#> 8 CVMEWUS LB CVMEWUS_~ 8 K Potassium NA
#> 9 CVMEWUS LB CVMEWUS_~ 9 LDH Lactate ~ NA
#> 10 CVMEWUS LB CVMEWUS_~ 10 LYM Lymphocy~ NA
#> # ... with more rows, and 12 more variables:
#> # LBSCAT <int>, LBORRES <chr>, LBORRESU <chr>,
#> # LBSTRESC <chr>, LBSTRESN <dbl>, LBSTRESU <chr>,
#> # LBSTAT <chr>, LBREASND <chr>, LBSPEC <chr>,
#> # LBMETHOD <int>, LBDY <dbl>, LBEVINTX <chr>
Notice, at the top of the above output, the sentence Source: table<LB> [?? x 19]
.
The ??
indicates that we do not know how many rows the LB table has. This is
because the LB table hasn't been fully loaded into R memory. We are just connected
to the table, which still resides in the SQLite database.
To view just the column names of the LB
table:
colnames(lb)
#> [1] "STUDYID" "DOMAIN" "USUBJID" "LBSEQ"
#> [5] "LBTESTCD" "LBTEST" "LBCAT" "LBSCAT"
#> [9] "LBORRES" "LBORRESU" "LBSTRESC" "LBSTRESN"
#> [13] "LBSTRESU" "LBSTAT" "LBREASND" "LBSPEC"
#> [17] "LBMETHOD" "LBDY" "LBEVINTX"
From now on, we will start using the pipe operator %>%
. If you are
unfamiliar with this handy tool, here is a tutorial.
For a small table, such as DS, we can comfortably load the whole table into R,
using dplyr::collect()
:
ds <- dplyr::tbl(con, "DS")
ds_all <- ds %>% dplyr::collect()
However, the LB table is very large. Loading the whole table into R may be unnecessary
and slow. The R package dbplyr is installed
automatically with ISARICBasics
, and is useful for working with large SQL tables.
With dbplyr
installed, we can use many functions from the more well known
R package, dplyr, on our SQL table connections.
A common operation is to filter the data, to retrieve only rows matching some
condition. For example, we can filter the LB table, using our lb
connection,
to only connect to rows that have the value "CREAT"
entered in the
column LBTESTCD:
lb_creat <- lb %>% dplyr::filter(LBTESTCD == "CREAT")
The object lb_creat
is now a connection, just like lb
, but it is aware that
we only want to look at cases where LBTESTCD == "CREAT"
. If we want to load
all those cases into R memory, we can use dplyr::collect()
again:
lb_creat_all <- lb_creat %>% dplyr::collect()
Now the full table has been loaded in R memory, as data.frame
. We can,
for example, look at its dimensions.
dim(lb_creat_all)
#> [1] 403509 19
Above, see that we have 403,509 rows, and 19 columns, pertaining to "CREAT"
records in the LB table.
The dplyr
package provides many other ways for us to summarise and retrieve
information. For example, we can find out how many different units have
been used, and how many times, for reporting on CREAT tests:
lb_creat %>%
dplyr::group_by(LBORRESU) %>%
dplyr::summarise(n=n()) %>%
dplyr::collect()
#> # A tibble: 12 x 2
#> LBORRESU n
#> <chr> <int>
#> 1 NA 4561
#> 2 0.9 1
#> 3 MG/L 1
#> 4 MMOL/L 1
#> 5 U/L 2
#> 6 mg/L 9
#> 7 mg/dL 83307
#> 8 mg/dL' 1
#> 9 mmol/L 424
#> 10 mol 21
#> 11 umol 878
#> 12 umol/L 314303
Above, we have grouped the rows of lb_creat
, by the contents of the LBORRESU
column, and then we have summarised the result, asking for a tally of the
number of rows in each group, with dplyr::summarise(n=n())
. Grouping and
summarising are powerful tools. For example, we can find the proportion of
missing units (LBORRESU) for each type of lab test (LBTESTCD):
lb %>%
dplyr::group_by(LBTESTCD) %>%
dplyr::summarise(n=sum(is.na(LBORRESU), na.rm=T)) %>%
dplyr::collect()
#> # A tibble: 78 x 2
#> LBTESTCD n
#> <chr> <int>
#> 1 ALB 0
#> 2 ALP 2639
#> 3 ALT 12246
#> 4 AMYLASE 0
#> 5 APTT 3288
#> 6 APTTSTND 10396
#> 7 AST 10202
#> 8 BASEEXCS 38
#> 9 BASO 0
#> 10 BASOLE 0
#> # ... with 68 more rows
The dplyr::filter
function is also powerful. For example, we can use it to
do free text searches with the %like%
operator.
Here, we search for all LBTEST entries containing the
word "platelets":
lb %>% dplyr::filter(LBTEST %like% "%platelets%")
There are many more tools available in dplyr
, and many are discussed
clearly in the documentation.
ISARIC data are stored in an SDTM format and collected from the ISARIC Case Report Forms (CRF).
ISARICBasics processing functions are designed to transform some of the more complicated encodings for ISARIC data, to simpler formats.
To start, we will work with the SA and HO tables, to categorise events according to whether they occurred ('yes'), did not occur ('no'), or have an unknown status ('unknown') at each point in time.
sa <- dplyr::tbl(con, "SA")
ho <- dplyr::tbl(con, "HO")
colnames(sa)
#> [1] "STUDYID" "DOMAIN" "USUBJID" "SASEQ" "SATERM"
#> [6] "SAMODIFY" "SACAT" "SASCAT" "SAPRESP" "SAOCCUR"
#> [11] "SASTAT" "SAREASND" "SADY" "SASTDY" "SAENDY"
#> [16] "SADUR" "SASTRF" "SAEVLINT" "SAEVINTX"
colnames(ho)
#> [1] "STUDYID" "DOMAIN" "USUBJID" "HOSEQ" "HOTERM"
#> [6] "HODECOD" "HOPRESP" "HOOCCUR" "HOSTAT" "HOREASND"
#> [11] "HODY" "HOSTDY" "HOENDY" "HODUR" "HOSTRF"
#> [16] "HOEVINTX" "HODISOUT" "SELFCARE" "HOINDC"
We can see above that SA and HO have many similar column names,
when the prefix xx
is ignored. In particular, they both have:
xxTERM
: the verbatim wording of the event (non-standardised).xxOCCUR
: helps indicate whether an event occurred.xxPRESP
: 'y' indicates the observation was pre-specified on the CRF, while missing or 'N' indicates spontaneous reports.xxSTDY
: the day of the event, relative to admission.
Rather than work with the verbatim wording of an event (xxTERM
), which is
highly variable, we can work with the standardised versions, e.g.,
HODECOD
or SAMODIFY
.
To understand xxOCCUR
and xxPRESP
better, we can find out how many combinations
of these appear in each table, and how many times each combination occurs:
sa %>%
dplyr::group_by(SAOCCUR, SAPRESP) %>%
dplyr::summarise(n=n()) %>%
dplyr::collect()
#> # A tibble: 5 x 3
#> # Groups: SAOCCUR [4]
#> SAOCCUR SAPRESP n
#> <chr> <chr> <int>
#> 1 NA NA 250815
#> 2 NA Y 43496
#> 3 N Y 17334310
#> 4 U Y 5388019
#> 5 Y Y 2426457
ho %>%
dplyr::group_by(HOOCCUR, HOPRESP) %>%
dplyr::summarise(n=n()) %>%
dplyr::collect()
#> # A tibble: 6 x 3
#> # Groups: HOOCCUR [4]
#> HOOCCUR HOPRESP n
#> <chr> <chr> <int>
#> 1 NA NA 218769
#> 2 NA Y 1000
#> 3 N Y 651063
#> 4 U Y 6331
#> 5 Y NA 577
#> 6 Y Y 468717
We can see, for example, that in the HO table, the value HOOCCUR = 'Y'
appeared beside the value HOPRESP = NA
, a total of 577 times.
The function ISARICBasics::process_occur
will process each possible combination
of xxOCCUR
and xxPRESP
into one column, called status
, which takes values
'yes' (occurred), 'no' (did not occur), or 'unknown' (unknown status):
ho_modified <- ho %>%
ISARICBasics::process_occur(xxOCCUR = HOOCCUR, xxPRESP = HOPRESP)
We can now summarise the results to see how the new column named status
compares to HOOCCUR
and HOPRESP
:
ho_modified %>%
dplyr::group_by(HOOCCUR, HOPRESP, status) %>%
dplyr::summarise(n=n()) %>%
dplyr::collect()
#> # A tibble: 6 x 4
#> # Groups: HOOCCUR, HOPRESP [6]
#> HOOCCUR HOPRESP status n
#> <chr> <chr> <chr> <int>
#> 1 NA NA yes 218769
#> 2 NA Y unknown 1000
#> 3 N Y no 651063
#> 4 U Y unknown 6331
#> 5 Y NA yes 577
#> 6 Y Y yes 468717
Note that the status
column added by ISARICBasics::process_occur
is
lowercase, while the other columns are uppercase. This is a good convention
to keep track of which columns are from the original ISARIC data (uppercase) and which
columns are derived (lowercase).