Exploring NHANES data through the nhanesA package.
The nhanesA
package can be used via Docker; see instructions at
https://github.com/ccb-hms/NHANES
It can be also used locally with the DB by setting suitable environment variables, e.g.,
export EPICONDUCTOR_CONTAINER_VERSION=v0.1.0
export EPICONDUCTOR_COLLECTION_DATE=2023-10-02
export EPICONDUCTOR_DB_DRIVER=FreeTDS # default is MS SQL Driver
export EPICONDUCTOR_DB_SERVER=localhost
export EPICONDUCTOR_DB_PORT=1433
It can also use the Bioconductor package BiocFileCache
to cache
downloaded files locally (set nhanesOptions(use.cache = TRUE)
).
-
R/variable-metadata.R
: extract and save per-(table,variable) metadata usingnhanesTableSummary()
. Comments here document what that function returns, which should eventually be incorporated in the man page. -
R/variable-summary-table.R
: create a per-variable summary that can be used to search variables (using their SAS labels) -
R/participant-summary.R
: combine the demographic datasets to create a basic per-participant summary.
- There are many capitalization inconsistencies, e.g. "Don't know" and "Don't Know". Could we 'fix' this in the database codebooks somehow (before translation)?
> mdb <- nhanesA:::.nhanesQuery("SELECT * FROM Metadata.VariableCodebook")
> length(unique(mdb$ValueDescription))
[1] 3731
> length(unique(tolower(mdb$ValueDescription)))
[1] 3546
-
Make searchable HTML table available somewhere. May be worth exploring if we can load JSON files separately.
-
It may be useful to segregate by 'component' or other kinds of grouping just to make tables of manageable size.
-
Using GitHub pages is one option --- see https://deepayan.github.io/nhanes/ for a sample.
-
In general, we may consider if the
ccb-hms/NHANES
repo would be a good place to host a browseable website that shows examples of analyses that we can do. -
As part of an analysis, once we decide on a list of variables to look at, there should be a function that combines them. This is already done to some extent in the
jointQuery()
function in thephonto
package. This can- combine tables within same cycle, specifying variables:
df = jointQuery( list(BPQ_J=c("BPQ020", "BPQ050A"), DEMO_J=c("RIDAGEYR","RIAGENDR")))
- combine tables across cycles:
df <- jointQuery(
list(BPQ_C = c("BPQ020", "BPQ050A"), BPQ_J = c("BPQ020", "BPQ050A"),
DEMO_C = c("RIDAGEYR","RIAGENDR"), DEMO_J = c("RIDAGEYR","RIAGENDR"))
)
-
The other variant I can think of is to specify just the list of variables, for the code to automatically find them from suitable tables.
-
Optionally some
SEQN
values could be specified as well. These are sequential over cycles, so could be in principle used to find which table each data point comes from.
load("participantSummary.rda")
tapply(participantSummary, ~ table, with, range(SEQN), simplify = FALSE) |>
array2DF(allowLong = FALSE)
table Value1 Value2
1 DEMO 1 9965
2 DEMO_B 9966 21004
3 DEMO_C 21005 31126
4 DEMO_D 31127 41474
5 DEMO_E 41475 51623
6 DEMO_F 51624 62160
7 DEMO_G 62161 71916
8 DEMO_H 73557 83731
9 DEMO_I 83732 93702
10 DEMO_J 93703 102956
-
To start with, it should be easy to write a function that takes a list of variables, and using the variable manifest to create a list that is suitable as input for
jointQuery()
. This should check if a variable appears in multiple tables in the same cycle --- and do something useful. -
To help decide what to do when a variable appears in multiple tables, we should write a helper function that finds all occurrences of a variable across tables, and looks for duplicate
SEQN
values. In case there are duplicateSEQN
values, it should first check that the values are consistent for each suchSEQN
. If not, it should summarize how they differ. Details will depend on what kinds of differences we actually observe. -
Once these are done, we should be ready to do actual analysis; e.g., follow up on the asthma subtypes, ENWAS (details?)