ccb-hms/nhanes-database

Some codebook tables get rows repeated in DB

Closed this issue · 2 comments

Example:

> nhanesCodebook("WHQ")$WHD120$WHD120
  Code.or.Value Value.Description Count Cumulative Skip.to.Item
1     66 to 400   Range of Values  4019       4019         <NA>
2         77777           Refused     2       4021         <NA>
3         99999        Don't know   239       4260         <NA>
4             .           Missing  1784       6044         <NA>
5     66 to 400   Range of Values  4019       4019         <NA>
6         77777           Refused     2       4021         <NA>
7         99999        Don't know   239       4260         <NA>
8             .           Missing  1784       6044         <NA>

Other variables in the same table seem fine:

> nhanesCodebook("WHQ")$WHD130$WHD130
  Code.or.Value Value.Description Count Cumulative Skip.to.Item
1      39 to 79   Range of Values  2270       2270         <NA>
2          7777           Refused     1       2271         <NA>
3          9999        Don't know   146       2417         <NA>
4             .           Missing  3627       6044         <NA>

The source looks fine:

https://wwwn.cdc.gov/nchs/nhanes/1999-2000/WHQ.htm#WHD120

and the non-DB version of nhanesCodebook() also looks OK.

I think this is the only table with this problem.
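
A minimal sketch of how repeated rows like these could be flagged, in case it helps with checking other tables. It does not query the DB: the data frame is a hand-typed copy of the WHD120 output above, and `has_repeated_rows()` is a hypothetical helper, not part of the nhanesA package.

```r
# Hypothetical helper: TRUE if any row of a codebook data frame
# is an exact repeat of an earlier row.
has_repeated_rows <- function(cb) any(duplicated(cb))

# Hand-built copy of the WHD120 codebook as it appears in the source page
# (values typed in from the output above, not fetched from the DB).
whd120 <- data.frame(
  Code.or.Value     = c("66 to 400", "77777", "99999", "."),
  Value.Description = c("Range of Values", "Refused", "Don't know", "Missing"),
  Count             = c(4019, 2, 239, 1784),
  Cumulative        = c(4019, 4021, 4260, 6044),
  Skip.to.Item      = NA_character_,
  stringsAsFactors  = FALSE
)

# Simulate what the DB version returns: the same four rows twice.
whd120_db <- rbind(whd120, whd120)

has_repeated_rows(whd120)     # clean copy
has_repeated_rows(whd120_db)  # doubled copy

# unique() on a data frame drops exact repeats, keeping the first occurrence.
deduped <- unique(whd120_db)
nrow(deduped)
```

Note that dropping exact repeats this way would also collapse rows that are genuinely duplicated in the source webpage, so it is only a detection aid, not a proposed fix.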

In some other cases the rows are duplicated in the source webpage itself. I am not sure what we should store in those cases, but if anyone wants to look at examples:

https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/KIQ_U_C.htm

https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DXX_H.htm

https://wwwn.cdc.gov/nchs/nhanes/2003-2004/L06MH_C.htm

@sam-pullman - let @rsgoncalves know if you'd rather this be addressed in the metadata. He identified it there too.

@Genoa-HMS I suggest we discuss this during the Epiconductor exploration in-person meeting. We need to identify the root cause of this issue before we can assign a team to it. It could be caused by the translation code, the metadata, nhanesCodebook(), or the raw CDC data.