epiforecasts/socialmixr

get_survey() does not return the wave or day of survey information

krivit opened this issue · 2 comments

This is using CRAN or GitHub main:

library(socialmixr)
#> 
#> Attaching package: 'socialmixr'
#> The following object is masked from 'package:utils':
#> 
#>     cite
be_survey <- get_survey("10.5281/zenodo.4035001")
#> Getting CoMix social contact data (Belgium ).
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_contact_common.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_contact_extra.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_hh_common.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_hh_extra.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_participant_common.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_participant_extra.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_be_sday.csv
#> Downloading https://zenodo.org/api/files/491f3ab1-d0c7-4db1-9e85-f6a7d617d442/CoMix_BE_sday.csv
#> Using CoMix social contact data (Belgium ). To cite this in a publication, use the 'cite' function
str(be_survey)
#> List of 3
#>  $ participants:Classes 'data.table' and 'data.frame':   20775 obs. of  28 variables:
#>   ..$ hh_id                                    : chr [1:20775] "HH1200101" "HH1200101" "HH1200102" "HH1200102" ...
#>   ..$ part_id                                  : int [1:20775] 1200101 1200101 1200102 1200102 1200103 1200104 1200104 1200105 1200105 1200106 ...
#>   ..$ part_age                                 : int [1:20775] 73 73 73 73 73 73 73 73 73 73 ...
#>   ..$ part_gender                              : chr [1:20775] "M" "M" "M" "M" ...
#>   ..$ part_occupation                          : chr [1:20775] "retired" "retired" "retired" "retired" ...
#>   ..$ multiple_contacts_child_work             : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_child_school           : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_child_other            : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_adult_work             : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_adult_school           : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_adult_other            : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_older_adult_work       : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_older_adult_school     : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_older_adult_other      : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_child_work_phys        : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_child_school_phys      : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_child_other_phys       : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_adult_work_phys        : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_adult_school_phys      : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_adult_other_phys       : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_older_adult_work_phys  : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_older_adult_school_phys: int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ multiple_contacts_older_adult_other_phys : int [1:20775] NA NA NA NA NA NA NA NA NA NA ...
#>   ..$ part_education                           : chr [1:20775] "professional upper secondary (6 years)" "professional upper secondary (6 years)" "professional upper secondary (6 years)" "professional upper secondary (6 years)" ...
#>   ..$ panel_id                                 : int [1:20775] 12001 12001 12001 12001 12001 12001 12001 12001 12001 12001 ...
#>   ..$ country                                  : Factor w/ 1 level "Belgium": 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ hh_size                                  : int [1:20775] 2 2 2 2 2 2 2 2 2 2 ...
#>   ..$ hh_type                                  : chr [1:20775] "Couple with no children" "Couple with no children" "Couple with independent children only" "Couple with independent children only" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "hh_id"
#>  $ contacts    :Classes 'data.table' and 'data.frame':   31639 obs. of  24 variables:
#>   ..$ cont_id              : int [1:31639] 12001001 12001002 12001003 12001004 12001005 12001006 12001007 12001008 12002001 12002002 ...
#>   ..$ part_id              : int [1:31639] 1200108 1200108 1200108 1200108 1200108 1200108 1200108 1200108 1200208 1200208 ...
#>   ..$ cnt_age_exact        : logi [1:31639] NA NA NA NA NA NA ...
#>   ..$ cnt_age_est_min      : int [1:31639] 55 0 55 0 55 55 55 55 45 20 ...
#>   ..$ cnt_age_est_max      : int [1:31639] 64 120 64 120 64 64 64 64 54 24 ...
#>   ..$ cnt_gender           : chr [1:31639] "F" NA "F" NA ...
#>   ..$ frequency_multi      : int [1:31639] 1 5 1 5 1 1 1 1 1 1 ...
#>   ..$ phys_contact         : int [1:31639] 1 2 1 2 1 1 1 1 2 2 ...
#>   ..$ cnt_home             : logi [1:31639] TRUE FALSE TRUE TRUE TRUE TRUE ...
#>   ..$ cnt_work             : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_school           : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_transport        : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_leisure          : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_otherplace       : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ duration_multi       : int [1:31639] 5 1 5 1 5 5 5 5 5 4 ...
#>   ..$ cnt_outside_other    : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_other_house      : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_worship          : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_supermarket      : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_shop             : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ cnt_public_market    : logi [1:31639] FALSE FALSE FALSE FALSE FALSE FALSE ...
#>   ..$ individually_reported: int [1:31639] 1 1 1 1 1 1 1 1 1 1 ...
#>   ..$ cnt_gender_all       : chr [1:31639] "female" NA "female" NA ...
#>   ..$ frequency_multi_all  : chr [1:31639] "1-2 days" "never met" "1-2 days" "never met" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr "cont_id"
#>  $ reference   :List of 5
#>   ..$ title  : chr "CoMix social contact data (Belgium )"
#>   ..$ bibtype: chr "Misc"
#>   ..$ author :List of 11
#>   .. ..$ :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : chr "Pietro"
#>   .. .. .. ..$ family : chr "Coletti"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#> [SNIP]
#>   .. ..$ :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : chr "Niel"
#>   .. .. .. ..$ family : chr "Hens"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..- attr(*, "class")= chr "person"
#>   ..$ year   : int 2020
#>   ..$ doi    : chr "10.5281/zenodo.4035001"
#>  - attr(*, "class")= chr "survey"

Created on 2022-06-02 by the reprex package (v2.0.1)

Is it possible to obtain survey day and wave for each participant using the package?

sbfnk commented

Thanks for reporting this. At the moment this isn't possible, because the code assumes that wave is needed to identify the same participant being interviewed in repeated surveys. Here the main files don't have a wave column, so it is conservatively assumed that the survey-day file cannot be merged: we don't know which wave a particular set of survey days refers to.

additional_id_identifiers <- c("sday_part_number", "wave")

can_merge <- vapply(files, function(x) {

It seems that in this particular data set participant IDs are unique across waves, but I don't think this can be assumed to hold for every survey. I'll need to think about how this could be addressed (and would welcome suggestions should you have any).
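One way to test that assumption for a given survey, sketched here on a toy table rather than the real download: count how many distinct waves each participant ID occurs in. The column names part_id, wave and sday_id are assumptions about the layout of a *_sday.csv file, and this check is not something socialmixr performs itself.

```r
# Toy stand-in for a *_sday.csv table; column names are assumptions.
sday <- data.frame(
  part_id = c(101L, 102L, 201L, 202L),
  wave    = c(1L, 1L, 2L, 2L),
  sday_id = c("2020-04-24", "2020-04-25", "2020-05-08", "2020-05-09")
)

# Number of distinct waves in which each participant ID occurs.
waves_per_id <- tapply(sday$wave, sday$part_id, function(w) length(unique(w)))

# TRUE when every ID occurs in a single wave, so part_id alone would be a safe key.
ids_unique_across_waves <- all(waves_per_id == 1L)
```

If the result is FALSE for some survey, part_id on its own does not say which wave a row belongs to, which is the case the current conservative behaviour guards against.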

The workaround that we are currently using is to download the *_sday.csv file separately and join it to the participant and contact tables by participant ID. I am not 100% sure this is the correct thing to do, but it at least seems to give sensible results.
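A minimal sketch of that workaround in base R, using toy data frames in place of the real tables. The columns wave and dayofweek are assumptions about what the survey-day file contains, and the guard on duplicated IDs is an added precaution, not part of the package.

```r
# Toy stand-ins for be_survey$participants, be_survey$contacts and the
# separately downloaded *_sday.csv; column names other than part_id and
# cont_id are assumptions.
participants <- data.frame(part_id = c(101L, 102L, 201L),
                           part_age = c(73L, 45L, 30L))
contacts <- data.frame(cont_id = 1:4,
                       part_id = c(101L, 101L, 102L, 201L))
sday <- data.frame(part_id = c(101L, 102L, 201L),
                   wave = c(1L, 1L, 2L),
                   dayofweek = c(5L, 6L, 5L))

# The join is only unambiguous if each participant ID has one survey-day row.
stopifnot(!anyDuplicated(sday$part_id))

# Left joins: keep every participant/contact row, add wave and day columns.
participants_sday <- merge(participants, sday, by = "part_id", all.x = TRUE)
contacts_sday <- merge(contacts, sday, by = "part_id", all.x = TRUE)
```

With the real data, sday would come from reading the downloaded CSV (for example with read.csv()), and the other two tables from the object returned by get_survey(). If the guard fails, the IDs repeat across survey days and a join on part_id alone would duplicate rows.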