vanderbilt-data-science/lapop-dashboard

Issues with country data files

Closed this issue · 4 comments

  1. Columns differ widely across the data files, and some columns present in the questions_and_categories lookup table are missing from them.
  2. atg_2016_cy_lan_p has a country that is not present in the country lookup table. This country is numbered 33, and we don't seem to have 33 anywhere.
  3. The idnum column is inconsistent across data files: it either has different types or is named differently.
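
A check like the one below could surface codes such as 33, or column names, that have no entry in a lookup table. This is only a rough sketch: the function name `missing_from_lookup` and the sample values are made up for illustration, since the real lookup tables are not shown in this issue.

```python
def missing_from_lookup(observed, lookup):
    """Return the observed values that have no entry in the lookup, sorted."""
    return sorted(set(observed) - set(lookup))


# Hypothetical values; the actual country lookup table is not reproduced here.
lookup_codes = {1, 2, 3}
observed_codes = [1, 2, 33, 33]

print(missing_from_lookup(observed_codes, lookup_codes))  # prints [33]
```

The same function works for column names: pass the columns of one data file as `observed` and the columns named in questions_and_categories as `lookup`, or swap the two arguments to find lookup columns that a file is missing.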

List of problematic files:

Incorrect Year In File:

  • bra_2007_cy_lan_p.dta
  • guy_2008_cy_lan_p.dta
  • ury_2007_cy_lan_p.dta
  • ven_2007_cy_lan_p.dta

  • atg_2016_cy_lan_p.dta: not sure what atg is, and the country does not exist in the lookups
  • gtm_2006_cy_lan_p.dta: throws a weird error when trying to read the file

Multi-year files (unsure how to label year and "Wave" in these):

  • ven_2016-2017_cy_lan_p
  • ecu_2016-2017_cy_lan_p

Suggested solutions:

  1. Use new data files with updated names, stored in Box as data_v6
  2. Where filename and contents disagree on year, use filename year as wave, and content year as year
  3. ATG is Antigua and Barbuda. I shared an updated country/ISO/LAPOP code table in Slack: https://datasciencetip.slack.com/files/U010GMT8J2X/F013M8ZDGSH/iso_3166-1-alpha3_country-codes.csv. I'm not sure which table you are using, but it would be great if you could update that resource with what I shared.
  4. The GTM file seems okay; I'm having trouble reading it with haven, but read.dta13 from the package readstata13 does the job
  5. In multi-year files, use content year as year, and for now the first filename year as wave
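
The labeling rules in suggestions 2 and 5 could be sketched roughly as follows. This is an illustration only, not the dashboard's actual code: the names `label_file` and `Labels` are hypothetical, the filename pattern is inferred from the files listed in this issue, and the content year is assumed to have been read from the data file already.

```python
import re
from typing import NamedTuple

# Pattern inferred from names such as bra_2007_cy_lan_p.dta and
# ven_2016-2017_cy_lan_p: country code, then one year or a year range.
FILENAME_RE = re.compile(
    r"^(?P<country>[a-z]{3})_(?P<first>\d{4})(?:-(?P<last>\d{4}))?_"
)


class Labels(NamedTuple):
    country: str
    wave: int
    year: int


def label_file(filename: str, content_year: int) -> Labels:
    """Wave is the first filename year; year is the year found in the contents."""
    match = FILENAME_RE.match(filename.lower())
    if match is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    return Labels(
        country=match["country"].upper(),
        wave=int(match["first"]),
        year=content_year,
    )


# The content years below are placeholders, not values taken from the real files.
print(label_file("ven_2016-2017_cy_lan_p.dta", content_year=2017))
print(label_file("bra_2007_cy_lan_p.dta", content_year=2006))
```

Single-year and multi-year files go through the same path, so a file whose name and contents disagree on the year simply ends up with a wave that differs from its year.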