ropensci/rnoaa

`ghcnd_read` expects the wrong file format for .dly files.

jonathan-g opened this issue · 3 comments

Bug description

ghcnd_read fails with an error because it expects a .dly file to be a .csv file, but it's a fixed-width file with no delimiters between columns.

Reprex

library(rnoaa)
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/USW00013897.dly", "USW00013897.dly")
ghcnd_read("USW00013897.dly")
#> Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
#> cols = 1 != length(data) = 128
#> Error: Columns 2, 3, 4, 5, 6, and 122 more must be named.

Created on 2021-01-29 by the reprex package (v0.3.0)

This is what the first several lines of `USW00013897.dly` look like:

USC00111577192802TMAX-9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999      39  0-9999   -9999   
USC00111577192802TMIN-9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999     -28  0-9999   -9999   
USC00111577192802PRCP-9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999   -9999       0T 0-9999   -9999   

And this is the file format, as described in the readme-1.txt at the FTP site ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt:

III. FORMAT OF DATA FILES (".dly" FILES)

Each ".dly" file contains data for one station.  The name of the file
corresponds to a station's identification code.  For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).

Each record in a file contains one month of daily data.  The variables on each
line include the following:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------
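
As a rough illustration of the layout in the table above, here is a minimal base-R sketch that slices each record by the documented column positions. The function name `parse_dly_lines` is hypothetical and not part of rnoaa, and treating -9999 as a missing value is an assumption based on the sample records shown earlier.

```r
# Hypothetical helper (not part of rnoaa): parse raw .dly records into a
# wide data frame using the column positions from the readme table.
parse_dly_lines <- function(lines) {
  out <- data.frame(
    id      = substr(lines, 1, 11),
    year    = as.integer(substr(lines, 12, 15)),
    month   = as.integer(substr(lines, 16, 17)),
    element = substr(lines, 18, 21),
    stringsAsFactors = FALSE
  )
  for (day in 1:31) {
    # Each day occupies 8 characters: VALUE (5), then MFLAG, QFLAG, SFLAG (1 each).
    start <- 22 + 8 * (day - 1)
    value <- as.integer(substr(lines, start, start + 4))
    # Assumption: -9999 marks a missing value, as in the sample records.
    value[value %in% -9999L] <- NA_integer_
    out[[paste0("VALUE", day)]] <- value
    out[[paste0("MFLAG", day)]] <- substr(lines, start + 5, start + 5)
    out[[paste0("QFLAG", day)]] <- substr(lines, start + 6, start + 6)
    out[[paste0("SFLAG", day)]] <- substr(lines, start + 7, start + 7)
  }
  out
}

# Usage: a record shaped like the TMAX sample (28 missing days, then day 29).
record <- paste0("USC00111577192802TMAX", strrep("-9999   ", 28),
                 "   39  0", strrep("-9999   ", 2))
parsed <- parse_dly_lines(record)
parsed[, c("id", "year", "month", "element", "VALUE29", "SFLAG29")]
```

Slicing with `substr` rather than splitting on whitespace matters here because the flag characters sit directly against the next value, with no delimiter in between.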

Session Info

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rnoaa_1.3.0          revealjg_0.9.9006    rprojroot_2.0.2      reticulate_1.18      lubridate_1.7.9.2    forcats_0.5.0       
 [7] stringr_1.4.0        dplyr_1.0.3          purrr_0.3.4          readr_1.4.0          tidyr_1.1.2          tibble_3.0.5        
[13] ggplot2_3.3.3        tidyverse_1.3.0      yaml_2.2.1           rmarkdown_2.6.6.9000 knitr_1.30           pacman_0.5.1        

loaded via a namespace (and not attached):
 [1] fs_1.5.0           usethis_2.0.0      RColorBrewer_1.1-2 httr_1.4.2         tools_4.0.3        backports_1.2.1   
 [7] bslib_0.2.3.9000   utf8_1.1.4         R6_2.5.0           DBI_1.1.1          colorspace_2.0-0   withr_2.4.0       
[13] tidyselect_1.1.0   gridExtra_2.3      processx_3.4.5     curl_4.3           compiler_4.0.3     cli_2.2.0         
[19] rvest_0.3.6        xml2_1.3.2         triebeard_0.3.0    sass_0.3.0.9000    scales_1.1.1       callr_3.5.1       
[25] askpass_1.1        rappdirs_0.3.1     digest_0.6.27      pkgconfig_2.0.3    htmltools_0.5.1.1  dbplyr_2.0.0      
[31] rlang_0.4.10       readxl_1.3.1       rstudioapi_0.13    httpcode_0.3.0     jquerylib_0.1.3    generics_0.1.0    
[37] jsonlite_1.7.2     magrittr_2.0.1     credentials_1.3.0  Matrix_1.3-2       Rcpp_1.0.6         munsell_0.5.0     
[43] fansi_0.4.2        clipr_0.7.1        lifecycle_0.2.0    stringi_1.5.3      whisker_0.4        grid_4.0.3        
[49] crayon_1.3.4       slider_0.1.5       lattice_0.20-41    haven_2.3.1        hms_1.0.0          sys_3.4           
[55] ps_1.5.0           pillar_1.4.7       crul_1.0.0         reprex_0.3.0       XML_3.99-0.5       glue_1.4.2        
[61] evaluate_0.14      hoardr_0.5.2       data.table_1.13.6  remotes_2.2.0      modelr_0.1.8       vctrs_0.3.6       
[67] urltools_1.7.3     cellranger_1.1.0   gtable_0.3.0       openssl_1.4.3      assertthat_0.2.1   xfun_0.20         
[73] broom_0.7.3.9000   warp_0.2.0         ellipsis_0.3.1     here_1.0.1        

thanks for catching that!

up for helping out by sending a PR?

I added the ghcnd_read fxn as an afterthought, without thinking it through all the way. When we download the files with ghcnd() they are in the fixed-width format you describe, but before writing them to disk we process them to make them more digestible (see https://github.com/ropensci/rnoaa/blob/master/R/ghcnd.R#L202-L223). THEN they are written to disk in comma-separated format.

So ideally we change ghcnd_read() to read both a file directly from NOAA in fixed-width format AND a file in comma-separated format (i.e., from a call to ghcnd()). Sound good?

So, I think we:

  • factor out the code in ghcnd_GET to process a fwf file to a fxn, e.g., process_fwf
  • use process_fwf in ghcnd_GET to replace the code just factored out
  • use process_fwf in ghcnd_read if the file is fwf, or simply read as comma sep format if already a csv
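
A rough sketch of how the plan above could fit together, in base R only. Everything here is an assumption for illustration: `process_fwf` below is a simplified stand-in that keeps every field as character (it is not the code currently in ghcnd_GET), `read_ghcnd_file` is a hypothetical name, and the "first line contains a comma" test is just one possible heuristic for telling the two formats apart.

```r
# Simplified stand-in for the factored-out helper: split fixed-width records
# into 128 character columns (4 header fields + 31 x VALUE/MFLAG/QFLAG/SFLAG).
process_fwf <- function(lines) {
  widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), times = 31))
  ends   <- cumsum(widths)
  starts <- ends - widths + 1
  fields <- lapply(seq_along(widths),
                   function(i) substr(lines, starts[i], ends[i]))
  names(fields) <- c("id", "year", "month", "element",
                     paste0(rep(c("VALUE", "MFLAG", "QFLAG", "SFLAG"), times = 31),
                            rep(1:31, each = 4)))
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# Hypothetical dispatcher: choose a reader based on the first line of the file.
read_ghcnd_file <- function(path) {
  first_line <- readLines(path, n = 1, warn = FALSE)
  if (length(first_line) == 0) stop("file is empty: ", path)
  if (grepl(",", first_line, fixed = TRUE)) {
    # Heuristic (assumption): a comma means the file was already processed to csv.
    utils::read.csv(path, stringsAsFactors = FALSE)
  } else {
    # Otherwise treat it as a raw fixed-width .dly file from NOAA.
    process_fwf(readLines(path, warn = FALSE))
  }
}

# Usage with two small temporary files, one per format.
fwf_path <- tempfile(fileext = ".dly")
writeLines(paste0("USC00111577192802TMAX", strrep("-9999   ", 28),
                  "   39  0", strrep("-9999   ", 2)), fwf_path)
csv_path <- tempfile(fileext = ".dly")
writeLines(c("id,year,month,element,VALUE1", "USC00111577,1928,2,TMAX,NA"), csv_path)

from_fwf <- read_ghcnd_file(fwf_path)
from_csv <- read_ghcnd_file(csv_path)
```

Checking the content rather than the file extension is a deliberate choice in this sketch, since both the raw download and the processed file can carry the .dly name.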

I will be happy to submit a PR to fix this. It may take a little while for me to get to it, but I will be happy to do this if you're not in a hurry.

Great, not in a hurry unless CRAN maintainers get in touch about any failures, e.g. #382