`ghcnd_read` expects the wrong file format for .dly files.
jonathan-g opened this issue · 3 comments
Bug description
`ghcnd_read` fails with an error because it expects a `.dly` file to be a `.csv` file, but it is a fixed-width file with no delimiters between columns.
Reprex
library(rnoaa)
download.file("ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/USW00013897.dly", "USW00013897.dly")
ghcnd_read("USW00013897.dly")
#> Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
#> cols = 1 != length(data) = 128
#> Error: Columns 2, 3, 4, 5, 6, and 122 more must be named.
Created on 2021-01-29 by the reprex package (v0.3.0)
This is what the first several lines of `USW00013897.dly` look like:
USC00111577192802TMAX-9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 39 0-9999 -9999
USC00111577192802TMIN-9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -28 0-9999 -9999
USC00111577192802PRCP-9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 0T 0-9999 -9999
And this is the file format, as described in the readme-1.txt at the FTP site ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt:
III. FORMAT OF DATA FILES (".dly" FILES)
Each ".dly" file contains data for one station. The name of the file
corresponds to a station's identification code. For example, "USC00026481.dly"
contains the data for the station with the identification code USC00026481).
Each record in a file contains one month of daily data. The variables on each
line include the following:
------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------
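To make the layout above concrete, here is a minimal base-R sketch that slices each record by the column positions in the table. The helper name `parse_dly_lines` and the lower-case names for the first four columns are assumptions for illustration and are not part of rnoaa. Treating -9999 as the missing-value code follows the GHCN-Daily readme, not the excerpt quoted here.

```r
# Widths from the readme table: ID (11), YEAR (4), MONTH (2), ELEMENT (4),
# then 31 repeats of VALUE (5), MFLAG (1), QFLAG (1), SFLAG (1) = 269 columns.
dly_widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), times = 31))
dly_names <- c(
  "id", "year", "month", "element",
  paste0(c("VALUE", "MFLAG", "QFLAG", "SFLAG"), rep(1:31, each = 4))
)

# Hypothetical helper (not an rnoaa function): parse raw .dly records,
# given as a character vector with one record per element.
parse_dly_lines <- function(lines) {
  # Start and end position of every field, derived from the widths
  starts <- cumsum(c(1, head(dly_widths, -1)))
  ends <- starts + dly_widths - 1

  # Cut each field out of every line and drop the padding spaces
  fields <- lapply(seq_along(starts), function(i) {
    trimws(substring(lines, starts[i], ends[i]))
  })
  names(fields) <- dly_names
  dat <- as.data.frame(fields, stringsAsFactors = FALSE)

  # YEAR, MONTH and VALUE1..VALUE31 are integers; flags stay character
  value_cols <- grep("^VALUE", dly_names, value = TRUE)
  int_cols <- c("year", "month", value_cols)
  dat[int_cols] <- lapply(dat[int_cols], as.integer)

  # Assumption from the GHCN-Daily readme: -9999 marks a missing value
  dat[value_cols] <- lapply(dat[value_cols], function(x) {
    x[which(x == -9999L)] <- NA_integer_
    x
  })
  dat
}
```

With the file from the reprex, `parse_dly_lines(readLines("USW00013897.dly"))` would give one row per station-month-element and 128 columns, which matches the `length(data) = 128` in the warning above.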
Session Info
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rnoaa_1.3.0 revealjg_0.9.9006 rprojroot_2.0.2 reticulate_1.18 lubridate_1.7.9.2 forcats_0.5.0
[7] stringr_1.4.0 dplyr_1.0.3 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.5
[13] ggplot2_3.3.3 tidyverse_1.3.0 yaml_2.2.1 rmarkdown_2.6.6.9000 knitr_1.30 pacman_0.5.1
loaded via a namespace (and not attached):
[1] fs_1.5.0 usethis_2.0.0 RColorBrewer_1.1-2 httr_1.4.2 tools_4.0.3 backports_1.2.1
[7] bslib_0.2.3.9000 utf8_1.1.4 R6_2.5.0 DBI_1.1.1 colorspace_2.0-0 withr_2.4.0
[13] tidyselect_1.1.0 gridExtra_2.3 processx_3.4.5 curl_4.3 compiler_4.0.3 cli_2.2.0
[19] rvest_0.3.6 xml2_1.3.2 triebeard_0.3.0 sass_0.3.0.9000 scales_1.1.1 callr_3.5.1
[25] askpass_1.1 rappdirs_0.3.1 digest_0.6.27 pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.0.0
[31] rlang_0.4.10 readxl_1.3.1 rstudioapi_0.13 httpcode_0.3.0 jquerylib_0.1.3 generics_0.1.0
[37] jsonlite_1.7.2 magrittr_2.0.1 credentials_1.3.0 Matrix_1.3-2 Rcpp_1.0.6 munsell_0.5.0
[43] fansi_0.4.2 clipr_0.7.1 lifecycle_0.2.0 stringi_1.5.3 whisker_0.4 grid_4.0.3
[49] crayon_1.3.4 slider_0.1.5 lattice_0.20-41 haven_2.3.1 hms_1.0.0 sys_3.4
[55] ps_1.5.0 pillar_1.4.7 crul_1.0.0 reprex_0.3.0 XML_3.99-0.5 glue_1.4.2
[61] evaluate_0.14 hoardr_0.5.2 data.table_1.13.6 remotes_2.2.0 modelr_0.1.8 vctrs_0.3.6
[67] urltools_1.7.3 cellranger_1.1.0 gtable_0.3.0 openssl_1.4.3 assertthat_0.2.1 xfun_0.20
[73] broom_0.7.3.9000 warp_0.2.0 ellipsis_0.3.1 here_1.0.1
thanks for catching that!
up for helping out by sending a PR?
I added the `ghcnd_read` function as an afterthought, without thinking it through all the way. When we download the files with `ghcnd()` they are in the fixed-width format you describe, but before writing them to disk we process them to make them more digestible (see https://github.com/ropensci/rnoaa/blob/master/R/ghcnd.R#L202-L223), and THEN they are written to disk in comma-separated format.
So ideally we change `ghcnd_read()` to read both a file taken directly from NOAA in fixed-width format AND a file in comma-separated format (i.e., from a call to `ghcnd()`). Sound good?
So, I think we:
- factor out the code in `ghcnd_GET` that processes a fwf file into a function, e.g., `process_fwf`
- use `process_fwf` in `ghcnd_GET` to replace the code just factored out
- use `process_fwf` in `ghcnd_read` if the file is fwf, or simply read it as comma-separated format if it is already a csv
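A rough sketch of the dispatch step in that plan, to make the idea concrete. The names `is_dly_fwf` and `read_ghcnd_file` are hypothetical, and `process_fwf` is passed in as an argument because its real body would be the code factored out of `ghcnd_GET`. The sketch assumes it accepts the raw lines of the file. The comma check is only a heuristic: a raw NOAA record has no delimiters, while the file written by `ghcnd()` is comma separated.

```r
# Hypothetical check: treat the file as fixed width when its first line
# contains no comma. Raw NOAA records have no delimiters at all.
is_dly_fwf <- function(path) {
  first <- readLines(path, n = 1, warn = FALSE)
  length(first) == 1 && !grepl(",", first, fixed = TRUE)
}

# Hypothetical dispatcher standing in for the proposed ghcnd_read() logic.
# `process_fwf` is assumed to take a character vector of raw lines and
# return a data frame; the real signature may differ.
read_ghcnd_file <- function(path, process_fwf) {
  if (is_dly_fwf(path)) {
    # File straight from NOAA: run the shared fixed-width processing
    process_fwf(readLines(path, warn = FALSE))
  } else {
    # File already processed and cached as comma-separated text
    utils::read.csv(path, stringsAsFactors = FALSE)
  }
}
```

Keeping the format check in its own small function would let `ghcnd_GET` and `ghcnd_read` share `process_fwf` without either one needing to know how the format is detected.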
I will be happy to submit a PR to fix this. It may take a little while for me to get to it, but I will be happy to do this if you're not in a hurry.