ropensci/stats19

get_stats19 fail with multiple years and casualties

Closed this issue · 15 comments

Reprex:

casualties = get_stats19(2012:2018, type = "cas")

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

This is a reprex of the problem:

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

casualties_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
ncol(casualties_2013)
#> [1] 14
casualties_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
ncol(casualties_2014)
#> [1] 15
get_stats19(2013:2014, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpiVw2cU/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

Created on 2020-02-18 by the reprex package (v0.3.0)

while in other cases it works

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

casualties_2017 <- get_stats19(2017, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
ncol(casualties_2017)
#> [1] 16
casualties_2018 <- get_stats19(2018, type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2018)
#> [1] 16
casualties_2017_2018 <- get_stats19(c(2017, 2018), type = "cas")
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /tmp/RtmpoMV58I/dftRoadSafetyData_Casualties_2018.csv
ncol(casualties_2017_2018)
#> [1] 16
nrow(casualties_2017) + nrow(casualties_2018) == nrow(casualties_2017_2018)
#> [1] TRUE

Created on 2020-02-18 by the reprex package (v0.3.0)

So this is a problem that occurs when we use rbind with two or more data.frame/tibbles/sf having different columns.

rbind(data.frame(x = 1), data.frame(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

library(tibble)
rbind(tibble(x = 1), tibble(x = 1, y = 2))
#> Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

Created on 2020-02-18 by the reprex package (v0.3.0)

I'm not sure what's the best solution. My first idea now would be using dplyr::bind_rows (and maybe that implies moving dplyr to Imports, I'm not sure) but, as far as I know, at the moment there are some frictions between bind_rows and sf objects: r-spatial/sf#49.

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

a = st_sf(a=1, geom = st_sfc(st_point(0:1)), crs = 3857)
b = st_sf(a=1, b = 2, geom = st_sfc(st_linestring(matrix(1:4,2))), crs = 3857)

rbind(a, b)
#> Error in rbind.data.frame(...): numbers of columns of arguments do not match
bind_rows(a, b)
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_POINT' elements may not preserve
#> their attributes
#> Warning in bind_rows_(x, .id): Vectorizing 'sfc_LINESTRING' elements may not
#> preserve their attributes
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

Created on 2020-02-18 by the reprex package (v0.3.0)

Maybe we could simply raise an error in case of multiple years and output_format == "sf". Another solution would be to just take the common columns and return the rbind of common columns. Suggestion?

I suggest the base R solution that keeps all columns would be good, e.g. as shown here: https://stackoverflow.com/a/46635610/1694378

Update, having thought about it and played with it I've changed my mind... I think dplyr::bind_rows() may be the least bad option, e.g.:

cas_list = lapply(2013:2018, function(i) stats19::get_stats19(i, type = "cas"))
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2013/DfTRoadSafety_Casualties_2013.csv
#> Files identified: DfTRoadSafety_Casualties_2014.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2014.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/DfTRoadSafety_Casualties_2014/DfTRoadSafety_Casualties_2014.csv
#> Files identified: RoadSafetyData_Casualties_2015.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/RoadSafetyData_Casualties_2015.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/RoadSafetyData_Casualties_2015/Casualties_2015.csv
#> Files identified: dftRoadSafetyData_Casualties_2016.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2016.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2016/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2017.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2017.zip
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2017/Cas.csv
#> Files identified: dftRoadSafetyData_Casualties_2018.csv
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Casualties_2018.csv
#> Attempt downloading from:
#> Data saved at /tmp/RtmpyNqXnE/dftRoadSafetyData_Casualties_2018.csv
sapply(cas_list, ncol)
#> [1] 14 15 16 16 16 16
# sapply(cas_list, names)
names_diff = setdiff(names(cas_list[[5]]), names(cas_list[[1]]))
names_all = unique(unlist(lapply(cas_list, names)))

res = dplyr::bind_rows(cas_list)
res
#> # A tibble: 1,077,310 x 16
#>    accident_index vehicle_referen… casualty_refere… casualty_class
#>    <chr>                     <int>            <int> <chr>         
#>  1 201301BS70003                 2                1 Driver or rid…
#>  2 201301BS70005                 1                1 Driver or rid…
#>  3 201301BS70005                 1                2 Pedestrian    
#>  4 201301BS70006                 1                1 Passenger     
#>  5 201301BS70007                 1                1 Driver or rid…
#>  6 201301BS70009                 1                1 Driver or rid…
#>  7 201301BS70010                 1                1 Pedestrian    
#>  8 201301BS70012                 2                1 Driver or rid…
#>  9 201301BS70013                 1                1 Pedestrian    
#> 10 201301BS70015                 1                1 Pedestrian    
#> # … with 1,077,300 more rows, and 12 more variables: sex_of_casualty <chr>,
#> #   age_band_of_casualty <chr>, casualty_severity <chr>,
#> #   pedestrian_location <chr>, pedestrian_movement <chr>, car_passenger <chr>,
#> #   bus_or_coach_passenger <chr>, pedestrian_road_maintenance_worker <chr>,
#> #   casualty_type <chr>, casualty_home_area_type <chr>, age_of_casualty <int>,
#> #   casualty_imd_decile <chr>
res[names_diff]
#> # A tibble: 1,077,310 x 2
#>    age_of_casualty casualty_imd_decile
#>              <int> <chr>              
#>  1              NA <NA>               
#>  2              NA <NA>               
#>  3              NA <NA>               
#>  4              NA <NA>               
#>  5              NA <NA>               
#>  6              NA <NA>               
#>  7              NA <NA>               
#>  8              NA <NA>               
#>  9              NA <NA>               
#> 10              NA <NA>               
#> # … with 1,077,300 more rows
summary(.Last.value)
#>    Mode    TRUE 
#> logical       1

Created on 2020-02-18 by the reprex package (v0.3.0)

Remember: cas will never be an sf object...

Plan to give this some thought next week, any further ideas welcome. Thanks for the reprex @agila5, definitely food for thought and eventually action to solve this pesky bug!

layik commented

Hi Andrea and hi Robin. I think you both know my position. I will only resort to non-base if it is the only option. I hope I can find some time to give it some thought too.

Happy with non-base solution, if I can understand it! Thoughts @agila5 ?

I would also prefer a base-R solution wrt dplyr::bind_rows but I'm not 100% happy with "rbind-ing using common columns" because it drops the age_of_casuality and imd_decile variables in the previous examples, i.e.

# packages
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)

names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"

Created on 2020-02-19 by the reprex package (v0.3.0)

and I think they could be important from a modeling perspective. I will think about it in the next days.

layik commented

Thanks @agila5, it seems there is a base way of doing this which would not be "very" expensive in our case, but I just could not play with the do.call(rbind, all) to get you some code.

rbind(
        data.frame(c(df1, sapply(setdiff(names(df2), names(df1)), function(x) NA))),
        data.frame(c(df2, sapply(setdiff(names(df1), names(df2)), function(x) NA)))
      )

From great SO.

layik commented

I assume that is or something similar to that what dplyr::bind_rows does, and possibly run even more lines of code.

layik commented

This should go into tests really:

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
cas_2013 <- get_stats19(2013, type = "cas", silent = TRUE)
cas_2014 <- get_stats19(2014, type = "cas", silent = TRUE)
cas_2015 <- get_stats19(2015, type = "cas", silent = TRUE)

names(cas_2014)[!names(cas_2014) %in% names(cas_2013)]
#> [1] "age_of_casualty"
names(cas_2015)[!names(cas_2015) %in% names(cas_2014)]
#> [1] "casualty_imd_decile"

c = get_stats19(2013:2016, type = "cas")
#> Files identified: DfTRoadSafety_Casualties_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Casualties_2013.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ...

any(grepl("age_of_casualty", names(c)))
#> [1] TRUE
any(grepl("casualty_imd_decile", names(c)))
#> [1] TRUE

t = get_stats19(2013:2016, output_format = "sf")
#> Files identified: DfTRoadSafety_Accidents_2013.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/DfTRoadSafety_Accidents_2013.zip
#> Attempt downloading from:
#> Data saved at ....
#> 7 rows removed with no coordinates
class(t)
#> [1] "sf"         "data.frame"

Created on 2020-02-20 by the reprex package (v0.3.0)

layik commented

Do we need a separate ticket to change "year" to "years", would that be a breaking change?

ping #153

Thanks for the solution, I will test it carefully in the next days!

Anyway I'm not sure I understand your question. Do you want to change the name of the parameter from year to years? I wouldn't do that and I would just update the docs overwriting the description for the year parameter.

layik commented

Please open if more work is needed.

Great work @layik, harder than expected but will make important mult-year analysis more accessible.