Text in a double is creating NA `accident_index` values

Question

tra6sdc opened this issue 9 months ago · 3 comments

Hello,

accident_index is a concatenation of accident_year and accident_reference.
accident_index is read as type double, but accident_reference is of type character and can contain letters.
As a result, any accident_index that contains a letter fails to parse and becomes NA:

> casualty_2018<-get_stats19(year = 2018, type = "casualty", format = TRUE)
Files identified: dft-road-casualty-statistics-casualty-2018.csv

   https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2018.csv
Data already exists in data_dir, not downloading

-- Column specification -----------------------------------------------------------------------------------------
cols(
  accident_index = col_double(),
  accident_year = col_double(),
  accident_reference = col_character(),
  vehicle_reference = col_double(),
  casualty_reference = col_double(),
  casualty_class = col_double(),
  sex_of_casualty = col_double(),
  age_of_casualty = col_double(),
  age_band_of_casualty = col_double(),
  casualty_severity = col_double(),
  pedestrian_location = col_double(),
  pedestrian_movement = col_double(),
  car_passenger = col_double(),
  bus_or_coach_passenger = col_double(),
  pedestrian_road_maintenance_worker = col_double(),
  casualty_type = col_double(),
  casualty_home_area_type = col_double(),
  casualty_imd_decile = col_double(),
  lsoa_of_casualty = col_character()
)

Warning: 22715 parsing failures.
  row            col               expected        actual                                                                                                  file
30320 accident_index no trailing characters 201801T266389 'C:\Users\tra6sdc\AppData\Local\Temp\RtmpQZxJ35/dft-road-casualty-statistics-casualty-2018.csv'
30321 accident_index no trailing characters 201801T271905 'C:\Users\tra6sdc\AppData\Local\Temp\RtmpQZxJ35/dft-road-casualty-statistics-casualty-2018.csv'
30322 accident_index no trailing characters 201801T274868 'C:\Users\tra6sdc\AppData\Local\Temp\RtmpQZxJ35/dft-road-casualty-statistics-casualty-2018.csv'
30323 accident_index no trailing characters 201801T274868 'C:\Users\tra6sdc\AppData\Local\Temp\RtmpQZxJ35/dft-road-casualty-statistics-casualty-2018.csv'
30324 accident_index no trailing characters 201801T278015 'C:\Users\tra6sdc\AppData\Local\Temp\RtmpQZxJ35/dft-road-casualty-statistics-casualty-2018.csv'
..... .............. ...................... .. [... truncated]
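
The parsing failures above can be reproduced in isolation. The following is a minimal sketch, assuming readr is installed; the first value is a made-up all-digit index used only for contrast, and the second is copied from the warning output.

```r
library(readr)

# First value: hypothetical all-digit index (an assumption, not from the data).
# Second value: taken from the parsing-failure table above.
idx <- c("2018010080971", "201801T266389")

# Parsed as double, the value containing "T" cannot be converted.
as_double <- suppressWarnings(parse_double(idx))
is.na(as_double)

# Parsed as character, both values are kept exactly as written.
as_char <- parse_character(idx)
as_char
```

This only isolates the symptom; it says nothing about how get_stats19() chooses its column types internally.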

Answer 1 · 2023-11-14T10:45:17.000Z

Additionally, parsing as double sometimes produces a hugely inflated accident_index:

       accident_index accident_year accident_reference vehicle_reference               vehicle_type
78577    2.018135e+12          2018          1352F0005                 1                        Car
78720    2.018135e+61          2018          1352L0054                 1                        Car
78721    2.018135e+61          2018          1352L0054                 2                        Car
82295    2.018136e+63          2018          1358F0056                 1 Motorcycle 125cc and under
127865   2.018340e+67          2018          340D00061                 1                        Car
77910    2.018135e+72          2018          1351F0065                 1                Pedal cycle
77911    2.018135e+72          2018          1351F0065                 2                        Car
81982    2.018136e+79          2018          1357S0072                 1      Taxi/Private hire car
214387  2.018630e+123          2018          63D000118                 1                        Car
214388  2.018630e+123          2018          63D000118                 2       Agricultural vehicle
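
One possible reading of the table above, not confirmed anywhere in this thread: the single letter in accident_reference (D, F, L or S) is being treated like the "e" in scientific notation, so the digits after it become a power of ten. The base R sketch below only checks that arithmetic against the values shown; it does not call the parser that produced them.

```r
# Accident references copied from the table above (all from 2018).
refs <- c("1352F0005", "1352L0054", "1358F0056", "340D00061",
          "1351F0065", "1357S0072", "63D000118")

# Split each reference at its single letter.
before <- sub("[A-Z].*$", "", refs)  # digits before the letter
after  <- sub("^.*[A-Z]", "", refs)  # digits after the letter

# Assumption: the letter acts as an exponent marker,
# i.e. the value is <2018><before> x 10^<after>.
reconstructed <- as.numeric(paste0("2018", before, "e", after))

data.frame(accident_reference = refs, reconstructed = reconstructed)
```

If the reconstructed column lines up with the accident_index column in the table, that supports the exponent interpretation, and reading the column as character would avoid it.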

Answer 2 · 2023-11-14T11:01:58.000Z

Thanks for raising the issue. Any thoughts on the underlying cause and a solution?

My thinking: we can set the type with cols(): https://readr.tidyverse.org/reference/cols.html
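
A minimal sketch of that idea, assuming readr is installed. The three-row CSV is a hypothetical stand-in for the real casualty file (only the second and third index values echo ones seen in this thread), and how such a specification would be passed through get_stats19() is not shown here.

```r
library(readr)

# Hypothetical stand-in for the downloaded casualty CSV.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "accident_index,accident_year,accident_reference,vehicle_reference",
  "2018010080971,2018,010080971,1",
  "201801T266389,2018,01T266389,1",
  "201863D000118,2018,63D000118,2"
), csv_path)

# Force the two identifier columns to character; let readr guess the rest.
casualty <- read_csv(
  csv_path,
  col_types = cols(
    accident_index = col_character(),
    accident_reference = col_character(),
    .default = col_guess()
  )
)

casualty$accident_index
```

With the identifier columns read as character, values containing letters are kept as written instead of being converted to NA or to a number.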

Answer 3 · 2023-11-14T11:07:55.000Z

Yep, explicitly specify the column data types rather than letting readr guess them. The inflated accident_index values may be a hexadecimal thing (although 'L' isn't a hexadecimal character), and the above solution might deal with that issue too. I don't believe this is something I can fix myself.