ropensci/stats19

get_stats19

weijia2013 opened this issue · 3 comments

When I am using:

get_stats19(year = 2005, type = "accidents", data_dir = "XXX") (the directory path has been replaced), I got this error:

No files of that type found for that year.
No files found. Check the stats19 website on data.gov.uk
Files identified:
Error in if (data_already_exists) { : argument is of length zero

But when I am using:
get_stats19(year = 2005 - 2024, type = "accidents", data_dir = "XXX")

I can download the data from 1979 to 2023, which includes the 2005 data.

Why would this be happening?

Not sure, but I agree it could be clearer, and I will look to fix this. The main issue is that per-year data only exists for the last five years; before that we should default to the huge 1979-2023 dataset. Thanks for reporting. If you have any further feedback or ideas for a fix, let me know. Does the plan outlined above sound good to you (such that if you set year = 2005 you will get the data from 1979)?
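To make the proposed fallback concrete, here is a hedged sketch (a hypothetical helper, not the package's actual code): per-year files only exist for the most recent years, so older requests resolve to the combined 1979-onwards file.

```r
# Sketch of the proposed default, assuming per-year files exist only
# for the last n_recent years (names and cutoffs are illustrative).
resolve_year = function(year, latest = 2023, n_recent = 5) {
  if (year > latest - n_recent) as.character(year) else "1979"
}
resolve_year(2005)  # "1979" -- falls back to the combined dataset
resolve_year(2022)  # "2022" -- a per-year file exists
```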

It sounds like a good plan. Thanks for the update.

I think this error comes from the find_file_name function in the utils module. It searches for a file name mentioning the specific year, which, as Robin says, doesn't exist for anything before 2018.
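The "argument is of length zero" error follows from that: filtering on a year that appears in no file name returns an empty character vector, and the later existence check then sees a zero-length condition. A minimal sketch with a stand-in file list (hypothetical names, not the real stats19::file_names vector):

```r
# Hypothetical file list standing in for stats19::file_names.
file_names = c(
  "dft-road-casualty-statistics-collision-2023.csv",
  "dft-road-casualty-statistics-collision-1979-latest-published-year.csv"
)
# No name contains "2005", so the filter returns character(0).
result = file_names[grepl(pattern = 2005, x = file_names)]
length(result)  # 0
# Downstream, something like file.exists() on that empty vector is
# also length zero, and `if` on a zero-length logical is the error
# reported above:
# if (logical(0)) { }  # Error: argument is of length zero
```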

find_file_name = function(years = NULL, type = NULL) {
  result = unlist(stats19::file_names, use.names = FALSE)
  if(!is.null(years)) {
    if(min(years) >= 2016) {
      result = result[!grepl(pattern = "1979", x = result)]
    }
    result = result[!grepl(pattern = "adjust", x = result)]
    result = result[grepl(pattern = years, x = result)]
  }

  # see https://github.com/ITSLeeds/stats19/issues/21
  if(!is.null(type)) {
    type = gsub(pattern = "cas", replacement = "ics-cas", x = type)
    result_type = result[grep(pattern = type, result, ignore.case = TRUE)]
    if(length(result_type) > 0) {
      result = result_type
    } else {
      if(is.null(years)) {
       stop("No files of that type found", call. = FALSE)
      } else {
        message("No files of that type found for that year.")
      }
    }
  }

  if(length(result) < 1) {
    message("No files found. Check the stats19 website on data.gov.uk")
  }
  unique(result)
}

I changed it to this. The only change is the first part, but I'm including the full function so you can cut and paste.

find_file_name = function(years = NULL, type = NULL) {
  result = unlist(stats19::file_names, use.names = FALSE)
  if(!is.null(years)) {
    if(min(years) >= 2018) {
      result = result[grepl(pattern = years, x = result)]
    }
    if(min(years) <= 2017) {
      result = result[!grepl(pattern = "adjust", x = result)]
      result = result[grepl(pattern = "1979", x = result)]
    }
  } # close the years block here so the type check also runs when years is NULL

  # see https://github.com/ITSLeeds/stats19/issues/21
  if(!is.null(type)) {
    type = gsub(pattern = "cas", replacement = "ics-cas", x = type)
    result_type = result[grep(pattern = type, result, ignore.case = TRUE)]
    if(length(result_type) > 0) {
      result = result_type
    } else {
      if(is.null(years)) {
        stop("No files of that type found", call. = FALSE)
      } else {
        message("No files of that type found for that year.")
      }
    }
  }

  if(length(result) < 1) {
    message("No files found. Check the stats19 website on data.gov.uk")
  }
  unique(result)
}
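Under the change above, a pre-2018 year should now resolve to the combined 1979-onwards file. A quick check of just the years branch, using a stand-in file list (hypothetical names, not the real stats19::file_names vector):

```r
# Hypothetical file list standing in for stats19::file_names.
file_names = c(
  "dft-road-casualty-statistics-collision-2023.csv",
  "dft-road-casualty-statistics-collision-1979-latest-published-year.csv",
  "dft-road-casualty-statistics-collision-adjustment.csv"
)
years = 2005
result = file_names
if (min(years) >= 2018) {
  result = result[grepl(pattern = years, x = result)]
}
if (min(years) <= 2017) {
  # drop "adjust" files, then keep only the combined 1979 file
  result = result[!grepl(pattern = "adjust", x = result)]
  result = result[grepl(pattern = "1979", x = result)]
}
result  # the single 1979-onwards file name
```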

This was included in the pull request from last week, which I am using on my machine, but it is failing the automated checks. If anyone has a chance, please take a look.

I also wondered, since so much effort has gone into utilising the local store of downloaded data, whether it might be helpful to add an extra step that splits the 1979-2023 dataset into years, so each file is saved as type_year.RDS and the other functions work off those. It would add a little time to the import step, but speed up analysis of multiple years?
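That split step could look roughly like this (a sketch only: the helper name is hypothetical, and it assumes the combined data frame has an `accident_year` column, which may differ in the actual data):

```r
# Sketch: split a combined data frame into per-year RDS files.
# `year_col` is an assumed column name -- adjust to the real schema.
split_by_year = function(d, type, data_dir, year_col = "accident_year") {
  for (y in unique(d[[year_col]])) {
    f = file.path(data_dir, paste0(type, "_", y, ".RDS"))
    saveRDS(d[d[[year_col]] == y, ], f)
  }
}

# Later reads would then only touch the years requested, e.g.:
# readRDS(file.path(data_dir, "accidents_2005.RDS"))
```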