ropensci/stats19

Data quality question/issue

timcoote opened this issue · 8 comments

This is really a point about the data, but it impacts analyses done with it.

I believe that the speed limit column of accident data has errors in it. As an example, Accident Index '2009460170410' has a speed limit of 30. However, that Latitude, Longitude location is in a 40mph zone, which, I believe, dates back to the last millennium.

I realise that this issue is not under the repo's control, but I thought I'd document it/bring it to attention so that it can be confirmed/invalidated.

I have reported this to DfT.

Thanks for reporting the issue @timcoote. Agree there are quality issues with the data, and a few issues that have come to light in a public forum thanks to community input around this package (see #101 and #91 which we are close to acting on in #178 for example). It's good that you've reported the issue to the DfT. Our approach is to link to the DfT documentation and be only a provider of the data, staying faithful to the raw data. If you see opportunities to improve the documentation/code in any way please do let us know.

It strikes me that there would be value in a document on data quality, highlighting issues as they are noticed. This is quite a long time series and, I suspect, this project is going to expose it to many more eyes than have seen it in the past. #101 is a very good example of change over time; this one is, I suspect, an issue with the data never having been checked. (There may be a similar issue with Latitude and Longitude, or how they map onto Google's maps, as I can see several accidents inside shops ;-) )

An early heads up on where the bear traps are will make life easier for newcomers to the data.

Such a document could also track interactions with DfT in improving their data governance and quality, demonstrating the value of the repo/opening up the data.

There is no specific suggestion here, but I will keep this issue open in the hope that it encourages further feedback and Pull Requests from others, especially domain specialists who work with this data on a daily basis, like analysts who work at Agilysis (hint ;) ).

PRs welcome.

If you just want a PR for a "Known Data Quality Issues" document, I could kick one of those off for you.

I did get a response from DfT:
"""
There might be differences between the reported speed limit and the actual speed limit of the road where the accident happened for the following reasons:

  • DfT does not validate speed limits as there is not a national speed limit data base, therefore we rely on the speed limit reported.
  • Police officers do not attend all accidents, some of the STATS19 accidents are reported by members of the public, therefore the information reported is not as reliable as the one reported by police officers.
  • Speed limit change.
    """

I'd already checked the second and third of these as not being the source of the error in at least one case.
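For anyone wanting to repeat this kind of check on more records, here is a minimal sketch. It assumes you have a hand-verified lookup of limits (DfT says there is no national speed limit database, so this cannot be automated from an official source). The dictionary layout and the function name `speed_limit_mismatches` are illustrative only, not part of stats19; the single record is the one from this issue.

```python
# Reported limits as they appear in the accident table.
# Keyed by Accident Index; the layout is illustrative, not the package schema.
reported = {
    "2009460170410": 30,  # value reported in the data, per this issue
}

# Limits checked by hand against the road itself (hypothetical lookup).
verified = {
    "2009460170410": 40,  # the location is in a 40mph zone, per this issue
}


def speed_limit_mismatches(reported, verified):
    """Return {accident_index: (reported, verified)} where the two differ."""
    return {
        idx: (reported[idx], limit)
        for idx, limit in verified.items()
        if idx in reported and reported[idx] != limit
    }


mismatches = speed_limit_mismatches(reported, verified)
for idx, (rep, ver) in mismatches.items():
    print(f"{idx}: reported {rep} mph, verified {ver} mph")
```

Only records present in both lookups are compared, so the output says nothing about accidents that nobody has verified.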

I fell into issues with #101 when trying to estimate changes in accident rates in areas, so I had a quick look at other years, and raised the issue of quality checking / updates with DfT, proposing that they support some sort of effort to post-process the data to iron out known issues/document what's left behind. (I'm noting this here as that issue is closed and this one is, I believe, left to track data quality issues).

I wanted to document that Python's basemap package has too crude a resolution to be useful. Even at full resolution, ~70% of locations are falsely identified as not is_land :-(

On the upside, the years from 1999 onward seem to have < 0.02% of locations in water, assuming that an elevation of 0.0 reported from open-elevation is a reasonable proxy for being in the water. It's not perfect, but it looks reasonable. A better check may be to look for locations outside the police area, but I quickly gave up on that. If I get time, I may try to estimate the error rate up to 1998.
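A rough sketch of the elevation proxy, in case it helps anyone repeat it. It assumes the elevation lookups have already been fetched and are held as a list of dicts with an `elevation` key; that shape, the function name `fraction_at_zero_elevation`, and the sample values are all assumptions for illustration, not real results.

```python
def fraction_at_zero_elevation(results):
    """Share of locations whose reported elevation is exactly 0.0.

    An elevation of 0.0 is used as a crude proxy for "in the water".
    """
    if not results:
        return 0.0
    in_water = sum(1 for r in results if r["elevation"] == 0.0)
    return in_water / len(results)


# Made-up sample in the assumed shape (not real accident locations).
sample = [
    {"latitude": 51.5, "longitude": -0.12, "elevation": 11.0},
    {"latitude": 50.7, "longitude": -1.30, "elevation": 0.0},
    {"latitude": 53.4, "longitude": -2.24, "elevation": 38.0},
    {"latitude": 54.9, "longitude": -1.60, "elevation": 52.0},
]

share = fraction_at_zero_elevation(sample)
print(f"{share:.2%} of sample locations at elevation 0.0")
```

The proxy will also flag any genuine land location that happens to sit at sea level, so the share is an upper-ish estimate rather than a count of real errors.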

Great stuff @timcoote thanks for the updates, keep plugging away at it. Any updates / thoughts on implications for this package and how to make it better: v. welcome.

@Robinlovelace one thing that did pop out when I looked at the 2022 data is the inclusion of vehicle information. So I was going to try a quick cross check on the number of licensed vehicles just to see if there are any obviously over-represented manufacturers/models/types. (this is probably the wrong thread for this comment, but it was the first that came to hand)
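One way that cross-check could look, as a sketch only: compare each make's share of vehicles in collisions with its share of licensed vehicles. The function name `over_representation`, the input shapes, and the placeholder makes and counts below are hypothetical and not taken from the 2022 data or any licensing statistics.

```python
from collections import Counter


def over_representation(collision_makes, licensed_counts):
    """Ratio of each make's collision share to its licensed-fleet share.

    A ratio above 1 means the make appears in collisions more often than
    its share of licensed vehicles would suggest. Makes without a licensed
    count are skipped, though they still count towards the collision total.
    """
    collisions = Counter(collision_makes)
    total_collisions = sum(collisions.values())
    total_licensed = sum(licensed_counts.values())
    ratios = {}
    for make, n in collisions.items():
        licensed = licensed_counts.get(make)
        if not licensed:
            continue  # no denominator available for this make
        ratios[make] = (n / total_collisions) / (licensed / total_licensed)
    return ratios


# Placeholder inputs, purely for illustration.
collision_makes = ["MAKE_A", "MAKE_A", "MAKE_A", "MAKE_B"]
licensed_counts = {"MAKE_A": 500, "MAKE_B": 500}

ratios = over_representation(collision_makes, licensed_counts)
for make, ratio in sorted(ratios.items()):
    print(f"{make}: {ratio:.2f}")
```

A raw ratio like this ignores mileage and vehicle age, so it is only a first filter for obviously odd makes, not a risk estimate.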

Sounds good to me, good luck with it and do keep us posted!