The following project is an Exploratory Data Analysis based on a data set obtained from kaggle, US Accidents, which was intended to be my first kernel in Kaggle. For that reason I decided to work the code on my personal computer and upload the updates to kaggle every now and then. I have decided to keep a copy on github in order to keep all my work of data processing together in one website. However, the kaggle kernel that I created can be found here

INTRODUCTION

With the following exploratory analysis we will see how much parameters like humidity and traffic signals influence the accidents in the USA, and why some others are not so decisive. We will explore how in different states different parameters are more important than others, and how different variables could help us to predict accidents.

We can get started, but first things first: calling the required libraries and the data:

library(tidyverse)
library(cowplot)
library(lubridate)

#US_accidents <- read_csv('../input/us-accidents/US_Accidents_May19.csv')
load('eda.RData')

Now we can start with a general overview of the data, the tibble, and a quick summary of the numeric variables using the basic R function summary

## First view
US_accidents

## # A tibble: 2,243,939 x 49
##    ID    Source   TMC Severity Start_Time          End_Time           
##    <chr> <chr>  <dbl>    <dbl> <dttm>              <dttm>             
##  1 A-1   MapQu…   201        3 2016-02-08 05:46:00 2016-02-08 11:00:00
##  2 A-2   MapQu…   201        2 2016-02-08 06:07:59 2016-02-08 06:37:59
##  3 A-3   MapQu…   201        2 2016-02-08 06:49:27 2016-02-08 07:19:27
##  4 A-4   MapQu…   201        3 2016-02-08 07:23:34 2016-02-08 07:53:34
##  5 A-5   MapQu…   201        2 2016-02-08 07:39:07 2016-02-08 08:09:07
##  6 A-6   MapQu…   201        3 2016-02-08 07:44:26 2016-02-08 08:14:26
##  7 A-7   MapQu…   201        2 2016-02-08 07:59:35 2016-02-08 08:29:35
##  8 A-8   MapQu…   201        3 2016-02-08 07:59:58 2016-02-08 08:29:58
##  9 A-9   MapQu…   201        2 2016-02-08 08:00:40 2016-02-08 08:30:40
## 10 A-10  MapQu…   201        3 2016-02-08 08:10:04 2016-02-08 08:40:04
## # … with 2,243,929 more rows, and 43 more variables: Start_Lat <dbl>,
## #   Start_Lng <dbl>, End_Lat <lgl>, End_Lng <lgl>, `Distance(mi)` <dbl>,
## #   Description <chr>, Number <dbl>, Street <chr>, Side <chr>, City <chr>,
## #   County <chr>, State <chr>, Zipcode <chr>, Country <chr>,
## #   Timezone <chr>, Airport_Code <chr>, Weather_Timestamp <dttm>,
## #   `Temperature(F)` <dbl>, `Wind_Chill(F)` <dbl>, `Humidity(%)` <dbl>,
## #   `Pressure(in)` <dbl>, `Visibility(mi)` <dbl>, Wind_Direction <chr>,
## #   `Wind_Speed(mph)` <dbl>, `Precipitation(in)` <dbl>,
## #   Weather_Condition <chr>, Amenity <lgl>, Bump <lgl>, Crossing <lgl>,
## #   Give_Way <lgl>, Junction <lgl>, No_Exit <lgl>, Railway <lgl>,
## #   Roundabout <lgl>, Station <lgl>, Stop <lgl>, Traffic_Calming <lgl>,
## #   Traffic_Signal <lgl>, Turning_Loop <lgl>, Sunrise_Sunset <chr>,
## #   Civil_Twilight <chr>, Nautical_Twilight <chr>,
## #   Astronomical_Twilight <chr>

## Numeric vars
US_accidents %>%
    select(Severity,
           `Distance(mi)`,
           `Temperature(F)`,
           `Wind_Chill(F)`,
           `Humidity(%)`,
           `Pressure(in)`,
           `Visibility(mi)`,
           `Wind_Speed(mph)`,
           `Precipitation(in)`) %>%
    summary()

##     Severity      Distance(mi)      Temperature(F)   Wind_Chill(F)    
##  Min.   :0.000   Min.   :  0.0000   Min.   :-77.80   Min.   :-65.9    
##  1st Qu.:2.000   1st Qu.:  0.0000   1st Qu.: 48.90   1st Qu.: 19.2    
##  Median :2.000   Median :  0.0000   Median : 63.00   Median : 28.7    
##  Mean   :2.383   Mean   :  0.2879   Mean   : 61.23   Mean   : 26.0    
##  3rd Qu.:3.000   3rd Qu.:  0.0100   3rd Qu.: 75.90   3rd Qu.: 36.4    
##  Max.   :4.000   Max.   :333.6300   Max.   :170.60   Max.   : 45.2    
##                                     NA's   :62265    NA's   :1852370  
##   Humidity(%)      Pressure(in)   Visibility(mi)   Wind_Speed(mph) 
##  Min.   :  4.00   Min.   : 0.00   Min.   :  0.00   Min.   :  1.2   
##  1st Qu.: 50.00   1st Qu.:29.92   1st Qu.: 10.00   1st Qu.:  5.8   
##  Median : 68.00   Median :30.03   Median : 10.00   Median :  8.1   
##  Mean   : 65.93   Mean   :30.04   Mean   :  9.12   Mean   :  8.8   
##  3rd Qu.: 85.00   3rd Qu.:30.15   3rd Qu.: 10.00   3rd Qu.: 11.5   
##  Max.   :100.00   Max.   :33.04   Max.   :140.00   Max.   :822.8   
##  NA's   :64467    NA's   :57280   NA's   :71360    NA's   :442954  
##  Precipitation(in)
##  Min.   : 0.0     
##  1st Qu.: 0.0     
##  Median : 0.0     
##  Mean   : 0.1     
##  3rd Qu.: 0.0     
##  Max.   :10.8     
##  NA's   :1979466

By default, calling the data as a tibble gives us already valuable information on the data set. Using summary on selected variables returns a first description of our data, and the first impressions about the accidents:

There are extreme temperatures, from -77 to 170 F
Wind chill also extreme to -65
Maximum wind speed reaches 822 mph
Humidity reaches 100 %
Precipitation, distance affected and visibility do not seem to be of big importance

We can go now to the Boolean variables:

## Boolean vars
US_accidents %>%
    select(Amenity:Turning_Loop) %>%
    summary()

##   Amenity           Bump          Crossing        Give_Way      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2217962   FALSE:2243700   FALSE:2122156   FALSE:2239215  
##  TRUE :25977     TRUE :239       TRUE :121783    TRUE :4724     
##   Junction        No_Exit         Railway        Roundabout     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2056574   FALSE:2241773   FALSE:2225741   FALSE:2243811  
##  TRUE :187365    TRUE :2166      TRUE :18198     TRUE :128      
##   Station           Stop         Traffic_Calming Traffic_Signal 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2207714   FALSE:2222168   FALSE:2243321   FALSE:1885291  
##  TRUE :36225     TRUE :21771     TRUE :618       TRUE :358648   
##  Turning_Loop   
##  Mode :logical  
##  FALSE:2243939  
##

## Graphical representation
signals <- US_accidents %>%
    select(Amenity:Turning_Loop) %>%
    pivot_longer( cols = Amenity:Turning_Loop,
        names_to = 'Annotation',
        values_to = 'Trues') %>%
    filter(Trues == TRUE) %>%
    group_by(Annotation) %>%
    summarise(Total = n())

signals %>%
    ggplot() +
    geom_bar(aes(y = Total,
                 x = reorder(Annotation, Total, FUN = abs),
                 fill = Total),
             stat = 'identity') +
    coord_flip() +
    labs(x = NULL) +
    theme(legend.position="none")

We can also see that most of the accidents happen at traffic signals, then junctions and crossing.

## Other variables
##
## API which reported accident
unique(US_accidents$Source)

## [1] "MapQuest"      "MapQuest-Bing" "Bing"

## Traffic message channel
unique(US_accidents$TMC)

##  [1] 201 241 247 246 341 406 245 248 200 244 203 229 222 202 206 343 236
## [18] 239 336 339 351  NA

## Severity: 1 is lesser, 4 is higher
unique(US_accidents$Severity)

## [1] 3 2 1 4 0

There are 3 sources of the data, 21 traffic message channels and 4 values of severity.

NUMERIC VARIABLES

Now we are going deeper into the numeric variables that could possibly have certain influence in the occurrence of accidents, due to their extreme values. Therefore, we start by observing if the extreme values (such as too cold or too hot weather) are a normal trend for the accidents to happen, or they are rather atypical values.

## Temperature
US_accidents %>%
    ggplot() +
    geom_histogram(aes(x = `Temperature(F)`))

## Central view on the main trend
US_accidents %>%
    filter(`Temperature(F)` > 0 &
           `Temperature(F)` <= 100) %>%
    ggplot() +
    geom_histogram(aes(x = `Temperature(F)`))

## More specific view at the main values
US_accidents %>%
    filter(`Temperature(F)` > 45 &
           `Temperature(F)` <= 80) %>%
    ggplot() +
    geom_histogram(aes(x = `Temperature(F)`))

In the first plot we can immediately see that below 0 and above 100°F there is practically nothing. Therefore we create a second plot centred between this values. Here we can see that most of the accidents happen between 45 and 80°F, with some particular peaks. Thus, we create a third plot where we can see those peaks in more detail. There is a particular trend of a considerable increase in the number of accidents marked by a peak, and then a drop, follow by a slow increase, and then the same trend. This could be caused most probably by the instruments of measurement in the meteorological station, to certain extent.

## Wind chill
US_accidents %>%
    ggplot() +
    geom_histogram(aes(x = `Wind_Chill(F)`))

US_accidents %>%
    select(`Wind_Chill(F)`) %>%
    filter(`Wind_Chill(F)` < 30) %>%
    summarise(n())

## # A tibble: 1 x 1
##    `n()`
##    <int>
## 1 209936

There is an increase in the number of accidents with the increase of the wind temperature, with the peak at 75°F. The number of accidents at really cold temperatures (here I chose 30°F) is rather small compared with the total accidents: 209,936 out of 2,243,939, or 9.35 percent

## Wind speed
US_accidents %>%
    ggplot() +
    geom_histogram(aes(x = `Wind_Speed(mph)`))

US_accidents %>%
    select(`Wind_Speed(mph)`) %>%
    filter(`Wind_Speed(mph)` < 25) %>%
    ggplot() +
    geom_histogram(aes(x = `Wind_Speed(mph)`))

Despite of the fact that maximum wind speed reaches 822.8 mph, we can see in our first plot that most of the accidents happened at rather low speed. More interesting: when we zoom into the values below 25 mph we can see again a trend of gaps between certain values. This is clearly a measurement issue. Thus, our theory about the temperature problems could be certainly related to the measurement instruments.

## Humidity
US_accidents %>%
    ggplot() +
    geom_histogram(aes(x = `Humidity(%)`))

US_accidents %>%
    select(`Humidity(%)`) %>%
    filter(`Humidity(%)` > 35) %>%
    ggplot() +
    geom_histogram(aes(x = `Humidity(%)`))

Unlike our previous variables, we can see that the number of accidents increases when the humidity increases.

STATES

Now we can explore a little bit the accidents in different states: which state has the main occurrence of accidents? Which one the most severe ones? Is there a relationship with the traffic lights or the humidity as we discovered early?

We can start with the basic questions, amount of accidents per state and severity:

## Amount of accidents per state
US_accidents %>%
    select(State) %>%
    group_by(State) %>%
    summarise(Total = n()) %>%
    ggplot() +
    geom_bar(aes(y = Total,
                 x = reorder(State, Total, FUN = abs),
                 fill = Total),
             stat = 'identity') +
    coord_flip() +
    labs(x = NULL) +
    theme(legend.position="none")

## Amount per state/severity, proportional
US_accidents %>%
    select(Severity, State) %>%
    mutate(State = parse_factor(State),
           Severity = parse_factor(as.character(Severity))) %>%
    group_by(State, Severity) %>%
    summarise(Total = n()) %>%
    ggplot() +
    geom_bar(aes(y = Total,
                 x = State,
                 fill = Severity),
             stat = 'identity',
             position = 'fill') +
    coord_flip() +
    scale_fill_brewer(palette = 'BuPu') +
    labs(x = NULL)

At first glance it is clear that most of the accidents were reported from CA, then TX and FL. As for the severity, most of the records are for severity levels 2 and 3. It can be interesting to see if the states where most of the accidents happen are also correlated with the severity. For this I have created a function to to plot each severity value and filter values above a given number of observations. Using this function we can later adapt each graphic for a better visualisation:

## Function
plot_SevState <- function(sev, min = 10000) {
    US_accidents %>%
        select(State, Severity) %>%
        filter(Severity == sev) %>%
        group_by(State) %>%
        summarise(Total = n()) %>%
        filter(Total > min) %>%
        ggplot() +
        geom_bar(aes(y = Total,
                     x = reorder(State, Total, FUN = abs),
                     fill = Total),
                 stat = 'identity') +
        coord_flip() +
        labs(x = NULL, y = paste("Severity", sev, "Total")) +
        theme(legend.position="none")
}

## Visualization
plot_grid(
    plot_SevState(1, min = 50),
    plot_SevState(2, min = 50000),
    plot_SevState(3),
    plot_SevState(4, min = 2000))

As expected, CA is in the first positions of the four plots, while TX and FL are also around. When it comes to the accidents with severity level 4, we see FL in the first position, followed by GA and CA. The order vary, but the difference between each state is minimal compared with the number of entries that we have.

We can also check the relationship between states and our most influencial variables, such as humidity and traffict signals.

## Relationship with humidity
US_accidents %>%
    select(State, `Humidity(%)`, Severity) %>%
    filter(Severity != 0) %>%
    mutate(State = parse_factor(State),
           Severity = parse_factor(as.character(Severity))) %>%
    group_by(State, Severity) %>%
    summarise(Humidity = mean(`Humidity(%)`, na.rm = T)) %>%
    ggplot(aes(x = State, y = Humidity)) +
    geom_point(aes(colour = Severity, shape = Severity),
               size = 6) +
    scale_fill_brewer(palette = 'BuPu') +
    labs(y = 'Average humidity')

US_accidents %>%
    select(State, `Humidity(%)`, Severity) %>%
    filter(Severity != 0) %>%
    mutate(State = parse_factor(State),
           Severity = parse_factor(as.character(Severity))) %>%
    group_by(State, Severity) %>%
    summarise(Humidity = mean(`Humidity(%)`, na.rm = T)) %>%
    filter(Humidity > 70) %>%
    ggplot(aes(x = State, y = Humidity)) +
    geom_point(aes(colour = Severity, shape = Severity),
               size = 6) +
    scale_fill_brewer(palette = 'BuPu') +
    labs(y = 'Average humidity')

## States with main accidents at traffic signals
US_accidents %>%
    select(State, Traffic_Signal) %>%
    filter(Traffic_Signal == TRUE) %>%
    group_by(State) %>%
    summarise(Total = n()) %>%
    filter(Total > 5000) %>%
    ggplot() +
    geom_bar(aes(y = Total,
                 x = reorder(State, Total, FUN = abs),
                 fill = Total),
             stat = 'identity') +
    coord_flip() +
    labs(x = NULL, y = 'No. of accidents at traffic signals') +
    theme(legend.position="none")

Accidents of severity 1 happen at different ranges of humidity, from dry conditions to very humid. However, most of the accidents happen at average humidity percentage between 60 and 80. For accidents of level 4 there is and increase in the average humidity, happening between 70-77% most of the times. At Hum > 80% happen mostly accidents level 1. ND state is a particular case, with level 4 accidents of average humidity of 91%, CA and TX are not among the states with average humidity above 70%.

As for traffic signals, we have again our top states heading the others, but in a different order. CA comes in third place, it could be that traffic signal areas are not the mean place for accidents in CA.

## View at CA only
filter(US_accidents, State == "CA") %>%
    select(Amenity:Turning_Loop) %>%
    summary()

##   Amenity           Bump          Crossing        Give_Way      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:481563    FALSE:484676    FALSE:475619    FALSE:484525   
##  TRUE :3143      TRUE :30        TRUE :9087      TRUE :181      
##   Junction        No_Exit         Railway        Roundabout     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:420390    FALSE:484495    FALSE:479325    FALSE:484694   
##  TRUE :64316     TRUE :211       TRUE :5381      TRUE :12       
##   Station           Stop         Traffic_Calming Traffic_Signal 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:476064    FALSE:478652    FALSE:484655    FALSE:453662   
##  TRUE :8642      TRUE :6054      TRUE :51        TRUE :31044    
##  Turning_Loop   
##  Mode :logical  
##  FALSE:484706   
##

It looks like in CA, most of the accidents are reported at junctions more than at traffic signals.

TIME

Now we are going to have a quick view on the occurrence of accidents in time. The first idea and easiest option would be to look at the occurrence of accidents in specific moments, say grouped by Month (to represent time of the year) and by Hour (to obtain time of the day). For this we can use the package lubridate which makes such work easier:

## Months
US_accidents %>%
    select(Start_Time) %>%
    transmute(Month = month(Start_Time,
                            label = T,
                            locale = 'en_US.utf8')) %>%
    ggplot(aes(x = Month)) +
    geom_bar()

## Month/Year
US_accidents %>%
    select(Start_Time) %>%
    transmute(Month = month(Start_Time,
                            label = T,
                            locale = 'en_US.utf8'),
              Year = year(Start_Time)) %>%
    filter(Year != 2015) %>% # only 7 accidents overall
    group_by(Year, Month) %>%
    summarise(`No of accidents` = n()) %>%
    ggplot(aes(x = Month, y = `No of accidents`)) +
    geom_bar(stat = 'identity') +
    facet_grid(~Year)  +
    theme(axis.text.x = element_text(angle = 90))

## Per hour
US_accidents %>%
    select(Start_Time) %>%
    transmute(Hour = hour(Start_Time)) %>%
    mutate(Hour = factor(as.character(Hour),
                         levels = as.character(c(0:23)))) %>%
    ggplot(aes(Hour)) +
    geom_histogram(stat = 'count')

The amount of accidents reduces between April and July, as well as during night time. It seems that most of the accidents happen in the morning, between 7-8 hrs, and at early evening, between 16-17 hrs. In other view, less accidents are happening during the warmest months and during the coldest hours. That might answer the question why we did not find a trend in the occurrence of accidents related to Temperature or Wind Chill, but Humidity, considering that humidity raises at dusk and down, with the movements of the sun.

## Month and humidity
US_accidents %>%
    select(Start_Time, `Humidity(%)`) %>%
    mutate(Month = month(Start_Time,
                            label = T,
                         locale = 'en_US.utf8')) %>%
    group_by(Month) %>%
    summarise(Avg_Hum = mean(`Humidity(%)`, na.rm = T),
              Count = n()) %>%
    ggplot(aes(x = Month, y = Avg_Hum)) +
    geom_point(aes(size = Count))

## Hour and Humidity
US_accidents %>%
    select(Start_Time, `Humidity(%)`) %>%
    mutate(Hour = hour(Start_Time),
           Hour = factor(as.character(Hour),
                         levels = as.character(c(0:23)))) %>%
    group_by(Hour) %>%
    summarise(Avg_Hum = mean(`Humidity(%)`, na.rm = T),
              Count = n()) %>%
    ggplot(aes(x = Hour, y = Avg_Hum)) +
    geom_point(aes(size = Count))

As expected, we can easily observe the rise of the humidity at dusk and down (moments of most accidents), and its reduction between March and July (months with less accidents).

teotenn/kaggle-US_accidents

INTRODUCTION

NUMERIC VARIABLES

STATES

TIME