Prompt crash detection is essential to reduce incident congestion and increase road safety. Conventional road crash detections have been performed with various Intelligent Transportation Systems (ITS) devices. However, the deployment of ITS devices along a corridor requires tremendous amount of money and time. To that end, I developed a cost-effective and reliable road crash detection model using XGBoost. The model was trained with the integrated dataset containing traffic, weather, road geometry, solar position, and population density. These data have both a direct and indirect impact on causing crashes, and most of the data are downloaded from the state DOT website. A total of six years crash and non-crash records were processed and used for the model train and test. The dataset for the model train and test includes 30 features and 54,643 observations.
We obtained and processed six years of crash data (occurred on NJ Route 3) and crash related data, including INRIX link speed, road geometry, weather, and population density. Below shows the download link.
- NJ road crash data (download)
- INRIX probe vehicle data for TMC links (download)
- Road geometry (download)
- NJ Route 3 vertical profile (download)
- Population density (download)
- Weather data (Teterborough Station) (download)
- Road Traffic Volume (NJCMS NJ3 only) (download)
Note that the weather data was pulled from the NOAA and the vertical profile was obtained from the Google Earth.
Data processing of the collected data is shown in NJ3_Crash_Data_Pre-processing.ipynb. Then, the processed datasets were merged together by time and location, which is shown in Join_CrashData_RoadLink_INRIX-TMC.ipynb
Overall, the features used for the model training are summarized below.
Category | Feature | Type | Description |
---|---|---|---|
Traffic | VOL | Continuous | Car volume (veh/hr) |
TRK | Continuous | Truck volume (veh/hr) | |
spd | Continuous | Speed (mph) | |
avg_spd_5min | Continuous | Avg. speed for the previous 5-min interval (mph) | |
avg_spd_10min | Continuous | Avg. speed for the previous 10-min interval (mph) | |
avg_spd_15min | Continuous | Avg. speed for the previous 15-min interval (mph) | |
dn1_spd_diff | Continuous | Speed difference between crash link and downstream link (mph) | |
up1_spd_diff | Continuous | Speed difference between crash link and upstream link (mph) | |
Std¬_5min | Continuous | STD of speed for the previous 5-min (mph) | |
Std_10min | Continuous | STD of speed for the previous 10-min (mph) | |
Std_15min | Continuous | STD of speed for the previous 15-min (mph) | |
cv_15min | Continuous | CV of speed for the previous 5-min | |
cv_10min | Continuous | CV of speed for the previous 10-min | |
cv_5min | Continuous | CV of speed for the previous 15-min | |
CAPLINK | Continuous | Link Capacity (veh/hr) | |
avg_spd_diff_up1dn1_5min | Continuous | Avg. speed difference between upstream and downstream link for the previous 5-min (mph) | |
Geometry | MP | Continuous | Milepost |
Lanes | Categorical | Number of Lanes | |
RAMP_prox | Categorical | Ramp Proximity | |
Grade | Continuous | Vertical Road Grade (%) | |
RAMP_shop | Categorical | Shopping Mall Entrance Existence (y/n) | |
Weather | Temp | Continuous | Temperature (F) |
Precip. | Continuous | Precipitation Depth (inch) | |
Snow D | Continuous | Snow Depth (inch) | |
Solar | solar_altitude | Continuous | Angular Solar Elevation (degree) |
solar_azimuth | Continuous | Angular Solar Azimuth (degree) | |
Time | Month | Categorical | Month |
hour | Categorical | Hour | |
Weekday | Categorical | Day of Week | |
Demographic | Pop_Dense | Continuous | Town Population Density (person/mi^{2}) |
We used XGBoost library for the supervised machine learning model. The model was trained using the following tuned parameters.
max_depth
: 6.0min_child_weight
: 5.0reg_lambda
: 1.0reg_alpha
: 0.0scale_pos_weight
: 1.0eval_metric
: auceta
: 0.5
The model train and test is shown in XGBoostModeling.ipynb
With the feature importance analysis, we have more insight into what features contributed, and on what scale, to mapping the test data to an output. A limitation of the feature importance plot is that it does not provide insight into the interactions between features, and it does not show how the results and features are correlated (e.g., negatively or positively). To address the shortcomings of the feature importance analysis, SHAP (Shapley Additive exPlanations) was adopted to interpret the output of the trained model. Below shows the primary features that affect positively (or negatively) on crash detection.
- Crash_Data_Preprossing.ipynb
- Get_weather-NJ.ipynb
- Merge_TMC-INRIX_Tables.ipynb
- Solar Position_Vertical Grade_Weather_and_Non-Crash Data.ipynb
- XGBoost Modeling.ipynb