Download the CSV file “accidents2019.csv” from https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data.
The data was originally downloaded from the UK Department for Transport road safety database (“Road Safety Data - Accidents 2019”).
Further information about the data is available on the website.
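A minimal sketch of how the file could be loaded and inspected with pandas. The helper names `load_accidents` and `describe_accidents` are hypothetical and used for illustration only; the sketch assumes the file has one header row and one row per accident.

```python
import pandas as pd


def load_accidents(path):
    """Read the accidents CSV into a DataFrame (assumed: one row per accident)."""
    # low_memory=False lets pandas infer each column's dtype from the whole file
    return pd.read_csv(path, low_memory=False)


def describe_accidents(df, row_number=100):
    """Return the row count, the column names, and the row_number-th row (1-based)."""
    n_rows = len(df)
    columns = list(df.columns)
    # iloc is 0-based, so the 100th row sits at position 99
    nth_row = df.iloc[row_number - 1]
    return n_rows, columns, nth_row
```

Typical use would be `df = load_accidents("accidents2019.csv")` followed by `describe_accidents(df)`, with the file saved in the working directory.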
- How many accidents happened in 2019?
- What are the column names in the data frame?
- Report the date (day/month/year) and coordinates (longitude and latitude) of the accident in the 100th row of the data frame. (The row at the top of the data frame counts as the 1st row.)
- Identify the columns containing information about the type of vehicles involved in the accident.
- Calculate the total number of casualties.
- Calculate the difference in the number of casualties between Lower Layer Super Output Area (LSOA) E01032739 (City of London 001F) and E01033708 (Hackney 027G).
- Identify the LSOA that saw the highest total number of casualties in 2019.
- What are the possible values that occur in the column Did_Police_Officer_Attend_Scene_of_Accident, and how often does each value occur?
- The value Did_Police_Officer_Attend_Scene_of_Accident = 1 indicates that a police officer attended the accident. What do the other value(s) in that column mean? (Consult the data website to answer this question.)
- What fraction of accidents was attended by a police officer?
- What fraction of those accidents that happened on a weekday was attended by a police officer? How does this number compare to the corresponding fraction for accidents on weekends?
- Visualise the locations of all accidents by a scatter plot of Latitude vs Longitude. Annotate the axes, add a plot title, and increase the figure size to 10in by 10in.
- Create a scatter plot similar to the one in the previous question, but zoom in on a 2 by 2 degree area that includes Exeter/Devon, and use different colours for accidents that happened in rural and in urban areas. Include a red marker at the coordinates of Exeter/Devon (as given on Wikipedia).
- Are accidents at higher speed limits more likely to be fatal than at lower speed limits? Answer the question with an appropriate data visualisation and a short written summary.
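The questions above mostly reduce to counting, grouping and plotting. The following is a hedged sketch of some of these steps, not a model answer. Only Did_Police_Officer_Attend_Scene_of_Accident, Latitude and Longitude are named in the text; the other column names (Day_of_Week, Number_of_Casualties, LSOA_of_Accident_Location, Speed_limit, Accident_Severity) and the codes (1 = Sunday, 7 = Saturday, severity 1 = fatal) are assumptions that should be checked against the actual data frame and the data guide on the website. All function names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

ATTEND_COL = "Did_Police_Officer_Attend_Scene_of_Accident"  # named in the text


def attended_fraction(df):
    """Fraction of accidents where the attendance column equals 1."""
    return (df[ATTEND_COL] == 1).mean()


def attended_fraction_by_day_type(df, day_col="Day_of_Week", weekend_codes=(1, 7)):
    """Attendance fraction split into weekday and weekend.

    ASSUMPTION: day_col holds codes with 1 = Sunday and 7 = Saturday.
    """
    day_type = df[day_col].isin(weekend_codes).map({True: "weekend", False: "weekday"})
    return (df[ATTEND_COL] == 1).groupby(day_type).mean()


def casualties_by_lsoa(df, lsoa_col="LSOA_of_Accident_Location",
                       casualty_col="Number_of_Casualties"):
    """Total casualties per LSOA, largest first (column names are ASSUMPTIONS)."""
    return df.groupby(lsoa_col)[casualty_col].sum().sort_values(ascending=False)


def fatal_share_by_speed_limit(df, speed_col="Speed_limit",
                               severity_col="Accident_Severity", fatal_code=1):
    """Share of fatal accidents at each speed limit.

    ASSUMPTION: severity code 1 marks a fatal accident.
    """
    return (df[severity_col] == fatal_code).groupby(df[speed_col]).mean()


def plot_accident_locations(df, lon_col="Longitude", lat_col="Latitude"):
    """Scatter plot of accident locations on a 10in by 10in figure."""
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.scatter(df[lon_col], df[lat_col], s=1)  # small markers for many points
    ax.set_xlabel("Longitude (degrees)")
    ax.set_ylabel("Latitude (degrees)")
    ax.set_title("Locations of road accidents, 2019")
    return fig, ax
```

The zoomed plot can reuse the returned axes via `ax.set_xlim` and `ax.set_ylim`, and a bar chart of `fatal_share_by_speed_limit(df)` is one possible visualisation for the speed limit question.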
To complete the following questions, you have to load the “Linnerud physical exercises data” from the scikit-learn package using the following commands:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
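Before fitting anything, it can help to check how the returned object is organised. A short sketch, using only attributes of the object returned by `load_linnerud`; the variable names `exercise_names` and `physio_names` are illustrative.

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# scikit-learn stores the exercise variables in `data`
# and the physiological variables in `target`.
exercise_names = list(linnerud.feature_names)
physio_names = list(linnerud.target_names)

print(exercise_names, linnerud.data.shape)
print(physio_names, linnerud.target.shape)

# The free-text description documents the origin of the data:
# print(linnerud.DESCR)
```

Note that “Chins” is therefore a column of `linnerud.data`, while “Weight”, “Waist” and “Pulse” are columns of `linnerud.target`, even though the questions use them the other way round as target and covariates.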
- State the author’s last name and year of the study in which that data first appeared.
- Using the appropriate function from the scikit-learn package, fit a simple linear regression model with the number of chin-ups (“Chins”) as the target variable and “Weight” as the covariate. Report the fitted regression coefficients, and interpret the slope coefficient.
- Your lecturer is a middle-aged male who weighs 170 pounds (“Weight”), has a 32 inch waist (“Waist”), and a resting heart rate of 70 (“Pulse”). How many chin-ups (“Chins”) do you think he can do? (Use linear regression in scikit-learn on the Linnerud data to answer the question.)
- What linear combination $\alpha_w W + \alpha_c C + \alpha_p P$ of the physiological variables weight $W$, waist circumference $C$ and pulse $P$ has the highest possible variance among all possible linear transformations? State the weights $\alpha_w$, $\alpha_c$ and $\alpha_p$ rounded to 2 decimal places. (The variance maximisation is subject to the constraint $\alpha_w^2 + \alpha_c^2 + \alpha_p^2 = 1$, and you may assume that the appropriate scikit-learn function adheres to this constraint.)
- What might the interpretation of this one-dimensional representation of a person’s physiological factors be, and in what context could it be useful?
- What is the variance of the linear combination with $\alpha_w = \alpha_c = \alpha_p = \sqrt{1/3}$, and how does this variance compare to the variance of the linear combination calculated in question 1 in this section?
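The regression and variance questions above could be approached along the following lines. This is a sketch under stated assumptions, not a model answer: it takes `LinearRegression` and `PCA` to be the "appropriate" scikit-learn functions, treats the first principal component as the maximum-variance combination, and uses the sample variance (denominator n - 1). All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
physio = linnerud.target      # columns: Weight, Waist, Pulse
chins = linnerud.data[:, 0]   # first exercise column is Chins

# Simple linear regression: Chins on Weight only.
simple = LinearRegression().fit(physio[:, [0]], chins)
intercept, slope = simple.intercept_, simple.coef_[0]

# Multiple regression on all three covariates, then a prediction for
# Weight = 170, Waist = 32, Pulse = 70.
multi = LinearRegression().fit(physio, chins)
predicted_chins = multi.predict(np.array([[170.0, 32.0, 70.0]]))[0]

# First principal component: unit-length weights with maximal variance.
pca = PCA(n_components=1).fit(physio)
alpha = pca.components_[0]
var_pc1 = pca.explained_variance_[0]

# Equal weights 1/sqrt(3) also satisfy the unit-length constraint.
equal = np.full(3, 1.0 / np.sqrt(3.0))
var_equal = np.var(physio @ equal, ddof=1)

print("intercept, slope:", intercept, slope)
print("predicted chin-ups:", predicted_chins)
print("PC1 weights:", np.round(alpha, 2))
print("variance PC1 vs equal weights:", var_pc1, var_equal)
```

The sign of the principal component weights is arbitrary, so `alpha` and `-alpha` describe the same direction and have the same variance.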