A simple time series regression to understand the different steps of modeling
Link to the article where I published this analysis!
Linear regression is one of the most common models used by data scientists: an outcome, or target variable, is explained by a set of features. In some cases, though, the same variable is collected over time, and the data are a sequence of measurements of that variable taken at regular time intervals. Welcome to time series. One difference from standard linear regression is that the observations are not necessarily independent and not necessarily identically distributed. Working with time series can be frustrating because you have to look for correlation between the current value and its own past values (lags), or the errors of earlier predictions. Ordering also matters: changing the order of the observations changes the meaning of the data. Because of this complexity, data scientists sometimes get lost in the process of time series analysis. In this blog, I am going to share a full time series analysis guided by one of the well-known data science methodologies: OSEMIN.
The visual above shows the methodology used in my study, from gathering the data to drawing conclusions. The data used for this analysis contain the date and the number of accidents for each of 1,461 days in the UK, from January 1, 2014 to December 31, 2017. I used a dataset from Kaggle for this exercise: I downloaded a CSV file and used the popular pandas function 'pd.read_csv' to load it into a DataFrame. No other independent variables were considered, as I am focused on the time series itself. The main purpose of this study is to explain the different steps of a full data science project. A second objective is to find out whether the number of accidents on a given day depends on the number of accidents on previous days. The 3 questions that the study seeks to answer are: What is the relation between the number of accidents on the current day and on the day prior? Is there any pattern that can help predict (or prevent) the number of accidents in the UK on any given day? Is the month of the year or the day of the week related to the number of accidents?
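To make the setup concrete, here is a minimal sketch of how the data could be loaded and how a first look at the three questions might go. It is not the exact code from my analysis. The column names 'Date' and 'Total_Accident' and the helper 'load_daily_counts' are assumptions made for illustration, and randomly generated counts stand in for the Kaggle file so that the snippet runs on its own. The numbers it prints are therefore not findings.

```python
import io

import numpy as np
import pandas as pd


def load_daily_counts(source, date_col="Date", count_col="Total_Accident"):
    """Read a CSV of daily counts into a Series indexed by date.

    The column names are hypothetical; adjust them to match the real file.
    """
    df = pd.read_csv(source, parse_dates=[date_col])
    counts = df.set_index(date_col)[count_col].sort_index()
    # Make the daily frequency explicit; any missing day would show up as NaN.
    return counts.asfreq("D")


# Stand-in data: random Poisson counts (lam=400 is arbitrary) over the same
# date range as the study. These are NOT the real accident numbers.
rng = np.random.default_rng(42)
dates = pd.date_range("2014-01-01", "2017-12-31", freq="D")
fake = pd.DataFrame(
    {"Date": dates, "Total_Accident": rng.poisson(lam=400, size=len(dates))}
)
buffer = io.StringIO()
fake.to_csv(buffer, index=False)
buffer.seek(0)

# With the real file, this would be load_daily_counts("path/to/file.csv").
counts = load_daily_counts(buffer)

# Question 1: correlation between each day and the day before it.
lag1 = counts.autocorr(lag=1)

# Question 3: average count per day of the week and per month of the year.
by_weekday = counts.groupby(counts.index.day_name()).mean()
by_month = counts.groupby(counts.index.month).mean()

print(f"Number of days: {len(counts)}")
print(f"Lag-1 autocorrelation: {lag1:.3f}")
print(by_weekday)
print(by_month)
```

Setting the date as the index is a design choice that keeps the ordering explicit and makes lagging and grouping by calendar fields one-liners. Question 2, finding a predictive pattern, is what the modeling steps of the analysis are for.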