You can find a full analysis at https://agiq.github.io/bikesharing/ A bike-sharing system is a service in which it made bikes available for shared use to individuals on a short-term basis for a price or free. The company is struggling to cope with the current market conditions. You feel that your business is being crippled by the lockdown, and that it will take a long time to recover once the lockdown is over. You feel you need to do something to prepare yourself for when things get back to normal. The consulting company wants to understand the factors affecting the demand for these shared bikes in the American market.
There are two analysis. This write up is based on bike.ipynb which perform less optimum compared to bike_dummy_vars.ipynb where I added more dummy variables. The objective is to find which variable(s) is/are significant in predicting bike rental 'cnt'. First, perform exploratory data analysis.
- There are 730 rows and 16 columns in this data set.
- The data set has 8 categorical data and 4 numerical data after data type conversion.
- Fortunately, there is 0 missing data and 0 duplicate values.
- The Target variable is 'cnt'.
- Binary columns (with 0 & 1) are: yr, holiday, and working day
- Columns that can turn into dummies variable is: weathsit
- Columns that can turn into categorical are: 'season','mnth','yr','holiday','weekday','workingday','weathersit','day'
Next, perform data visualization using boxplot on categorical data and use heatmap on numerical data.
Analysis of categorical data:
- Falls has the highest rental. Summer is the next highest. Both seasons are when kids are out of school. People spend time with them families and friends.
- Month: More bike rentals occur between May and September
- Year: 2019 bike rental is higher than 2018 suggesting the trend it up.
- Holiday: non-holiday (0) has a wider range compared to holiday
- Weekday: Wednesday and Thursday appear to have more bike rental
- Workingday: inconclusive – they are about the same
- Weathersit: Bike rental is high when it is Clear, Few clouds, Partly cloudy, Partly cloudy
- Day: It seems that bike rental occurs more in the middle of the month
Analysis of numerical data showing correlation.
- We drop 'instant' as it provides no real value in model training. 'instant' is assigned to each row.
- Both 'registered' and 'casual' are highly correlateed to 'cnt' the target variable. They are dropped.
- 'atemp' and 'temp' are are closely correlated so we drop 'atemp'.
Next, do a groupby of each numeric variable with 'cnt' and plot them. 'temp' vs 'cnt' show a linear uptrend. 'hum' and 'windspeed' were sideway as they don't appear to add much value to the model.
As part of data visualization, conduct a stacked histogram plot to show which features in each categorical data are highest. This is to provide another confirmation of our observeration above.
For model training, validating, and evaluating, we perform simple linear regression , multilinear regression and RFE training. Once completed, we evaluate the trained model. The objective is to show manual add/remove variable, have the computer automagically select variables, and perfom a hybrid approach.
- Simple Linear Regression (SLR): temp displayed a linear red line upward. The residual analysis depicted via histogram is normally distributed. There is no identifiable pattern found.
- Multi Linear Regression (MLR): first add one variable, traing the model and reiterate. In this process, observe the R-squared and P-value. The second method is to add all variables. Then, remove the highest P-value one at a time. Utilize the corresponding high VIF value to identify the next highest variable to drop. Only drop one at a time. Manual model training is show the following variables contribute to higher bike rental: season, yr, holiday, weekday, weathersit, temp, and windspeed. With that we ended up with 77.8% R-square and 77.9% r2_score. In bike_dummy_vars.ipynb, I added more dummy variable. The r2_score didn't change but the R2-squared increased from 77.8% to 79.9%.
- Use Recursive Feature Elimination (RFE) to automate which features to include in training. After dropping unwanted variables, the R-squared is 80.9% and the r2_score is 78.9%. After I added more dummy variables in bike_dummy_vars.ipynb, R-squared for RFE jumped from 80.9% to 83% and r2_score increased to 80.3% from 79.9%.
- Converting categorical variables into dummy variables increased the r2_score.
- Manual selection of variables for training is a tedious job. The automated feature selection is nice, but the computer is not a subject matter expert in your industry. Therefore, a hybrid approach is recommended.
For each of the training method(SLR, MLR, and RFE), plot the error terms, and get r2_score. Also, plot y_test against y_test_pred on a scatter plot.
Use the latest version of the following libraries:
- pandas - data structures and operations for manipulating numerical tables
- numpy - scientific computing in Python
- matplotlib.pyplot - graph and chart
- seaborn - graph and chart
Below are machine learning libraries
- sklearn
- sklearn.model_selection
- sklearn.preprocessing MinMaxScaler
- sklearn.metrics, r2_score
- sklearn.metrics, mean_squared_error
- statsmodels.api statsmodels.stats.outliers_influence, variance_inflation_factor
Use datetime to work with date and time
- datetime
Created by [@agiq] - feel free to contact me!