creds2/Excess-Data-Exploration

Basic Linear Modelling


image

OK - I have got the plots that I sort of wanted!
Basically, for this version I have created a linear model using the top 5 structural factors for each of gas, electricity, car energy (averaged over all households) and car energy (only HH with cars).

I have plotted modelled vs measured and added a 1:1 line. Anything to the right of the line could be excess? I have also colored in red the points where measured is more than 25% above modelled.
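The 25% rule described above can be written down in a few lines. This is a minimal sketch in Python rather than the project's own script: the function name `flag_excess` and the sample numbers are illustrative assumptions, and only the 25% threshold comes from the comment.

```python
def flag_excess(measured, modelled, threshold=0.25):
    """Return one flag per area: True where measured exceeds modelled by more than `threshold`."""
    return [m > p * (1.0 + threshold) for m, p in zip(measured, modelled)]


# Made-up values for four areas, all with the same modelled consumption.
measured = [100.0, 130.0, 90.0, 126.0]
modelled = [100.0, 100.0, 100.0, 100.0]

# Areas flagged True would be the ones colored red on the plot.
flags = flag_excess(measured, modelled)
print(flags)
```

Using a relative threshold rather than simply "above the 1:1 line" leaves a margin for ordinary model error before an area is treated as a candidate for excess.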

Basically you get a good linear fit for gas and electric. It doesn't work so well for the MOT data!

Any thoughts welcome.

These were the r-squared values: 0.6-0.7 for gas and electric, MUCH MUCH lower for the cars. (These used structural factors, not social ones, but I will discuss with Sally and Jillian where they got to.)

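| Model | Variable | R2 |
| --- | --- | --- |
| Electricity | Rooms | 0.4316 |
| Electricity | % Gas Heating | 0.6644 |
| Electricity | % Elec Heating | 0.6928 |
| Electricity | % Flats | 0.7029 |
| Electricity | Built 1930s | 0.7064 |
| Gas | Rooms | 0.5146 |
| Gas | Flats | 0.5721 |
| Gas | 1930-39 | 0.5948 |
| Gas | 1900-18 | 0.6064 |
| Gas | % Gas Heating | 0.6167 |
| Car (All HH) | Pop Density | 0.06699 |
| Car (All HH) | Cars per HH | 0.1155 |
| Car (All HH) | % HH without cars | 0.118 |
| Car (All HH) | PT time to Town Centre | 0.1179 |
| Car (All HH) | % Active to Work | 0.1191 |
| Cars (HH with Cars) | Pop Density | 0.02138 |
| Cars (HH with Cars) | % HH without cars | 0.02189 |
| Cars (HH with Cars) | % Cycle to Work | 0.02194 |
| Cars (HH with Cars) | Distance to Work | 0.0226 |
| Cars (HH with Cars) | Cars per HH (with Cars) | 0.02932 |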
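One way to arrive at a table like this is to add predictors one at a time and record R-squared after each addition. The thread does not say that the figures above were produced this way, so treat the following as a sketch under that assumption. It uses plain Python with a small least-squares solver; the function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]  # fails if predictors are perfectly collinear
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_r2(columns, y):
    """R-squared of y regressed on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * y[i] for i, r in enumerate(rows)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def cumulative_r2(named_columns, y):
    """Add predictors in the given order and report R-squared after each one."""
    used, out = [], []
    for name, col in named_columns:
        used.append(col)
        out.append((name, ols_r2(used, y)))
    return out


# Toy data: y depends exactly on both predictors, so R-squared reaches 1 at the second step.
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * ai + 3 * bi for ai, bi in zip(a, b)]
results = cumulative_r2([("a", a), ("b", b)], y)
for name, r2 in results:
    print(name, round(r2, 4))
```

With nested ordinary least-squares fits, plain R-squared cannot decrease when a predictor is added, which makes the size of each step a rough guide to how much that variable contributes beyond those already in the model.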

Interesting stuff @timchatterton, many thanks for sharing these results. More discussion to follow no doubt.

mem48 commented

I've committed a small fix, but I can't reproduce your plots as the code is missing.

mem48 commented

Also, can you explain how you chose your variables? E.g. why %cycling rather than %driving to work?

Hi - I clearly hadn't saved the right version of the script to GitHub. The gas issue was spotted quite quickly and sorted out, and the plots were added to the bottom of the code. I believe this version is now updated.

The variables were taken from the top 5 most important (structural) ones according to the XGBoost models.

mem48 commented

Hi @timchatterton, I was hoping to talk to you at the meeting, but I was off sick. Fortunately I'm much better now. I wanted to draw your attention to some experiments in modelling at https://github.com/creds2/Excess-Data-Exploration/blob/master/Modeling_Summary.md. I was able to get much better results for the driving, and comparable results for gas and electric. I used an approach of taking the single most important variable, then finding what correlated with the residuals, and repeating.

It gives me a slightly different selection of variables, but you can see the "logic" is similar in both your results and mine. The driving result is very strongly correlated (r-squared of 0.85), but I'm getting some S-curved results, which suggests I'm not handling the non-linearity correctly. Any suggestions?