creds2/Excess-Data-Exploration

Questions on Clustering Data and Script

Opened this issue · 11 comments

@mem48
Hi - thought I would test this out for some questions working through your script.

Is there a reason you don't use data.table?

Age Table - What are Mode1 and 2? (Most often and second most often?)
EPC table - Crr=Current and Ptn=Potential? How can potential_Mode be lower than Current_Mode, e.g. for E01000001?

"Joining factors with different levels" Error - I have bluffed my way through properly understanding factors for too long - I think I might need you to explain them for me please!
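On the factors question, a minimal base-R sketch may help. A factor is stored as integer codes plus a lookup table of levels, so two tables can hold the same codes with different meanings. As far as I know, the "different levels" message is raised by the join function when the key columns have different level sets; converting the keys to character before joining is one common way around it. The tables, column names and values below are invented for illustration and are not taken from the script.

```r
# Hypothetical toy tables; LSOA codes and values are made up for illustration.
age <- data.frame(lsoa  = factor(c("E01000001", "E01000002")),
                  mode1 = c("pre-1900", "1945-1964"))
epc <- data.frame(lsoa     = factor(c("E01000002", "E01000003")),
                  crr_mode = c("D", "C"))

# A factor is a vector of integer codes plus a set of levels.
levels(age$lsoa)      # the labels known to this factor
as.integer(age$lsoa)  # the underlying codes
levels(epc$lsoa)      # a different set of labels

# Code 1 means E01000001 in `age` but E01000002 in `epc`, which is why
# joining on factors with different levels needs care.
# Converting the keys to character removes the ambiguity.
age$lsoa <- as.character(age$lsoa)
epc$lsoa <- as.character(epc$lsoa)

# Full outer join on the (now character) key.
joined <- merge(age, epc, by = "lsoa", all = TRUE)
print(joined)
```

If the key really should stay a factor, giving both columns the same explicit `levels =` before joining is the alternative.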

First k-means graph - no clear elbow? Presume this is why you go on to dendrograms?
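For reference, an elbow plot of this kind is usually built by running k-means for a range of k and plotting the total within-cluster sum of squares. The sketch below uses synthetic random data as a stand-in for the scaled LSOA variables (an assumption, not the script's actual input), so it has no real cluster structure and also shows no clear elbow.

```r
set.seed(1)

# Synthetic stand-in for the scaled LSOA variables (200 areas, 4 variables).
x <- scale(matrix(rnorm(200 * 4), ncol = 4))

# Total within-cluster sum of squares for k = 1..15.
ks  <- 1:15
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# An "elbow" would show as a sharp bend; a smooth decline means
# no single k stands out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```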

line 141 has fit in it (cutree(fit,k=13)) - but fit is not created until line 149
Even if you jump forward to 149 and run the fit<-pvclust line - you get an error going back to 142 for the groups<-cutree...
Without this you can't create the groups to allocate to lsoa_house$hcluster in line 163
Any idea what needs tweaking to get those clusters allocated?
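One possible ordering, sketched with base R only: build the tree first, then cut it, then attach the groups. `cutree()` expects an `hclust`-style tree. A `pvclust` result is not itself such a tree; as far as I recall it stores one in `fit$hclust`, and `pvclust` clusters the columns of its input rather than the rows, so both points are worth checking against the pvclust documentation. The data here are synthetic and `lsoa_house` is a placeholder data frame, not the real table.

```r
set.seed(1)

# Synthetic stand-in for the numeric clustering variables (50 areas, 4 variables).
x <- scale(matrix(rnorm(50 * 4), ncol = 4))

# Placeholder for the real lsoa_house table (one row per area).
lsoa_house <- data.frame(id = seq_len(nrow(x)))

# 1. Create the tree BEFORE anything tries to cut it.
fit <- hclust(dist(x), method = "ward.D2")

# 2. Cut the tree into 13 groups (one group label per row of x).
groups <- cutree(fit, k = 13)

# 3. Attach the group labels to the table.
lsoa_house$hcluster <- groups
table(lsoa_house$hcluster)
```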

To cluster including energy usage or not?
image
image

With Energy usage certainly seems to make for more interesting results - especially with gas!

That second one above was without energy values, income, rooms or HHsize....
This one is just without energy values - not much more elucidating...
image

I think my logic is that we need to group areas using both energy and some characteristics - in order to then explain high-usage areas by the known characteristics - and then the unknown variations which we explore by other means? I don't believe that it is possible to predict energy usage through social and structural factors alone - but once we have identified some different groups of high-usage areas, we can then identify low-usage areas with the same (non-energy) characteristics and contemplate what the differences might be... though here we are clearly missing out a lot of key factors such as urban/rural location, employment, social profiling etc.

This was the cluster map for the top pair of bar charts - still working on interpreting it, but now shutting down. Have a good weekend.
Clusters 2, 11, 5 and 12 (12 being off-gas grid areas I presume)
image

I have copied over the script I used for making 13 kmeans clusters to the joint folder.
I have also created a TempTables folder and in there is a copy of the table from the clustering so that we can keep the same cluster names for now.
I have also created a table with 12 clusters (using the same methodology), as it is much easier to do multiple plot outputs (2x6, 3x4) for 12 as opposed to 13!

These are the 13 clusters ordered according to gas + electricity (and given letters)
image
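A minimal sketch of how clusters might be ranked by combined gas and electricity use and then given letters. The data are synthetic, the column names are hypothetical, and the descending order (A = highest use) is an assumption, since the direction is not stated above.

```r
set.seed(1)

# Synthetic data: 13 clusters, 5 areas each, with made-up energy values.
d <- data.frame(cluster = rep(1:13, each = 5),
                gas     = runif(65, 5000, 20000),
                elec    = runif(65, 2000, 6000))
d$energy <- d$gas + d$elec

# Mean combined energy use per cluster (names are the cluster ids).
means <- tapply(d$energy, d$cluster, mean)

# Rank clusters from highest to lowest mean use (assumed direction).
ranked <- names(sort(means, decreasing = TRUE))

# Lookup from cluster id to letter: A for the highest-use cluster, and so on.
lookup <- setNames(LETTERS[seq_along(ranked)], ranked)
d$letter <- unname(lookup[as.character(d$cluster)])

head(d)
```

Saving `lookup` alongside the cluster table would keep the letters stable across scripts.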

Here with 12 clusters - they don't look too different - so will focus mainly on 12 for now - I have also updated the tables in TempTables to include these letters
image

And here is the division of clusters by LSOA Classification (Super Groups)
(annoyingly I can't work out how to neatly/quickly force ggplot to do a full axis of A to L without missing out the no data clusters)
image
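On the axis question, one approach that may work is to store the cluster letter as a factor with all twelve levels declared and then tell the discrete scale not to drop unused levels with `scale_x_discrete(drop = FALSE)`. The data frame below is a made-up example, not the real cluster table.

```r
library(ggplot2)

# Made-up example: only some of the clusters A..L appear in the data,
# but all twelve are declared as factor levels.
d <- data.frame(cluster = factor(c("A", "B", "B", "D", "L"),
                                 levels = LETTERS[1:12]))

# drop = FALSE keeps levels with no data on the axis.
p <- ggplot(d, aes(x = cluster)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

print(p)
```

The same idea applies to fill or colour legends through the `drop` argument of the corresponding discrete scale.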

And here are the 23 groups
image

This was the script I was using... I:\Github\Excess-Data-Exploration\Tim\RScripts\Clusters\Look at 13 clusters by LSOAC.R

Good questions. Partial answer to this one:

Is there a reason you don't use data.table?

data.table is useful when speed is critical. In this situation it's not.

mem48 commented

@timchatterton are you using the classifications in the clustering or just describing the clusters with the classifications? Also, we know rural-urban matters, but kmeans is numeric only so I'm going to add population density to the set of input datasets.
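A rough sketch of what adding a numeric population density column to the k-means inputs could look like. Because the variables are on very different scales, they are standardised first so that no single variable dominates the distance. All data and column names here are synthetic assumptions.

```r
set.seed(1)

# Synthetic LSOA-like table; values and column names are invented.
lsoa <- data.frame(gas         = runif(300, 5000, 20000),
                   elec        = runif(300, 2000, 6000),
                   pop_density = rlnorm(300, meanlog = 3, sdlog = 1))

# Standardise each numeric input to mean 0 and standard deviation 1.
inputs <- scale(lsoa[, c("gas", "elec", "pop_density")])

# k-means with 12 clusters; several random starts to reduce the
# chance of a poor local optimum.
km <- kmeans(inputs, centers = 12, nstart = 25, iter.max = 50)
lsoa$cluster <- km$cluster

table(lsoa$cluster)
```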