Questions on Clustering Data and Script

Question

Opened this issue 5 years ago · 11 comments

@mem48
Hi - thought I would test this out for some questions working through your script.

Is there a reason you don't use data.table?

Age Table - What are Mode1 and 2? (Most often and second most often?)
EPC table - Crr=Current and Ptn=Potential? How can potential_Mode be lower than Current_Mode, e.g. E01000001?

"Joining factors with different levels" Error - I have bluffed my way through properly understanding factors for too long - I think I might need you to explain them for me please!
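For what it's worth, this message appears to be the warning that dplyr joins give when the key column is a factor in both tables but the two factors do not share the same set of levels. A minimal base R sketch of what a factor is and one way round the mismatch follows. The tables `age` and `epc`, their columns, and the codes E01000002 and E01000003 are made up for illustration and are not taken from the script.

```r
# A factor stores integer codes plus a lookup table of labels ("levels").
a <- factor(c("E01000001", "E01000002"))
b <- factor(c("E01000002", "E01000003"))

levels(a)        # the labels known to a
as.integer(a)    # the underlying codes for a
levels(b)        # a different set of labels
as.integer(b)    # same codes as a, but they point at different labels

# Hypothetical tables keyed on LSOA code (illustrative columns only)
age <- data.frame(LSOA = a, age_mode = c(30, 45))
epc <- data.frame(LSOA = b, epc_mode = c("D", "C"))

# Converting both keys to character before joining sidesteps the mismatch
age$LSOA <- as.character(age$LSOA)
epc$LSOA <- as.character(epc$LSOA)
joined <- merge(age, epc, by = "LSOA", all.x = TRUE)
print(joined)
```

Because the integer codes only mean something alongside their own level table, a join cannot safely compare the codes of two factors with different levels, so the keys get treated as plain text instead.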

First k-means graph - no clear elbow? Presume this is why you go on to dendrograms?
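For reference, an elbow plot of this kind can be sketched in base R as below. This is only an illustration on synthetic data, assuming the graph shows total within-cluster sum of squares against the number of clusters; the matrix `x` and the range of 1 to 15 clusters are assumptions, not values from the script.

```r
set.seed(42)  # k-means uses random starts, so fix the seed for repeatability

# Synthetic stand-in for the LSOA table: 300 rows, 4 standardised columns
x <- scale(matrix(rnorm(300 * 4), ncol = 4))

max_k <- 15
wss <- numeric(max_k)

# For k = 1 the within-cluster sum of squares is the total sum of squares
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))

# For k >= 2, record the total within-cluster sum of squares from k-means
for (k in 2:max_k) {
  wss[k] <- kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
}

plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

An elbow only shows up when the data contain reasonably well-separated groups, so a smooth curve is a plausible outcome on real area data.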

Line 141 uses fit (cutree(fit,k=13)), but fit is not created until line 149.
Even if you jump forward to 149 and run the fit<-pvclust line, you get an error on going back to 142 for the groups<-cutree... line.
Without this you can't create the groups to allocate to lsoa_house$hcluster in line 163.
Any idea what needs tweaking to get those clusters allocated?
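One possible explanation, offered as a guess because the script itself is not shown here: cutree() expects a tree of the kind hclust() returns, whereas pvclust() returns its own object (with the tree held in its `hclust` component) and clusters the columns of the data rather than the rows. A minimal sketch, assuming an hclust() step was meant to come before the cutree() call, and using a made-up stand-in for lsoa_house:

```r
set.seed(1)

# Hypothetical stand-in for lsoa_house: 50 areas, 3 numeric variables
lsoa_house <- data.frame(gas = rnorm(50), elec = rnorm(50), rooms = rnorm(50))

# Build the tree first: distances between rows, then Ward clustering
d   <- dist(scale(lsoa_house), method = "euclidean")
fit <- hclust(d, method = "ward.D2")

# Then cut the tree into 13 groups and attach them to the table
groups <- cutree(fit, k = 13)
lsoa_house$hcluster <- groups

table(lsoa_house$hcluster)  # number of areas in each cluster
```

The column names and the choice of `ward.D2` are assumptions for the sketch; the point is only that `fit` has to exist, and be an hclust tree over the rows, before cutree() is run.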

Answer 1 · 2019-05-03T15:08:49.000Z

To cluster including energy usage or not?

With Energy usage certainly seems to make for more interesting results - especially with gas!

Answer 2 · 2019-05-03T15:16:02.000Z

That second one above was without energy values, income, rooms or HHsize....
This one is just without energy values - not much more elucidating...

Answer 3 · 2019-05-03T15:31:36.000Z

I think my logic is that we need to group areas using both energy and some characteristics - in order to then explain high areas by the known characteristics - and then unknown variations which we explore by other means? I don't believe that it is possible to predict energy usage through social and structural factors alone - but once we get to identify some different groups of high usage areas, we can then identify low areas with the same (non-energy) characteristics and contemplate what the differences might be... though here we are clearly missing out a lot of key factors such as urban/rural location, employment, social profiling etc.

This was the cluster map for the top pair of bar charts - still working on interpreting it, but now shutting down. Have a good weekend.
Clusters 2, 11, 5 and 12 (12 being off-gas grid areas I presume)

Answer 4 · 2019-05-08T15:13:59.000Z

I have copied over the script I used for making 13 kmeans clusters to the joint folder.
I have also created a TempTables folder, and in there is a copy of the table from the clustering so that we can keep the same cluster names for now.
I have also created a table with 12 clusters (using the same methodology), as it is much easier to do multiple plot outputs (2x6, 3x4) for 12 as opposed to 13!

Answer 5 · 2019-05-08T16:27:54.000Z

These are the 13 clusters ordered according to gas + electricity (and given letters)

Answer 6 · 2019-05-08T16:32:57.000Z

Here with 12 clusters - they don't look too different - so will focus mainly on 12 for now - I have also updated the tables in TempTables to include these letters

Answer 7 · 2019-05-08T17:11:38.000Z

And here is the division of clusters by LSOA Classification (Super Groups)
(annoyingly I can't work out how to neatly/quickly force ggplot to do a full axis of A to L without missing out the no data clusters)
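One approach that may help with the axis, sketched under the assumption that the cluster letter is the x variable: make it a factor with all twelve levels declared, then ask ggplot2 not to drop unused levels via `scale_x_discrete(drop = FALSE)`. The data below are made up, and the plotting part only runs if ggplot2 is installed.

```r
# Hypothetical data: only some of the clusters A to L appear in this subset
clusters <- c("A", "A", "C", "D", "D", "D", "L")

# Declare all twelve levels explicitly so absent clusters are still known
cluster_f <- factor(clusters, levels = LETTERS[1:12])
counts <- table(cluster_f)  # zero counts are kept for the empty levels
print(counts)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(cluster = cluster_f), aes(x = cluster)) +
    geom_bar() +
    scale_x_discrete(drop = FALSE)  # keep unused levels on the axis
  print(p)
}
```

Without the explicit `levels` argument the factor only knows about the letters present in that subset, so there is nothing for the axis to keep.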

Answer 8 · 2019-05-08T17:12:45.000Z

And here are the 23 groups

Answer 9 · 2019-05-08T17:16:09.000Z

This was the script I was using... I:\Github\Excess-Data-Exploration\Tim\RScripts\Clusters\Look at 13 clusters by LSOAC.R

Answer 10 · 2019-05-10T08:52:39.000Z

Good questions. Partial answer to this one:

Is there a reason you don't use data.table?

data.table is useful when speed is critical. In this situation it's not.

Answer 11 · 2019-05-13T09:48:55.000Z

@timchatterton are you using the classifications in the clustering or just describing the clusters with the classifications? Also, we know rural-urban matters, but kmeans is numeric only so I'm going to add population density to the set of input datasets.
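As a rough sketch of what adding a numeric proxy such as population density might look like: the columns below are invented and the values are random, so this shows the mechanics only. Standardising with scale() first is an assumption on my part, made so that variables measured in large units (kWh) do not dominate the distance calculation.

```r
set.seed(7)

# Hypothetical inputs; column names and values are illustrative only
areas <- data.frame(
  gas_kwh     = rnorm(200, mean = 12000, sd = 3000),
  elec_kwh    = rnorm(200, mean = 3500, sd = 800),
  pop_density = rexp(200, rate = 1 / 40)  # made-up people per hectare
)

# Standardise each column to mean 0 and standard deviation 1
areas_scaled <- scale(areas)

# k-means on the numeric, standardised inputs, 12 clusters as in the thread
km <- kmeans(areas_scaled, centers = 12, nstart = 20, iter.max = 50)

# Label clusters with letters; here the lettering is arbitrary, not ordered
areas$cluster <- LETTERS[km$cluster]
table(areas$cluster)
```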