Questions on Clustering Data and Script
Opened this issue · 11 comments
@mem48
Hi - thought I would test this out with some questions from working through your script.
Is there a reason you don't use data.table?
Age Table - What are Mode1 and 2? (Most often and second most often?)
EPC table - Crr=Current and Ptn=Potential? How can potential_Mode be lower than Current_Mode, e.g. for E01000001?
"Joining factors with different levels" Error - I have bluffed my way through properly understanding factors for too long - I think I might need you to explain them for me please!
First k-means graph - no clear elbow? Presume this is why you go on to dendrograms?
Line 141 uses fit (cutree(fit,k=13)) - but fit is not created until line 149
Even if you jump forward to 149 and run the fit<-pvclust line - you get an error going back to 142 for the groups<-cutree...
Without this you can't create the groups to allocate to lsoa_house$hcluster in line 163
Any idea what needs tweaking to get those clusters allocated?
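On the "Joining factors with different levels" question, a minimal base R sketch of what a factor is and why two key columns can clash, assuming the message arises when two tables are joined on a factor column. The tables `age` and `epc` and the columns `mode1` and `crr_mode` are made up for illustration and are not taken from the script.

```r
# Two hypothetical tables keyed by LSOA code, each storing the key as a factor.
age <- data.frame(lsoa = factor(c("E01000001", "E01000002")), mode1 = c(3, 5))
epc <- data.frame(lsoa = factor(c("E01000002", "E01000003")), crr_mode = c(4, 2))

# A factor is a vector of integer codes plus a lookup table of labels (the levels).
levels(age$lsoa)      # labels known to the first table
levels(epc$lsoa)      # labels known to the second table
as.integer(age$lsoa)  # codes 1, 2
as.integer(epc$lsoa)  # also codes 1, 2, but they point at different labels

# Converting the keys to plain character strings removes the ambiguity,
# so the join compares the LSOA codes themselves rather than integer codes.
age$lsoa <- as.character(age$lsoa)
epc$lsoa <- as.character(epc$lsoa)
joined <- merge(age, epc, by = "lsoa")
print(joined)
```

Only E01000002 appears in both toy tables, so the merged result has a single row. Converting key columns to character before joining is one common workaround, not necessarily what the script should do.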
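On the ordering problem with fit and cutree, a minimal sketch of the sequence that has to hold: build the tree, cut it, then attach the groups. It uses base R `hclust` as a stand-in for the script's `pvclust` call, and the `lsoa_house` data frame here is toy data with invented columns, so treat it as an illustration under those assumptions rather than a fix for the actual script.

```r
set.seed(1)

# Hypothetical stand-in for lsoa_house: 20 areas, 3 numeric variables.
lsoa_house <- data.frame(
  gas   = rnorm(20, 15000, 3000),
  elec  = rnorm(20, 4000, 800),
  rooms = rnorm(20, 5, 1)
)

# 1. Build the tree first. scale() puts the variables on a common footing.
d   <- dist(scale(lsoa_house))
fit <- hclust(d, method = "ward.D2")

# 2. Only then cut it. k = 5 is arbitrary for this toy data.
groups <- cutree(fit, k = 5)

# 3. cutree returns one cluster id per row, in row order, so it can be
#    assigned straight onto the data frame.
lsoa_house$hcluster <- groups
print(table(lsoa_house$hcluster))
```

If fit in the script is a pvclust object, two things may be worth checking: as far as I know pvclust clusters the columns of its input rather than the rows, and it keeps its tree in `fit$hclust`, so `cutree(fit$hclust, k = 13)` might be what is needed. That is untested against the real script.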
I think my logic is that we need to group areas using both energy and some characteristics - in order to then explain high areas by the known characteristics - and then unknown variations which we explore by other means? I don't believe that it is possible to predict energy usage through social and structural factors alone - but once we get to identify some different groups of high usage areas, we can then identify low areas with the same (non-energy) characteristics and contemplate what the differences might be... though here we are clearly missing out a lot of key factors such as urban/rural location, employment, social profiling etc.
This was the cluster map for the top pair of bar charts - still working on interpreting it, but now shutting down. Have a good weekend!
Clusters 2, 11, 5 and 12 (12 being off-gas grid areas I presume)
I have copied over the script I used for making 13 kmeans clusters to the joint folder.
I have also created a TempTables folder, and in there is a copy of the table from the clustering so that we can keep the same cluster names for now.
I have also created a table with 12 clusters (using the same methodology), as it is much easier to do multiple plot outputs (2x6, 3x4) for 12 as opposed to 13!
This was the script I was using... I:\Github\Excess-Data-Exploration\Tim\RScripts\Clusters\Look at 13 clusters by LSOAC.R
Good questions. Partial answer to this one:
Is there a reason you don't use data.table?
data.table is useful when speed is critical. In this situation it's not.
@timchatterton are you using the classifications in the clustering or just describing the clusters with the classifications? Also, we know rural-urban matters, but kmeans is numeric only so I'm going to add population density to the set of input datasets.
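A minimal sketch of what adding population density as a numeric input could look like, assuming standardised inputs and base R `kmeans`. The data frame `lsoa` and the columns `gas_kwh` and `pop_density` are invented for illustration; the real input set may differ.

```r
set.seed(42)

# Hypothetical inputs: energy use plus population density as a numeric
# proxy for rural-urban location. Two deliberately well-separated groups.
lsoa <- data.frame(
  gas_kwh     = c(rnorm(10, 12000, 500), rnorm(10, 18000, 500)),
  pop_density = c(rnorm(10, 80, 5), rnorm(10, 20, 5))
)

# Standardise each column so that kWh (thousands) does not swamp
# density (tens) in the Euclidean distances kmeans relies on.
x <- scale(lsoa)

# nstart = 25 reruns kmeans from several random starts and keeps the best.
km <- kmeans(x, centers = 2, nstart = 25)
lsoa$cluster <- km$cluster

# Total within-cluster sum of squares for k = 1..6: the quantity an
# elbow plot shows on its y axis.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 2))
```

Without the `scale()` step the clustering would be driven almost entirely by whichever variable has the largest raw range, which is worth bearing in mind when mixing energy figures with density or counts.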