Elbow plot variance calculation
Jeff87075 opened this issue · 3 comments
Hi, how is the following formula for calculating the variance (which is used in the elbow plot) derived?
    c_wss <- 0
    for(j in seq_along(clustering)){
        if(sum(clustering == j) > 1){
            c_wss <- c_wss + (nrow(data[clustering == j, , drop = FALSE]) - 1) *
                sum(apply(data[clustering == j, , drop = FALSE], 2, stats::var))
        }
    }
I understand that the sum() part is calculating the within sum of squares, but why does it have to be multiplied by what I assume is the degrees of freedom, nrow() - 1? Thanks a lot!
Mm, I'm trying to remember. I would assume the main idea here was to take a weighted version (so larger clusters contributing more), I'm just not sure where the minus 1 is coming from, and whether this weighting with the number of datapoints is necessary in the first place...
There certainly might be a mistake in this code, because it actually is not working that well, and typically when using FlowSOM we handpick the number of metaclusters rather than using this automated approach.
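On where the minus 1 could come from: stats::var() computes the sample variance, which divides the sum of squared deviations by n - 1. Multiplying a column's variance by (n - 1) therefore gives back that column's sum of squared deviations from the cluster mean, and summing over columns and clusters gives the usual within-cluster sum of squares. The sketch below checks this on made-up toy data. The names cluster_data, wss_var and wss_direct are illustrative assumptions, not FlowSOM code, and the loop runs over unique(clustering) rather than seq_along(clustering) only for brevity.

```r
# Toy data (made up for illustration): 6 points, 2 dimensions, 2 clusters
data <- matrix(c(1, 2, 3, 10, 12, 14,
                 2, 4, 9,  1,  1,  4), ncol = 2)
clustering <- c(1, 1, 1, 2, 2, 2)

# Form used in the thread: (n_j - 1) * sum of per-column sample variances
wss_var <- 0
for (j in unique(clustering)) {
  cluster_data <- data[clustering == j, , drop = FALSE]
  if (nrow(cluster_data) > 1) {  # var() of a single row would be NA
    wss_var <- wss_var +
      (nrow(cluster_data) - 1) * sum(apply(cluster_data, 2, stats::var))
  }
}

# Direct definition: squared deviations from each cluster's column means
wss_direct <- 0
for (j in unique(clustering)) {
  cluster_data <- data[clustering == j, , drop = FALSE]
  centered <- sweep(cluster_data, 2, colMeans(cluster_data))  # subtract means
  wss_direct <- wss_direct + sum(centered^2)
}

all.equal(wss_var, wss_direct)  # TRUE
```

Under this reading the (nrow() - 1) factor is not an extra weighting by cluster size; it undoes the division that var() performs. A cluster with a single point contributes zero under the direct definition as well, so skipping it only avoids the NA that var() would return. Whether this matches the original intent of the code is not confirmed in the thread.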
Ah I see, an automated approach certainly has its limitations. On the topic of the SOM algorithm: since the FlowSOM package has its own code for performing the SOM, can I also ask what the major differences are between the SOM in FlowSOM and the SOM algorithm implemented in the kohonen package?