SofieVG/FlowSOM

Elbow plot variance calculation

Jeff87075 opened this issue · 3 comments

Hi, how is the following formula for calculating the variance (used in the elbow plot) derived?

  c_wss <- 0
  for(j in seq_along(clustering)){
    if(sum(clustering == j) > 1){
      c_wss <- c_wss + (nrow(data[clustering == j, , drop = FALSE]) - 1) *
        sum(apply(data[clustering == j, , drop = FALSE], 2, stats::var))
    }
  }

I understand that the sum() part computes the within-cluster sum of squares, but why is it multiplied by nrow() - 1, which I assume is the degrees of freedom? Thanks a lot!

Mm, I'm trying to remember. I would assume the main idea here was to take a weighted version (so that larger clusters contribute more). I'm just not sure where the minus 1 comes from, or whether weighting by the number of data points is necessary in the first place...
There may well be a mistake in this code, because it does not actually work that well, and when using FlowSOM we typically handpick the number of metaclusters rather than relying on this automated approach.
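
One possible reading of the snippet, offered as a sketch rather than a statement of the original intent: stats::var returns the sample variance, which divides by n - 1, so multiplying by nrow() - 1 would simply undo that division and recover a plain sum of squared deviations. The notation below (C_j for the rows in cluster j, n_j for their count, d for the number of columns, \bar{x}_{jk} for the column mean) is introduced here for illustration and does not appear in the code.

```latex
% Sample variance of column k within cluster j, as computed by stats::var
s_{jk}^2 = \frac{1}{n_j - 1} \sum_{i \in C_j} \left( x_{ik} - \bar{x}_{jk} \right)^2

% Multiplying the summed column variances by (n_j - 1) cancels the denominator
\begin{aligned}
(n_j - 1) \sum_{k=1}^{d} s_{jk}^2
  &= \sum_{k=1}^{d} \sum_{i \in C_j} \left( x_{ik} - \bar{x}_{jk} \right)^2 \\
  &= \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2
\end{aligned}
```

Under this reading, c_wss accumulates the total within-cluster sum of squares, and larger clusters contribute more only because they have more terms in the sum, not because of an extra weight. The sum(clustering == j) > 1 guard would then skip single-point clusters, for which stats::var returns NA and the sum of squares is zero anyway.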

Ah, I see. An automated approach certainly has its limitations. On the topic of the SOM algorithm: since the FlowSOM package has its own code for performing the SOM, can I also ask what the major differences are between the SOM in FlowSOM and the SOM algorithm provided by the kohonen package?