How is the JSD calculated in the final paper?
Justin-Tan opened this issue · 3 comments
Hi authors,
How do you reliably compute the Jensen-Shannon divergence between the continuous mass distributions passing/failing the threshold? As far as I understand, the KL divergence, and by extension the JS divergence, between continuous distributions is generally intractable and hard to estimate reliably.
I looked through the code base but found it mildly confusing. From what I can gather, you discretize the distributions somehow and use an estimator for the entropy of the discretized distribution? But how do you calculate the cross entropy as well? A quick rundown would be very helpful, as this would be a very useful metric to quantify the extent of decorrelation to a given pivot variable.
Cheers,
Justin
Hi @Justin-Tan,
The simplest computation is here: lines 600 to 617 in `0ef7e34`.

Basically, we first get the `mass_pass` and `mass_fail` numpy arrays of mass values. These are turned into binned, normalized mass distributions, `spec_ohe_pass_sum` and `spec_ohe_fail_sum`. We take the average `M` of these two distributions, then compute the KL divergence of each distribution against `M` and average the two to get the JS divergence.
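To make that concrete, here is a minimal sketch of that recipe using `numpy` and `scipy.stats.entropy`. The function name, bin count, and mass range below are illustrative assumptions, not values taken from the paper or the repo:

```python
import numpy as np
from scipy.stats import entropy

def binned_jsd(mass_pass, mass_fail, bins=50, mass_range=(50.0, 300.0)):
    """Estimate the JS divergence between two samples by histogramming.

    `bins` and `mass_range` are illustrative defaults, not the values
    used in the paper.
    """
    # Bin both samples on a common grid and normalize to unit sum,
    # giving discrete probability distributions P and Q.
    p, edges = np.histogram(mass_pass, bins=bins, range=mass_range)
    q, _ = np.histogram(mass_fail, bins=edges)
    p = p / p.sum()
    q = q / q.sum()

    # Mixture distribution M = (P + Q) / 2.
    m = 0.5 * (p + q)

    # JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M).
    # scipy.stats.entropy(pk, qk) returns the KL divergence when given
    # two arguments; empty bins in P or Q contribute zero by the
    # 0 * log 0 = 0 convention.
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```

Note that `M` is strictly positive in every bin where either histogram is non-empty, so both KL terms stay finite; a perfectly decorrelated tagger gives JSD ≈ 0, and with natural-log KL terms the maximum possible value is ln 2.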
Thanks for the response! That makes sense; I didn't realize the `scipy.stats.entropy` function calculates the relative entropy when given two arguments, though it's obvious in hindsight ... :\
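For anyone else landing here, a quick sanity check (with made-up toy distributions) that `scipy.stats.entropy` returns the relative entropy when passed two arguments:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

# With one argument: Shannon entropy H(p).
# With two arguments: KL divergence D(p || q) = sum(p * log(p / q)),
# using the natural log by default.
kl = entropy(p, q)
manual = np.sum(p * np.log(p / q))
assert np.isclose(kl, manual)
```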
By the way, did you find the metric sensitive to the choice of binning at all?