scikit-hep/hepstats

Adding 'Energy Test' two-sample test

dylanjaide opened this issue ยท 18 comments

Following discussions with @chrisburr and others, we propose adding the Energy Test to hepstats. This is a two-sample hypothesis test currently used at LHCb to observe local CP violation in 3- and 4-body decays ([1], [2], [3]).

Chris' suggestion was to use the existing C++ implementation (see here, although some minor extra features have been added to other versions since this one) and to add python bindings for it, along with a python executable for command line usage. There also exists a python implementation, although it's slower & may not be preferable.

@Dylan-JW Excellent, I have learned about a new statistic test! This would be a great addition to hepstats.

Chris' suggestion was to use the existing C++ implementation (see here, although some minor extra features have been added to other versions since this one) and to add python bindings for it, along with a python executable for command line usage. There also exists a python implementation, although it's slower & may not be preferable.

I don't have a lot of experience with python bindings, but this seems reasonable. Would you move also the C++ part to hepstats?
What about a Numba accelerated implementation?

Hi all, sounds great.

Concerning the nice suggestion on using Numba with pure Python. Yep, that might be good enough given that the algorithm probably isn't that big hence preparing bindings with PyBind11 might be an overkill. May I suggest that you discuss on how to go about this in https://gitter.im/Scikit-HEP/community as you will then get lots of expert help? Just a thought.

I think this would be a nice addition as it is a generally applicable statistical test. It would also be nice to generalise the code to work for any number of dimensions.

On big change would be that this would introduce a small C++ component which would end up requiring the hepstats CI and deployment to be changed. Personally I don't see this as too much of a problem thanks to the work which has been done for boost-histogram and awkward1 but I'm also not responsible for hepstats and I understand if people would rather keep it pure Python.

Actually it would be interesting to see if numba could be used to optimise this without falling back to C++, especially as that would allow the same implementation to run on GPUs. I suspect wouldn't be fast enough considering hundreds of CPU years have been used running the handful of lines in this loop but it would be good to check as it's a perfect example of where numba should shine.

EDIT: I see while writing this similar suggestions have been made! ๐Ÿ˜†

Hi all, thanks for the feedback. One of our group has been working on a version (in a private repo) using numba & CUDA, although I believe it's not yet finished & it hasn't been benchmarked against the C++ implementation yet. Also, I agree that generalising the code to work with any number of dimensions would be worth doing. Furthermore, if a python implementation using numba is the way forward, could it also be worth adding the ability to take inputs from sources other than plaintext files (eg. pandas dataframes)?

Hi all,

I implemented a numba version using the GPU a while back. I've recently added a parallel implementation for CPU using numba. At present you can choose whether to use CPU or GPU using a flag. This version is on the LHCb GitLab. I could probably mirror it to github but regardless here is the link and I've added the people on this thread.
Link to repo

As previously mentioned it hasn't been bench marked for performance yet but the C++ version has been used to test the results.An issue somewhat related to this is that there seems to be problems with floating point accuracy in that the CPU numba, GPU numba and C++ versions which all give different test statistic values as the dataset increases. The difference is O(10^-7) and lower but of course any difference could lead to incorrect significance.

Indeed that's worrisome if the floating point accuracy is different between versions especially when you want very low p-values.
I guess you could also encounter this problem when testing the implementation into two different machines, have you tried?

Indeed that's worrisome if the floating point accuracy is different between versions especially when you want very low p-values.
I guess you could also encounter this problem when testing the implementation into two different machines, have you tried?

I have not tried this yet but I will. For a sample size of 2 million the difference between the python version and the C++ becomes larger with agreement at 4 significant figures as opposed to 7 for a sample size of 200k but all the test statistic values are different. I'm now concerned that maybe there might be issues with the loops in the function since I would have expected consistent differences i.e always differ at 7 significant figures since we use double precision (float64) throughout and rounding errors wouldn't be this large?

Any news about this :) ?

Hi all, are there any news on this? It was mentioned recently that a pure Python implementation of this is available (by Abhjit), in case this here is stalled.

Better to ping Abhijit explicitly, then, even if via email ...

Hi,

I never got around to finding out the cause of the numerical differences between this and the C++ version. However it sounds like this new Python version seems to work. Do you know if Abhijit has compared it with the C++ version to check for similar numerical issues?

Hi all, I believe there's now 3 separate Python implementations floating around:

So we sort of have our pick of the bunch - probably whichever we think will perform the best, I'd guess? I'm happy to send Abhijit an email to discuss further if we wish (though their repository does describe it as "not the most optimal implementation", at least in its current form).

Hi All, It would also be good to include not just the Energy test but also other un-binned methods too, if available (e.g. Mixed sample). My code should scale with number of dimensions but does not scale well with sample size. It does not yet include the feature presented in this paper by W. Barter et al. I say my implementations of the two methods are not the most "optimal" because I have not compared it against other implementations (I will have to first check the performance and see if there are parts where I can improve, but have no immediate plans for doing this). By the way (@dylanjaide ), John's link above is broken, I cannot open it.

I think Giulio's version also doesn't implement the scaling method you mention - however it looks to me like John's does. And yeah, the poor (O(n^2)) scaling with sample size is one of the main issues with this method, and why we think it'd be ideal to have an implementation for scikit-hep that also works on GPUs. With all of this in mind, it seems to me that John's version is likely the best starting point, as long as we can work out this issue with the floating point precision.

(also, re. the link to John's repo: the link is correct, I just forgot that it's a private repo, sorry!)

Hi @abhijitm08,

I've made the repo public on GitLab so you should be able to view it.

Indeed, I agree. There is one advantage though of Giulios implementation: as it seems to me, it is fully vectorized and does not rely actually on for-loops, which is elegant.

That aside, I think @jcob95 implementation seems favorable given the jit. Maybe we can try with long doubles to gain some extra precision and understand which implementation is actually "correct"?

@mayou36 Had to use explicit for loops because the kernel function ("phi") is actually symmetric matrix atleast for the 2 same-sample terms of the test statistic. So you need to only evaluate lower or upper side of the triangle and not both, saving computing time for large datasets. I actually do this and it looks like @jcob95 does it too. For the exponential form of the kernel function, the cross term will also be symmetric (so we can extend it to this term too for this choice of kernel function, which should also speed things up). I think Giulio's code calculates the full matrix, this will not be too performant for large datasets.

Also want to point out that the best rejection power of this test is obtained when sigma used in kernel function definition is actually pdf dependent. So will have to provide an option that takes a callable to calculate pdf-dependent sigma.

Hi all, coming back to this: it seems that we have a few implementations around and solvable problems ahead. I guess it is just time that someone tackles it, as I think it would be a good addition here.

Maybe it could even be a student project.

I can gladly review and help with implementation specificalities that concern speed, but I don't know enough about the test nor the usage of it in the wild.

Has someone the interest to go actively forward to PR this?