kno10/rust-kmedoids

Add a BanditPAM, BanditPAM++ implementation

Closed this issue · 3 comments

Dear rust-k-mediods team,

This is amazing implementation and many thanks for this. I noticed an even faster one, which was based on fastPAM but 4 times faster. It is not an exact solution but approximate solution. However, it is as accuracy as the exact solution in practice. Check here: https://proceedings.neurips.cc/paper/2020/file/73b817090081cef1bca77232f4532c5d-Paper.pdf

Any thoughts and potential to also implement this? It would be even useful for very large dataset.

Thanks,

Jianshu

kno10 commented

Its not faster in our experiments.
See motiwari/BanditPAM#175

Because we do not have distance functions, but currently operate on precompupted distance matrixes only, this is currently out-of-scope for this package, but this may, of course, change in the future.
Because the available implementation of BanditPAM performed worse for us (despite being already C++), it is not of a high priority.

We would appreciate the contribution of such algorithms as long as they do not add major external dependencies (to avoid dependency hell). I would suggest to first add:

  • CLARA (which gives you FastCLARA and FasterCLARA for free, since CLARA simply calls PAM on samples), and
    #5
  • CLARANS and FastCLARANS (which removes the need for one sampling from CLARANS)
    #6

These are quite similar but use uniform sampling, although my impression that FasterPAM will still outperform these if you want high accuracy. I do not think the BadingPAM authors ever compared to these baseline method...

Note there is a newer version of BanditPAM to be published at NeurIPS 2023.
The preprint is here: https://arxiv.org/pdf/2310.18844.pdf

But I am not keen of that paper either, because it

  1. does not compare to relevant baselines including CLARANS and FastCLARANS, or perform any reliable runtime comparison at all in the latest version (possibly because it did not look good to comparing to ours).
  2. contains misclaims, such as "these usually sacrifice clustering quality; such algorithms include CLARA [11], CLARANS [21], and FastPAM [25]. While these algorithms scale subquadratically in n, they return substantially worse solutions than PAM [28]." when FastPAM1 produce the same results as PAM, and FasterPAM has the same fixpoints as PAM (and hence of the same quality).
    also they claim "We note that Theorem 1 has been observed in alternate forms, e.g., as the “FastPAM1” trick, in prior work [28]. However, to the best of our knowledge, we are the first to provide a formal statement and proof of Theorem 1", when their 2023 proof is essentially Equations 5 and 6 from our 2021 journal version https://doi.org/10.1016/j.is.2021.101804 that was known to them but they did not cite. Even if that still was not formal enough to them, they should have acknowledged this.

Hence I am a bit reluctant to promote their work by implementing it.