kno10/rust-kmedoids

How to achieve sample weighting?

jjyyxx opened this issue · 6 comments

jjyyxx commented

Thanks for this awesome library. It's fast, powerful, and has already helped me a lot.

The sklearn KMeans algorithm has an optional sample_weight parameter that specifies a weight for each observation. I'm wondering if this functionality is available in your library as well, e.g., by passing some sort of weighted dissimilarities? Or would I need to fork this crate and extend an algorithm?

P.S.: Currently, I'm not involved in clustering research; I'm simply using it for my own purposes. Hence, I'm uncertain whether weighting is even well-defined for k-medoids algorithms. So, apologies for any disturbance.

kno10 commented

You can simply multiply sample weights into your distance matrix here, such that d'(x,y) = w_x * w_y * d(x,y).
So there is no need to modify the Rust code.
It may be nice to have this in the Python wrapper, but it should require little more (an if sample_weight is not None check, a shape assertion) than

X = X * sample_weight * sample_weight[:, np.newaxis]

to multiply sample weights on both rows and columns of the distance matrix.
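
The same weighting can be prepared on the Rust side before calling the solver. A minimal sketch, assuming the distances and weights are ndarray arrays (the helper name is just a placeholder):

use ndarray::prelude::*;
use ndarray::NewAxis;

// D'[i][j] = w[i] * w[j] * D[i][j]:
// scale rows by w (as a column vector) and columns by w (as a row vector).
fn symmetric_weighting(diss: &Array2<f64>, weight: &Array1<f64>) -> Array2<f64> {
    diss * &weight.slice(s![.., NewAxis]) * weight
}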

I am leaving this open here because we probably should add this to the documentation. Feel free to make a pull request for the Python wrapper in https://github.com/kno10/python-kmedoids/ if you would like this to be in the API.

jjyyxx commented

@kno10 Thanks for your suggestion! I tried it out, but it does not quite work as expected.

Consider the k-means objective function, which can be expressed as

$$L(Z, A) = \sum_{i=1}^{N}{\lVert x_i - z_j\rVert^2}, \quad\text{where } j = A(i)$$

where $Z$ refers to the centroids, and $A$ refers to the assignment. The weighted version is, then,

$$L(Z, A) = \sum_{i=1}^{N}{w_i\lVert x_i - z_j\rVert^2}, \quad\text{where } j = A(i)$$

If I'm not mistaken, the k-medoids objective function works similarly:

$$L(M, A) = \sum_{i=1}^{N}{\lVert x_i - x_{j}\rVert} = \sum_{i=1}^{N}{D_{ij}}, \quad\text{where } j = M(A(i))$$

where $M$ refers to the medoid indices, and $D$ is the dissimilarity matrix. The solution you suggest implies an objective like

$$L(M, A) = \sum_{i=1}^{N}{D'_{ij}} = \sum_{i=1}^{N}{w_i w_j D_{ij}}, \quad\text{where } j = M(A(i))$$

I understand that in order to keep $D'$ a valid symmetric pairwise distance matrix, $D'$ has to be defined as $(ww^{T}) \odot D$ (elementwise), but this leads to undesirable results. Consider a sample whose weight $w_j$ is much smaller than the weights of the other samples: it is very likely to be selected as the medoid, since even when $D_{ij}$ is large, the small $w_j$ makes $D'_{ij}$ smaller overall.
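
For a toy illustration of this effect: take three points with $D_{12} = 1$, $D_{13} = D_{23} = 10$ and weights $w_1 = w_2 = 1$, $w_3 = 0.05$. Under the symmetric weighting, picking the low-weight outlier $x_3$ as the single medoid costs

$$w_1 w_3 D_{13} + w_2 w_3 D_{23} = 0.5 + 0.5 = 1,$$

which beats picking $x_1$ (cost $w_2 w_1 D_{21} + w_3 w_1 D_{31} = 1 + 0.5 = 1.5$), whereas the intended objective $\sum_i w_i D_{ij}$ costs $20$ for medoid $x_3$ versus $1.5$ for medoid $x_1$.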

Below is a visual example. The red dots are the centroids or medoids. Considering weights in k-means improves the final metric for my specific task (statistically, over a dataset), but considering weights this way in k-medoids significantly worsens the performance (by favoring outliers).

[Figure: 2×2 grid of scatter plots; rows: k-means and k-medoids, columns: not weighted and weighted.]

kno10 commented

Ok, I see. Then what would be the desired weighting scheme? You could encode $\max\{w_i, w_j\}\, d(i,j)$ in the matrix, too. The implementation does not enforce a symmetric matrix, so you could also input an asymmetrically weighted matrix. But it needs to be double-checked that the distance function consistently uses d(medoid, sample). Then you could use $w_j\, d(i,j)$ as the weighting, if that is what you intend.
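
For concreteness, a minimal sketch of both variants with ndarray; the helper names are placeholders, and it assumes rows index samples and columns index medoids (the orientation the follow-up patch settles on):

use ndarray::prelude::*;
use ndarray::NewAxis;

// Asymmetric variant: D'[i][j] = w[i] * D[i][j]
// (row i = sample, column j = candidate medoid).
fn row_weighted(diss: &Array2<f64>, weight: &Array1<f64>) -> Array2<f64> {
    diss * &weight.slice(s![.., NewAxis])
}

// Symmetric alternative: D'[i][j] = max(w[i], w[j]) * D[i][j].
fn max_weighted(diss: &Array2<f64>, weight: &Array1<f64>) -> Array2<f64> {
    Array2::from_shape_fn(diss.dim(), |(i, j)| weight[i].max(weight[j]) * diss[[i, j]])
}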

jjyyxx commented

@kno10 Thanks for the suggestion. This indeed works, but only after small tweaks to the source code. When the dissimilarity matrix is symmetric, everything works fine; but when it becomes asymmetric, the order of sample and medoid becomes critical. The fasterpam algorithm I used unfortunately accesses the matrix inconsistently in several places. I patched this (see below), and now everything works as expected; I get better performance than k-means.

diff --git a/src/fasterpam.rs b/src/fasterpam.rs
index 3749a06..1ea4f97 100644
--- a/src/fasterpam.rs
+++ b/src/fasterpam.rs
@@ -232,7 +232,7 @@ where
 	// Improvement from the journal version:
 	let mut acc = L::zero();
 	for (o, reco) in data.iter().enumerate() {
-		let djo = mat.get(j, o);
+		let djo = mat.get(o, j);
 		// New medoid is closest:
 		if djo < reco.near.d {
 			acc += L::from(djo) - L::from(reco.near.d);
@@ -316,7 +316,7 @@ where
 				reco.near = DistancePair::new(b as u32, N::zero());
 				return L::zero();
 			}
-			let djo = mat.get(j, o);
+			let djo = mat.get(o, j);
 			// Nearest medoid is gone:
 			if reco.near.i == b as u32 {
 				if djo < reco.seco.d {

Test code

use rand::prelude::*;

use ndarray::prelude::*;
use ndarray_rand::RandomExt;

use kmedoids::{fasterpam, random_initialization};

fn euclidean_distances(a: &ArrayView2<f64>) -> Array2<f64> {
    let n = a.nrows();
    Array2::from_shape_fn((n, n), |(i, j)| {
        let x = a.index_axis(Axis(0), i);
        let y = a.index_axis(Axis(0), j);
        distance(&x, &y)
    })
}

fn distance(x: &ArrayView1<f64>, y: &ArrayView1<f64>) -> f64 {
    let v = x - y;
    v.dot(&v).sqrt()
}

fn main() {
    const N: usize = 64;
    const D: usize = 2;
    const K: usize = 3;

    let mut rng = rand::rngs::StdRng::seed_from_u64(0);

    // Test data generation
    let (sample, weight) = {
        let uniform = rand::distributions::Uniform::new(0., 1.);
        let sample = Array2::<f64>::random_using((N, D), uniform, &mut rng);
        let mut weight = Array1::<f64>::random_using(N, uniform, &mut rng);
        weight /= weight.sum();
        weight *= N as f64;
        (sample, weight)
    };

    // K-medoids
    let diss = euclidean_distances(&sample.view());
    let diss = diss * weight.slice(s![.., NewAxis]);  // Apply weights along the sample (row) axis: D'[i][j] = w_i * D[i][j]
    let mut medoids = random_initialization(N, K, &mut rng);
    let (loss, labels, _, _): (f64, _, _, _) = fasterpam(&diss, &mut medoids, 100);
    dbg!(loss);

    // Check the loss: before the fix, loss is inconsistent with loss2 and loss3
    let loss2: f64 = itertools::izip!(sample.outer_iter(), labels.iter(), weight.iter())
        .map(|(x, &l, &w)| {
            let medoid = medoids[l];
            let y = sample.index_axis(Axis(0), medoid);
            distance(&x, &y) * w
        })
        .sum();
    let loss3: f64 = labels.iter().enumerate()
        .map(|(i, &l)| {
            let medoid: usize = medoids[l];
            diss[[i, medoid]]
        })
        .sum();
    assert_eq!(loss, loss2);
    assert_eq!(loss, loss3);
}

kno10 commented

TODO: to merge this, all the variants need to be checked for consistent access to the matrix. When doing this, we can also allow asymmetric matrices, although this has more potential for misuse (by passing a data matrix instead of a distance matrix; very few data matrices will be square).
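
One cheap guard against that misuse (a sketch only, not part of the crate) would be to reject inputs that cannot be a pairwise dissimilarity matrix:

use ndarray::Array2;

// Sanity check a wrapper could run before accepting a (possibly asymmetric)
// dissimilarity matrix: a raw data matrix is rarely square, and a
// dissimilarity matrix should have a zero diagonal.
fn looks_like_dissimilarity(d: &Array2<f64>) -> bool {
    d.nrows() == d.ncols() && d.diag().iter().all(|&x| x == 0.0)
}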

kno10 commented

I made some of the changes in 6a82530 but have not added any tests for asymmetric use yet. Some methods will be able to support this, others will not (e.g., the Silhouette itself is defined based on pairwise distances between samples and will not support asymmetric use, but the medoid Silhouette can).
I will see if I can have a student assistant add tests for the asymmetric cases eventually.