How to achieve sample weighting?
jjyyxx opened this issue · 6 comments
Thanks for this awesome library. It's fast, powerful, and has already helped me a lot.
The sklearn KMeans algorithm has an optional `sample_weight` parameter specifying a weight for each observation. I'm wondering if this functionality is available in your library as well, e.g., by passing some sort of weighted dissimilarities? Or would I need to fork this crate and extend an algorithm?
P.S.: Currently, I'm not involved in clustering research; I'm simply using this for my own purposes. Hence, I'm uncertain whether weighting is even well-defined for k-medoids algorithms. So, apologies for any disturbance.
You can simply multiply the sample weights into your distance matrix, such that d'(x,y) = w_x * w_y * d(x,y). So there is no need to modify the Rust code.
It may be nice to have this in the Python wrapper, but it should require little more (an `if sample_weight is not None` check and a shape assertion) than `X = X * sample_weight * sample_weight[:, np.newaxis]` to multiply the sample weights onto both the rows and the columns of the distance matrix.
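For illustration, a minimal sketch of what such wrapper-side logic could look like (the `apply_sample_weight` helper is hypothetical and not part of the current python-kmedoids API; `X` is assumed to be a precomputed square distance matrix):

```python
import numpy as np

def apply_sample_weight(X, sample_weight=None):
    # Hypothetical helper: scale a precomputed distance matrix so that
    # d'(x, y) = w_x * w_y * d(x, y).
    if sample_weight is None:
        return X
    sample_weight = np.asarray(sample_weight, dtype=float)
    assert X.ndim == 2 and X.shape[0] == X.shape[1], "X must be a square distance matrix"
    assert sample_weight.shape == (X.shape[0],), "need one weight per observation"
    # Broadcast w_y over the columns, then w_x over the rows.
    return X * sample_weight * sample_weight[:, np.newaxis]
```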
I am leaving this open here because we probably should add this to the documentation. Feel free to make a pull request for the Python wrapper in https://github.com/kno10/python-kmedoids/ if you would like this to be in the API.
@kno10 Thanks for your suggestion! I tried it out, but it does not quite work as expected.
Considering the k-means objective function, it could be expressed as

$$\min \sum_i w_i \,\lVert x_i - \mu_{c(i)} \rVert^2,$$

where $c(i)$ is the cluster assigned to sample $i$ and $\mu_k = \frac{\sum_{i:\,c(i)=k} w_i x_i}{\sum_{i:\,c(i)=k} w_i}$ is the weighted mean of cluster $k$.

If I'm not mistaken, the k-medoids objective function works similarly,

$$\min \sum_i w_i \, d(x_i, m_{c(i)}),$$

where $m_k$ is the medoid of cluster $k$ (an actual data point).

I understand that in order to keep the dissimilarity matrix symmetric, the suggested $d'(x,y) = w_x \cdot w_y \cdot d(x,y)$ multiplies in the weight of the candidate medoid as well, but that no longer matches the objective above, where only the weight $w_i$ of the assigned sample appears.
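To make this concrete, here is a small constructed example (values chosen purely for illustration, with k = 1): a far-away point with very low weight becomes the cheapest medoid under the symmetric scheme, although per-sample weighting would never select it.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0])       # x3 = 10 is an outlier ...
w = np.array([1.0, 1.0, 0.01])       # ... with a very small weight
d = np.abs(x[:, None] - x[None, :])  # pairwise distances, shape (3, 3)

# Cost of choosing point m as the single medoid, under each weighting scheme.
per_sample = (w[:, None] * d).sum(axis=0)              # sum_o w_o * d(o, m)
symmetric = (w[:, None] * w[None, :] * d).sum(axis=0)  # sum_o w_o * w_m * d(o, m)

print(per_sample)  # ~[1.1, 1.09, 19.0] -> best medoid is x2
print(symmetric)   # ~[1.1, 1.09, 0.19] -> best medoid is the outlier x3
```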
Below is a visual example; the red dots are centroids or medoids. Considering weights in k-means improves the final metric for my specific task (statistically, over a dataset), but considering weights this way in k-medoids significantly worsens the performance (by favoring outliers).
|  | Not weighted | Weighted |
|---|---|---|
| k-means | *(figure)* | *(figure)* |
| k-medoids | *(figure)* | *(figure)* |
Ok, I see. Then what would be the desired weighting scheme? You could encode the weight on one side only, i.e., d'(o, m) = w_o * d(o, m), so that the rows (samples) are weighted but the columns (medoid candidates) are not; the resulting matrix is then no longer symmetric.
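For illustration, such a one-sided weighting could be built in numpy as follows (a sketch with made-up variable names; `diss` is a precomputed square distance matrix whose rows index the samples and whose columns index the medoid candidates):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 2))  # 5 samples in 2-D
w = rng.random(5)       # one weight per sample

# Pairwise Euclidean distances, shape (n, n).
diss = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

# d'(o, m) = w_o * d(o, m): weight the rows (samples) only, leaving the
# columns (medoid candidates) unweighted; the result is no longer symmetric.
diss_weighted = w[:, np.newaxis] * diss
```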
@kno10 Thanks for the suggestion. This indeed works, but only after small tweaks to the source code. When the dissimilarity matrix is symmetric, everything works fine; but once it becomes asymmetric, the order of the sample index and the medoid index becomes critical. The `fasterpam` algorithm I used unfortunately accesses the matrix inconsistently in several places. I patched this (see below), and now everything works as expected; I even get better performance than k-means.
```diff
diff --git a/src/fasterpam.rs b/src/fasterpam.rs
index 3749a06..1ea4f97 100644
--- a/src/fasterpam.rs
+++ b/src/fasterpam.rs
@@ -232,7 +232,7 @@ where
 // Improvement from the journal version:
 let mut acc = L::zero();
 for (o, reco) in data.iter().enumerate() {
- let djo = mat.get(j, o);
+ let djo = mat.get(o, j);
 // New medoid is closest:
 if djo < reco.near.d {
 acc += L::from(djo) - L::from(reco.near.d);
@@ -316,7 +316,7 @@ where
 reco.near = DistancePair::new(b as u32, N::zero());
 return L::zero();
 }
- let djo = mat.get(j, o);
+ let djo = mat.get(o, j);
 // Nearest medoid is gone:
 if reco.near.i == b as u32 {
 if djo < reco.seco.d {
```
Test code
```rust
use kmedoids::{fasterpam, random_initialization};
use ndarray::prelude::*;
use ndarray_rand::RandomExt;
use rand::prelude::*;

fn euclidean_distances(a: &ArrayView2<f64>) -> Array2<f64> {
    let n = a.nrows();
    Array2::from_shape_fn((n, n), |(i, j)| {
        let x = a.index_axis(Axis(0), i);
        let y = a.index_axis(Axis(0), j);
        distance(&x, &y)
    })
}

fn distance(x: &ArrayView1<f64>, y: &ArrayView1<f64>) -> f64 {
    let v = x - y;
    v.dot(&v).sqrt()
}

fn main() {
    const N: usize = 64;
    const D: usize = 2;
    const K: usize = 3;
    let mut rng = rand::rngs::StdRng::seed_from_u64(0);

    // Test data generation: N random points in [0, 1)^D with random per-sample
    // weights, normalized to sum to N.
    let (sample, weight) = {
        let uniform = rand::distributions::Uniform::new(0., 1.);
        let sample = Array2::<f64>::random_using((N, D), uniform, &mut rng);
        let mut weight = Array1::<f64>::random_using(N, uniform, &mut rng);
        weight /= weight.sum();
        weight *= N as f64;
        (sample, weight)
    };

    // K-medoids on the weighted dissimilarity matrix.
    let diss = euclidean_distances(&sample.view());
    // Apply weights per row: d'(sample, medoid) = w_sample * d(sample, medoid).
    let diss = diss * weight.slice(s![.., NewAxis]);
    let mut medoids = random_initialization(N, K, &mut rng);
    let (loss, labels, _, _): (f64, _, _, _) = fasterpam(&diss, &mut medoids, 100);
    dbg!(loss);

    // Check the loss: before the fix, `loss` is inconsistent with `loss2` and `loss3`.
    let loss2: f64 = itertools::izip!(sample.outer_iter(), labels.iter(), weight.iter())
        .map(|(x, &l, &w)| {
            let medoid = medoids[l];
            let y = sample.index_axis(Axis(0), medoid);
            distance(&x, &y) * w
        })
        .sum();
    let loss3: f64 = labels
        .iter()
        .enumerate()
        .map(|(i, &l)| {
            let medoid: usize = medoids[l];
            diss[[i, medoid]]
        })
        .sum();
    assert_eq!(loss, loss2);
    assert_eq!(loss, loss3);
}
```
TODO: to merge this, all the variants need to be checked for consistent access to the matrix. When doing this, we can also allow asymmetric matrices, although this has more potential for misuse (by passing a data matrix instead of a distance matrix; very few data matrices will be square).
I made some of the changes in 6a82530 but have not added any tests for asymmetric use yet. Some methods will be able to support this, others will not (e.g., Silhouette itself is defined based on pairwise distances of samples, and will not support asymmetric use, but the medoid Silhouette can).
I will see if I can have a student assistant add tests for the asymmetric cases eventually.