How can I get kmeans clustered features?

Question

Remaxic opened this issue 9 months ago · 6 comments

Hi,
I ran the checkpoint_best_legacy_100.pt model with the inference code under the fairseq framework and found that the generated features were unclustered. I read in your paper that it is optional whether the output is clustered or not, so I would like to know how I can choose to output the clustered features.

Meanwhile, I clustered the output using learn_kmeans.py and dump_km_label.py from the fairseq framework. I chose n=50 and then decoded the result using a trained decoder, and the results were very poor. I'm wondering if this is because your model was trained for n=100, so that even though the output is continuous features, it only gives its best performance at n=100?

Answer 1 · 2024-02-27T15:52:02.000Z

Clustering is a separate step. You need to use the code in the fairseq framework to do that, just like you did above.
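As a rough illustration of what "a separate step" means, here is a minimal sketch of clustering frame-level features after they have been extracted. It is not the fairseq scripts themselves: `fit_kmeans` and `assign_labels` are hypothetical helper names, the algorithm is a plain Lloyd's k-means written in NumPy, and the random array only stands in for features dumped from the model.

```python
import numpy as np

def assign_labels(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def fit_kmeans(features, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns an (n_clusters, dim) centroid array."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from randomly chosen frames.
    idx = rng.choice(len(features), size=n_clusters, replace=False)
    centroids = features[idx].copy()
    for _ in range(n_iter):
        labels = assign_labels(features, centroids)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return centroids

# Assumption: random data standing in for (num_frames, feature_dim) model features.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 16)).astype(np.float32)

centroids = fit_kmeans(features, n_clusters=50)
units = assign_labels(features, centroids)  # one integer cluster id per frame
print(units[:10], units.min(), units.max())
```

The model only produces `features`. The centroids, and therefore the discrete ids in `units`, come entirely from this second step.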

"I'm wondering if this is because your model was trained for n=100, so even though the output is continuous features, it only presents the best performance at n=100?"
Not necessarily. You can cluster the features into any number of clusters you want. The key here is to retrain the decoder, because even if you cluster into 100 classes, the class ids are going to be different every time you run the clustering.
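The point about class ids can be shown with a small sketch. It simulates a second clustering run by reordering the centroids of the first, which is a simplifying assumption: a real rerun would usually also move the centroids slightly. The helper name `assign_labels`, the random features, and the chosen permutation are all hypothetical.

```python
import numpy as np

def assign_labels(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 8))  # stand-in for extracted features

# "Run A": a set of 4 centroids standing in for a fitted k-means model.
centroids_a = rng.normal(size=(4, 8))
# "Run B": the same centroids stored in a different order, as a different
# random initialisation could produce.
perm = np.array([2, 0, 3, 1])
centroids_b = centroids_a[perm]

ids_a = assign_labels(features, centroids_a)
ids_b = assign_labels(features, centroids_b)

# The grouping of frames is identical once the ids are mapped back through perm...
same_partition = bool(np.array_equal(perm[ids_b], ids_a))
# ...but the raw ids never match, because perm moves every index.
id_agreement = float((ids_a == ids_b).mean())
print(same_partition, id_agreement)  # True 0.0
```

A decoder trained on the ids from run A would read the ids from run B as different units, which is why the decoder has to be retrained against the specific k-means model used to produce its inputs.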

Answer 2 · 2024-02-28T05:36:15.000Z

I see! Thank you very much!
"Clustering is a separate step", so what's the difference between the model with classes=100 and classes=500?

Answer 3 · 2024-02-28T05:54:56.000Z

That's the number of clusters used for the teacher labels.

Answer 4 · 2024-02-28T06:48:48.000Z

Thank you!

Answer 5 · 2024-03-16T03:44:39.000Z

@Remaxic Hi. Have you obtained good clustering results? Could you share your script?

Answer 6 · 2024-03-26T10:30:30.000Z

@huangf79 Hi, I just extracted the features of my dataset using the ContentVec model and generated k-means clustering models with k=50 and k=100 by calling the learn_kmeans.py and dump_km_label.py scripts from the fairseq framework. I found that the former performs nowhere near as well as the latter, and does not even meet the basic needs of my downstream task.

I read the papers by the HuBERT authors, hoping to find their particular method for training a perfect clustering model. But there doesn't seem to be one, and they didn't perform dimensionality reduction or other special operations. The only difference is that their dataset (100h) is much larger than mine (about 44h). Considering that the model performs well with k=100, I'm guessing it has something to do with ContentVec's feature extraction capabilities. Perhaps it is not suitable for small-codebook tasks, or perhaps a better discretisation idea is needed.

If you have a better clustering idea and would like to let me know, I would be very grateful!