Extreme Multi-label classification - XRTransformer - Using word embeddings of Label descriptions for clustering
codemonk2023 opened this issue · 6 comments
I have descriptions for my labels and would like to use them as label features instead of positive examples. I am looking for a code example that makes PECOS use only BERT-encoded (my own model) label description features instead of TF-IDF. I am fine if it has to be TF-IDF + label description embeddings.
I am reading the notebooks below:
https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%202%20Extreme%20Multi-label%20Classification%20with%20PECOS.ipynb
https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%205%20eXtreme%20Multi-label%20Classification%20with%20XR-Transformer.ipynb
Is creating an .npz file the way to go? Is there any example of taking BERT embeddings and saving them in an .npz file? Is it possible to configure PECOS not to use TF-IDF for both the multi-label classifier that predicts clusters and the ranker?
Is there a description of the ranker model that explains how to tune it for better results?
@jiong-zhang Thank you for the wonderful work. I would like to see if you have any thoughts on this.
The hierarchical clustering implementation can take either a dense or a sparse feature matrix as input. You can just run Indexer.gen(label_feat) with label_feat being either numpy.ndarray or scipy.sparse.csr.csr_matrix.
Yes, you can train the model without any TF-IDF features at all; the sparse X feature is optional.
Generally speaking, not using TF-IDF would have a negative impact on your model performance, but it's data dependent.
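For anyone looking for a starting point, here is a minimal sketch of turning label descriptions into a dense label_feat matrix and handing it to Indexer.gen as described above. The toy_encode function is a hypothetical stand-in for your own BERT encoder, the float32 dtype and the L2 normalization are my assumptions rather than confirmed PECOS requirements, and the import path pecos.xmc is taken from the tutorials. np.save writes a dense .npy file; scipy.sparse.save_npz would be the analogous call if you end up with a sparse matrix.

```python
import os
import tempfile

import numpy as np


def toy_encode(text, dim=8):
    """Hypothetical stand-in for a BERT encoder: a deterministic pseudo-random vector."""
    rng = np.random.default_rng(sum(text.encode("utf-8")))
    return rng.standard_normal(dim)


def build_label_feat(descriptions, encode):
    """Encode each description and stack into a (num_labels, dim) float32 matrix."""
    vecs = np.asarray([encode(d) for d in descriptions], dtype=np.float32)
    # L2-normalize each row (an assumption; guards against division by zero).
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)


def cluster_labels(label_feat):
    """Build the label hierarchy from label features; requires PECOS to be installed."""
    from pecos.xmc import Indexer  # imported lazily so the rest runs without PECOS

    return Indexer.gen(label_feat)


# One description per label, in the same order as the label definition file.
descriptions = [
    "description of class1",
    "description of class2",
    "description of class3",
    "description of class4",
]
label_feat = build_label_feat(descriptions, toy_encode)

# Save the dense matrix so it can be reloaded later with np.load.
path = os.path.join(tempfile.gettempdir(), "label_feat.npy")
np.save(path, label_feat)
```

With PECOS installed, cluster_labels(label_feat) would then produce the clustering used by the matcher. With real data you would replace toy_encode with a call into your own model.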
Thank you! Do you have any example of just testing without TF-IDF and using transformers?
@jiong-zhang Also, I am following the custom dataset example from here: https://github.com/amzn/pecos/blob/mainline/tutorials/kdd22/Session%205%20eXtreme%20Multi-label%20Classification%20with%20XR-Transformer.ipynb
Below is my output-labels.txt, which is the list of label names. I am assuming label_feat needs to follow the same order as the labels in output-labels.txt. Please correct me if I am wrong.
class1
class2
class3
class4
Yes, the label features follow the order of the lines in the definition of the output labels.
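To make the ordering concrete, here is a small sketch that aligns per-label vectors with the line order of output-labels.txt. It assumes label_feat has one row per label, so row i belongs to the label on line i of the file. The feat_by_name dictionary and the 3-dimensional vectors are made up for illustration, and the file contents are inlined as a string so the snippet runs on its own.

```python
import numpy as np


def align_label_feat(label_names, feat_by_name):
    """Stack label vectors so that row i matches label_names[i]."""
    missing = [name for name in label_names if name not in feat_by_name]
    if missing:
        raise KeyError(f"no features for labels: {missing}")
    return np.stack(
        [np.asarray(feat_by_name[name], dtype=np.float32) for name in label_names]
    )


# Contents of output-labels.txt, one label name per line.
label_file_text = "class1\nclass2\nclass3\nclass4\n"
label_names = [line.strip() for line in label_file_text.splitlines() if line.strip()]

# Hypothetical embeddings keyed by label name, deliberately out of order.
feat_by_name = {
    "class3": [0.0, 0.0, 1.0],
    "class1": [1.0, 0.0, 0.0],
    "class4": [1.0, 1.0, 1.0],
    "class2": [0.0, 1.0, 0.0],
}

label_feat = align_label_feat(label_names, feat_by_name)
```

Keying the vectors by label name and reordering once avoids silent misalignment when the embeddings were computed in a different order than the label file.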
Thank you! Do you have any example of just testing without TF-IDF and using transformers?
For the CLI tool, the argument -x to python3 -m pecos.xmc.xtransformer.train is optional.
For the Python API, the argument X_feat to MLProblemWithText is optional.
If you don't supply that feature, XR-Transformer will not use TF-IDF.
Note that you'll need to construct your own label features or indexer if you don't supply the TF-IDF features.
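Putting the points above together, a text-only training run through the Python API might look roughly like the sketch below. This is not an official example: the import paths and the clustering keyword of XTransformer.train are my reading of the Session 5 tutorial and should be checked against your PECOS version, and check_problem is a hypothetical helper added only to catch shape mismatches early. X_text is assumed to be a list of strings, Y a sparse instance-by-label matrix, and label_feat a matrix with one row per label built from your own embeddings.

```python
def check_problem(X_text, Y, label_feat):
    """Hypothetical sanity check: shapes of text, labels, and label features must agree."""
    n_inst, n_labels = Y.shape
    if len(X_text) != n_inst:
        raise ValueError(f"{len(X_text)} texts but Y has {n_inst} rows")
    if label_feat.shape[0] != n_labels:
        raise ValueError(
            f"{label_feat.shape[0]} label feature rows but Y has {n_labels} columns"
        )


def train_text_only(X_text, Y, label_feat):
    """Train XR-Transformer without TF-IDF features; requires PECOS to be installed."""
    # Imported lazily so check_problem can be used without PECOS.
    from pecos.xmc import Indexer
    from pecos.xmc.xtransformer.model import XTransformer
    from pecos.xmc.xtransformer.module import MLProblemWithText

    check_problem(X_text, Y, label_feat)

    # Build the label hierarchy from your own label features,
    # since there is no TF-IDF matrix to derive it from.
    cluster_chain = Indexer.gen(label_feat)

    # X_feat is omitted, so no sparse TF-IDF features are used.
    prob = MLProblemWithText(X_text, Y)
    return XTransformer.train(prob, clustering=cluster_chain)
```

Calling train_text_only(X_text, Y, label_feat) would start a full training run, so it is worth validating the inputs with check_problem on a small sample first.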