webis-de/small-text

'SequenceClassifierOutput' object has no attribute 'softmax'

Closed this issue · 9 comments

Bug description

Hi there,
Thanks for creating this amazing project. While working through the notebook "Intro: Active Learning for Text Classification with Small-Text", I tried to replace query_strategy = PredictionEntropy() with other built-in strategies, e.g.,
query_strategy = ExpectedGradientLength(num_classes). However, it always raises the same error:

```
Train accuracy: 0.80
Test accuracy: 0.57
0%| | 0/8530 [00:00<?, ?it/s]

AttributeError Traceback (most recent call last)
in <cell line: 23>()
23 for i in range(num_queries):
24 # ...where each iteration consists of labelling 20 samples
---> 25 indices_queried = active_learner.query(num_samples=20)
26
27 # Simulate user interaction here. Replace this for real-world usage.

5 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in softmax(input, dim, _stacklevel, dtype)
1856 dim = _get_softmax_dim("softmax", input.dim(), _stacklevel)
1857 if dtype is None:
-> 1858 ret = input.softmax(dim)
1859 else:
1860 ret = input.softmax(dim, dtype=dtype)

AttributeError: 'SequenceClassifierOutput' object has no attribute 'softmax'
```

I'd like to know if there is something I am missing. Any suggestions would be appreciated!

Hi,

This is an implementation which is only targeted at the KimCNN classifier. You cannot use this strategy with transformer models as it is.

Maybe, I should make a strict check for the classifier type here and throw a more meaningful exception (although this would make it less "pythonic" probably).

See also this issue #58 for a longer explanation.
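To illustrate the failure mode and the kind of strict check mentioned above, here is a minimal sketch. All names in it (`FakeSequenceClassifierOutput`, `FakeKimCNN`, `FakeTransformer`, `check_egl_compatible`) are hypothetical stand-ins and are not part of small-text or transformers. The only fact it relies on is that Hugging Face sequence classification models return an output object that stores the scores in a `logits` attribute, which is why calling softmax on the object itself fails.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeSequenceClassifierOutput:
    """Stand-in for a transformers model output: scores live in `.logits`."""
    logits: List[float]


class FakeKimCNN:
    """Stand-in for a model whose forward pass returns raw scores."""

    def forward(self, text: str) -> List[float]:
        return [0.2, 1.3]


class FakeTransformer:
    """Stand-in for a model whose forward pass returns a wrapper object."""

    def forward(self, text: str) -> FakeSequenceClassifierOutput:
        return FakeSequenceClassifierOutput(logits=[0.2, 1.3])


def check_egl_compatible(model: object) -> None:
    """Hypothetical guard: fail early with a readable message."""
    if not isinstance(model, FakeKimCNN):
        raise TypeError(
            f"This EGL sketch only supports FakeKimCNN, got {type(model).__name__}."
        )


check_egl_compatible(FakeKimCNN())  # passes silently

try:
    check_egl_compatible(FakeTransformer())
except TypeError as err:
    message = str(err)  # names the unsupported model type
```

A check like this trades duck typing for an earlier and clearer error, which is the trade-off described above.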

Hi @chschroeder ,

Thanks for the prompt response. Now I get the point. Here are a few follow-up questions:

  1. So when I use transformer models, I can only try the following query strategies, which are described in the documentation section - General.
    1.LeastConfidence
    2.PredictionEntropy
    3.BreakingTies
    4.BALD
    5.EmbeddingKMeans
    6.GreedyCoreset
    7.LightweightCoreset
    8.ContrastiveActiveLearning
    9.DiscriminativeActiveLearning
    10.CategoryVectorInconsistencyAndRanking
    11.SEALS
    12.RandomSampling

However, the EGL query strategies listed under "Pytorch" are not available for transformer models?

  2. Is there any limitation on using a different transformer model to fine-tune on a multi-class task, e.g.:
transformer_model_name = 'distilbert-base-uncased'  # replace bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained(
    transformer_model_name
)
  3. I'd like to know your opinion about fine-tuning for text classification tasks. I saw you used SetFit in the last 2 example notebooks; is that because SetFit outperforms BERT models in such tasks?

Again, thanks for creating such an amazing approach, I'd like to learn more about small-text.

Cheers,
Leo

> 1. So when I use transformer models, I can only try the following query strategies, which are described in the documentation section - General.

All of them should work. Note that SEALS is a subsampling strategy and cannot be used on its own.

The general idea is that everything in this library can be mixed and matched, but sometimes there are conceptual obstacles (e.g. for gradient-based strategies you need models that have gradients), and other times it is difficult (or impossible) to provide a completely model-agnostic implementation (as is the case with EGL).

You could use EGL for transformers in theory. I tried it once with a less sophisticated implementation, and since it did not seem particularly effective, I abandoned this idea. In #58 there were mentions of counterexamples, but my impression is that EGL is rarely used nowadays.

> 2. Is there any limitation on using a different transformer model to fine-tune on a multi-class task, e.g.:

You mean whether you can use other transformer models here? Many should work, but there are exceptions. I don't have a full list, and I could only try to compile a list of models that work, not one of models that cannot be used. This is due to differing model implementations in the transformers library and is somewhat out of my control.

> 3. I'd like to know your opinion about fine-tuning for text classification tasks. I saw you used SetFit in the last 2 example notebooks; is that because SetFit outperforms BERT models in such tasks?

SetFit generally outperforms BERT. You train on pairs of instances, and since you can form many pairs given a certain amount of instances, you are effectively training on more data. If you want a good starting point, try SetFit with an uncertainty-based query strategy.
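The pair argument can be made concrete with a small sketch. It is a simplification under stated assumptions: the toy `examples` list is made up, and enumerating every unordered pair is only meant to show how quickly the number of training pairs grows. It is not a description of how the SetFit library actually samples its pairs.

```python
from itertools import combinations
from math import comb

# Hypothetical toy training set: (text, label).
examples = [
    ("smoked salmon fillet", "seafood"),
    ("frozen shrimp", "seafood"),
    ("butter cookies", "cookies"),
    ("oat cookies", "cookies"),
    ("ground beef", "meat"),
]

# Every unordered pair becomes one training example:
# target 1 if both texts share a label, otherwise 0.
pairs = [
    (a_text, b_text, int(a_label == b_label))
    for (a_text, a_label), (b_text, b_label) in combinations(examples, 2)
]

num_pairs = len(pairs)
num_positive = sum(target for _, _, target in pairs)

# n instances yield n * (n - 1) / 2 unordered pairs.
assert num_pairs == comb(len(examples), 2)
```

With n labeled instances there are n(n-1)/2 unordered pairs, so the number of pairs grows quadratically while the labeling effort grows only linearly.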

> Again, thanks for creating such an amazing approach, I'd like to learn more about small-text.

Thanks! These were some good questions. Whenever I have time right now, I am working on the 2.0.0 branch, but if I find some time in between, I will try to persist some of this information to the docs.

Hi @chschroeder ,

Thanks for the suggestions, and I look forward to the 2.0.0 branch. I have found that small-text really helps to reduce the training size of my current text classification project. In short, the goal is to use the product description to predict the correct product category, e.g. seafood, cookies, meat, dairy, etc. (42 classes in total). Without active learning, it needs 107k instances to fine-tune a distilled BERT model to reach an accuracy of 91%; however, when I apply small-text with PredictionEntropy, it gets the same result with 30k training instances. Now my question is which query strategy could work best for such a task. I am trying to exhaustively test small-text's query strategies on my project, but as you can imagine, it takes time and computing resources. Feel free to share your further thoughts if you are interested, i.e. is there any query strategy I should prioritize? Much appreciated. I will definitely compare SetFit and DistilBERT to see if switching models gives better performance.

Leo

Hi @Haoyoudoing,

> Thanks for the suggestions, and I look forward to the 2.0.0 branch. I have found that small-text really helps to reduce the training size of my current text classification project.

Thank you, I'm happy to hear this :).

> it needs 107k instances to fine-tune a distilled BERT model to reach an accuracy of 91%

Do you already have 107k labeled instances? In this case you might already have more than enough data and could use SetFit to further improve those results.

> Now my question is which query strategy could work best for such a task

With multi-class and so many classes, you might try BreakingTies instead. Otherwise it depends on your data. I assume the class distribution of those products is likely not very balanced?

There's a lot of theory on this, but right now uncertainty-based strategies are a really strong baseline, which is surprising since they also have weaknesses. In your case, such a strategy might select similar samples (products) over and over again, or might keep selecting the same classes while neglecting a minority class.

But if you really have 107k labeled samples I would try SetFit with a good sentence transformer model first.
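To show why the two uncertainty scores discussed above can disagree when there are many classes, here is a minimal sketch in plain Python. The functions `prediction_entropy` and `breaking_ties_margin` and the two probability vectors are assumptions made for illustration; they are not small-text's implementations of PredictionEntropy and BreakingTies.

```python
import math
from typing import Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of the predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def breaking_ties_margin(probs: Sequence[float]) -> float:
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top, runner_up = sorted(probs, reverse=True)[:2]
    return top - runner_up


# Hypothetical predictions over 5 classes.
two_way_tie = [0.48, 0.47, 0.02, 0.02, 0.01]  # undecided between two classes
spread_out = [0.40, 0.15, 0.15, 0.15, 0.15]   # clear winner, flat remainder

entropy_prefers_spread_out = (
    prediction_entropy(spread_out) > prediction_entropy(two_way_tie)
)
margin_prefers_two_way_tie = (
    breaking_ties_margin(two_way_tie) < breaking_ties_margin(spread_out)
)
```

In this toy case entropy ranks the flat distribution as more uncertain, while the margin picks the instance that sits right on the decision boundary between two classes.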

Hi @chschroeder ,

Thanks for the suggestions. I do have a labelled dataset of 107k instances as a benchmark to test those query strategies and models. However, in the future, it will be replaced with unlabelled datasets (maybe 100-500k products), for which I need to use query strategies to decide which instances should be annotated by a human. The goal here is to reduce human labour: since 1 annotator might only label 1k instances per day, annotation would cost a big budget.
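As a back-of-the-envelope check of the annotation budget, the figures quoted in this thread can be plugged into a few lines of arithmetic. The numbers are the ones reported above; treating a single annotator at a constant rate is a simplifying assumption.

```python
instances_full = 107_000    # labels needed without active learning (from the thread)
instances_active = 30_000   # labels needed with PredictionEntropy (from the thread)
labels_per_day = 1_000      # assumed constant throughput of one annotator

saved = instances_full - instances_active
reduction = saved / instances_full
days_full = instances_full / labels_per_day
days_active = instances_active / labels_per_day

print(f"Labels saved: {saved} ({reduction:.0%})")
print(f"Annotator days: {days_full:.0f} vs. {days_active:.0f}")
```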

Well, in some cases the class distribution can be very unbalanced, i.e. you can have 10k beverages but only 1k tomato soups.

I would also like to know if there is any query strategy that helps "filter out" repetitive product descriptions, e.g. "Nike sneaker white" and "Nike sneaker blue", or "Greek style feta cheese, 200 g" and "Greek style feta cheese, 400 g", while making sure to keep the edge cases (which only appear 1 or 2 times in the whole database).

I would appreciate it if you could share more insights.

How long are those product descriptions? Very short texts might be a problem.

Otherwise improving the text preprocessing sounds like a good idea. For example, if the weight of a product will likely never influence the class you could try to filter out the weight information (100g, 200g, ...). Iirc there were some product categorization challenges on Kaggle, you could find inspiration regarding the preprocessing there.
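The weight-filtering idea can be sketched as follows. The regular expression `QUANTITY`, the unit list, and the helpers `normalize` and `deduplicate` are assumptions made for illustration and would need tuning on real product data. Deduplicating on the normalized text also collapses descriptions that differ only in quantity, while rare descriptions are kept because only exact matches after normalization are merged.

```python
import re
from typing import Dict, Iterable, List

# Assumed pattern: a number followed by a weight or volume unit,
# e.g. "200 g", "500 ml", "4.55 L". The trailing \b keeps "107 Loads" intact.
QUANTITY = re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:kg|g|ml|l)\b", re.IGNORECASE)


def normalize(description: str) -> str:
    """Drop quantity tokens, collapse whitespace, trim stray commas."""
    text = QUANTITY.sub("", description)
    text = re.sub(r"\s+", " ", text)
    return text.strip(" ,").lower()


def deduplicate(descriptions: Iterable[str]) -> List[str]:
    """Keep the first description seen for each normalized form."""
    seen: Dict[str, str] = {}
    for description in descriptions:
        seen.setdefault(normalize(description), description)
    return list(seen.values())


products = [
    "Greek style feta cheese, 200 g",
    "Greek style feta cheese, 400 g",
    "Tomato soup 500 ml",
    "Saffron threads",
]
unique = deduplicate(products)
```

Variants that differ in other attributes, such as the colour in "Nike sneaker white" and "Nike sneaker blue", are not merged by this sketch and would need additional rules.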

Are these categories hierarchical? If so, hierarchical sampling (which small-text does not have yet) may also be interesting.

I am short on time today, but I will add to this during the next days.

Thanks for the suggestions! Yes, text preprocessing will definitely help. The product descriptions vary in length: some only contain the detailed product name, e.g.,

Gain + Odor Defense Liquid Laundry Detergent, Super Fresh Blast Scent, 107 Loads, 4.55 L

while others have a long product description, e.g.,

Sunlight Original Fresh 4L, 100wl, Sunlight Fraîcheur originale 4L
•Great Clean – Liquid laundry detergent with the concentrated power to clean, freshen, whiten and brighten clothes with a Fresh scent.
•Concentrated Formula – Each drop of this washing detergent packs a punch with concentrated stain-fighting action to leave clothes looking great.
•Brightens Laundry – with great brightening power, this detergent is an ideal stain remover for clothes, including those hard-to-remove stains.
•Versatile Use – This clothes detergent is compatible with both standard and high efficiency machines and works for all water temperatures, including cold and hot.
•Fresh Scent – Get fresh laundry every day with washing machine detergent that combines hints of citrus, flowers and forest evergreens to make clothes smell great.

The whole project will use hierarchical categories. However, I currently don't use a hierarchical classifier, because I haven't seen any work well on our training data. Right now, a separate model is trained to predict the category at each hierarchy level, e.g., the food & beverage model only decides whether the product is a bread or a soda, and then the next model, the beverage model, predicts whether the product is a soda or a wine, etc.
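The per-level setup described here can be pictured as a cascade that routes each text to the model of the node predicted one level up. The sketch below is only an illustration: `keyword_classifier` is a toy stand-in for a fine-tuned model, and the category tree in `models` is made up.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str], str]


def keyword_classifier(rules: Dict[str, str], default: str) -> Classifier:
    """Toy stand-in for a fine-tuned model: the first matching keyword wins."""

    def predict(text: str) -> str:
        lowered = text.lower()
        for keyword, label in rules.items():
            if keyword in lowered:
                return label
        return default

    return predict


# One hypothetical model per inner node of the category tree.
models: Dict[str, Classifier] = {
    "root": keyword_classifier(
        {"bread": "bakery", "soda": "beverage", "wine": "beverage"}, "other"
    ),
    "beverage": keyword_classifier(
        {"soda": "soda", "wine": "wine"}, "other beverage"
    ),
}


def predict_path(text: str) -> List[str]:
    """Walk down the tree until a predicted node has no model of its own."""
    path: List[str] = []
    node = "root"
    while node in models:
        node = models[node](text)
        path.append(node)
    return path


wine_path = predict_path("Red wine 750 ml")
bread_path = predict_path("Whole wheat bread")
```

One consequence of such a cascade is that a mistake at an upper level cannot be corrected further down, so the upper-level models deserve the most scrutiny.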

The 107k labeled dataset is a great starting point. I would investigate this step by step. Put a stratified holdout set aside first on which you can check if it generalizes (after all experiments). Then, split again on the remaining data and take this as train/test for classification/active learning experiments. I would try the following changes in that order: 1) Use SetFit 2) Improve preprocessing and 3) test active learning with uncertainty baseline. Feel free to keep me updated, this sounds interesting.

After this, you can see which classes yield better/worse performance metrics. I would expect that minority classes might be a problem, but with so much data, I would tackle this in a data-driven way.
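The two-stage split suggested above (holdout first, then train/test on the remainder) could look roughly like this in plain Python. `stratified_split`, the fractions, and the toy label list are assumptions for illustration; in practice, scikit-learn's `train_test_split` with its `stratify` argument covers the same need.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (rest, held_out) positions; each class gives ~`fraction` to held_out."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    rest: List[int] = []
    held_out: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one instance of every class on the held-out side.
        cut = max(1, round(len(indices) * fraction))
        held_out.extend(indices[:cut])
        rest.extend(indices[cut:])
    return sorted(rest), sorted(held_out)


# Hypothetical imbalanced label list.
labels = ["beverage"] * 100 + ["tomato soup"] * 10

# Step 1: final holdout, only touched after all experiments are done.
remaining, holdout = stratified_split(labels, fraction=0.2)

# Step 2: train/test split on what is left, mapped back to original indices.
remaining_labels = [labels[i] for i in remaining]
train_pos, test_pos = stratified_split(remaining_labels, fraction=0.25, seed=1)
train = [remaining[i] for i in train_pos]
test = [remaining[i] for i in test_pos]
```

Stratifying both splits keeps the minority class represented in every subset, which matters when per-class metrics are compared afterwards.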