Lightning-Universe/lightning-transformers

Pad token not set error for text classification

Closed this issue · 2 comments

๐Ÿ› Bug

python train.py +task=nlp/text_classification +dataset=nlp/text_classification/emotion trainer=sharded backbone.pretrained_model_name_or_path=gpt2
Traceback (most recent call last):
  File "train.py", line 10, in hydra_entry
    main(cfg)
  File "/home/sean/lightning-transformers/lightning_transformers/cli/train.py", line 49, in main
    run(
  File "/home/sean/lightning-transformers/lightning_transformers/cli/train.py", line 32, in run
    data_module.setup("fit")
  File "/home/sean/pytorch-lightning/pytorch_lightning/core/datamodule.py", line 95, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/sean/lightning-transformers/lightning_transformers/core/nlp/huggingface/data.py", line 25, in setup
    dataset = self.process_data(dataset, stage=stage)
  File "/home/sean/lightning-transformers/lightning_transformers/task/nlp/text_classification/data.py", line 13, in process_data
    dataset = TextClassificationDataModule.preprocess(
  File "/home/sean/lightning-transformers/lightning_transformers/task/nlp/text_classification/data.py", line 52, in preprocess
    ds = ds.map(
  File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 286, in map
    {
  File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 287, in <dictcomp>
    k: dataset.map(
  File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1240, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1211, in does_function_return_dict
    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/home/sean/lightning-transformers/lightning_transformers/task/nlp/text_classification/data.py", line 48, in convert_to_features
    return tokenizer(texts_or_text_pairs, **tokenizer_kwargs)
  File "/home/sean/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2254, in __call__
    return self.batch_encode_plus(
  File "/home/sean/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2430, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/home/sean/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2152, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

We'll need to handle this edge case within the text classification and other classification tasks by using the suggested fix!
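For reference, a minimal sketch of the fix the error message suggests, applied to GPT-2 directly with transformers (where exactly to hook this into the lightning-transformers data module is the open question):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 ships without a pad token; reuse the EOS token for padding so that
# batched encoding with padding=True no longer raises the ValueError above.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["I love this!", "This is terrible."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)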

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@SeanNaren I'd like to tackle this if possible. This library is so amazing (I just discovered it this week)!

I don't think this is necessarily a bug, but rather an artifact of certain models that aren't bound by a max_sequence_length (GPT, TXL, etc.); see huggingface/transformers#12594

I'll put together a list of the models where this is the case and open a PR if you like. I have it working with two simple (but potentially hacky) fixes in text_classification/data.py and text_classification/model.py. I'm sure there's a cleaner way to do it, or maybe a better spot where we can check which model is passed in and adjust behavior accordingly.
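For illustration only, here is a rough sketch of what those two adjustments could look like, written against plain transformers objects rather than the actual lightning-transformers hooks (ensure_pad_token is a hypothetical helper, and num_labels=6 assumes the emotion dataset from the repro command):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

def ensure_pad_token(tokenizer, model=None):
    """If the tokenizer has no pad token (GPT-2, Transformer-XL, ...), fall back
    to the EOS token and keep the model config in sync with the tokenizer."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    if model is not None and model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=6)
tokenizer, model = ensure_pad_token(tokenizer, model)

Syncing model.config.pad_token_id with the tokenizer matters for GPT-2 sequence classification in particular, since the model uses the pad token id to locate the last non-padding position in each sequence.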