Pad token not set error for text classification
Closed this issue · 2 comments
🐛 Bug
python train.py +task=nlp/text_classification +dataset=nlp/text_classification/emotion trainer=sharded backbone.pretrained_model_name_or_path=gpt2
Traceback (most recent call last):
File "train.py", line 10, in hydra_entry
main(cfg)
File "/home/sean/lightning-transformers/lightning_transformers/cli/train.py", line 49, in main
run(
File "/home/sean/lightning-transformers/lightning_transformers/cli/train.py", line 32, in run
data_module.setup("fit")
File "/home/sean/pytorch-lightning/pytorch_lightning/core/datamodule.py", line 95, in wrapped_fn
return fn(*args, **kwargs)
File "/home/sean/lightning-transformers/lightning_transformers/core/nlp/huggingface/data.py", line 25, in setup
dataset = self.process_data(dataset, stage=stage)
File "/home/sean/lightning-transformers/lightning_transformers/task/nlp/text_classification/data.py", line 13, in process_data
dataset = TextClassificationDataModule.preprocess(
File "/home/sean/lightning-transformers/lightning_transformers/task/nlp/text_classification/data.py", line 52, in preprocess
ds = ds.map(
File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 286, in map
{
File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/dataset_dict.py", line 287, in <dictcomp>
k: dataset.map(
File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1240, in map
update_data = does_function_return_dict(test_inputs, test_indices)
File "/home/sean/miniconda3/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1211, in does_function_return_dict
function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "/home/sean/lightning-transformers/lightning_transformers/task/nlp/text_classification/data.py", line 48, in convert_to_features
return tokenizer(texts_or_text_pairs, **tokenizer_kwargs)
File "/home/sean/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2254, in __call__
return self.batch_encode_plus(
File "/home/sean/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2430, in batch_encode_plus
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/home/sean/miniconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2152, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
We'll need to handle this edge case within the text classification task (and the other classification tasks) by applying the fix the error message suggests!
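For anyone hitting this outside the library, a minimal standalone sketch of the workaround the error message suggests (the example texts and max_length are made up, and this is not the library's internal code):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer ships without a pad token, so any padded batch encoding
# raises the ValueError shown in the traceback above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    # Reuse the end-of-sequence token for padding, as the error message suggests.
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["i feel great today", "i am so annoyed right now"],  # made-up example texts
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 32])
```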
@SeanNaren I'd like to tackle this if possible. This library is so amazing (I just discovered it this week)!
I don't know that this is necessarily a bug, but rather an artifact of certain models that are not bound by a max_sequence_length (GPT, TXL, etc.); see huggingface/transformers#12594.
I'll put together a list of the models where this is the case and open a PR if you like. I have it working with two simple (but potentially hacky) fixes in text_classification/data.py and text_classification/model.py. I'm sure there is a cleaner way to do it, or maybe a better spot where we could check which model is passed in and adjust behavior accordingly.
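Not the contributor's actual patch, but a rough sketch of the two-part idea (model name and label count are assumptions; the emotion dataset has 6 classes): the data side falls back to the EOS token for padding, and the model side propagates that choice into the config so the sequence-classification head can locate the last non-padding token.

```python
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

model_name = "gpt2"  # the same applies to other models whose tokenizers lack a pad token
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Data-side fix: reuse the EOS token for padding (as in the sketch above).
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Model-side fix: GPT2ForSequenceClassification uses config.pad_token_id to find
# the last real token of each padded sequence, so it must match the tokenizer.
config = AutoConfig.from_pretrained(model_name, num_labels=6)  # emotion has 6 labels
config.pad_token_id = tokenizer.pad_token_id
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
```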