Bart now enforces maximum sequence length in Summarization Pipeline
pwschaedler opened this issue · 21 comments
🐛 Bug
Information
Model I am using (Bert, XLNet ...): Bart (bart-large-cnn)
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
Based on example code in docs, though.
The tasks I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Load default summarization pipeline
- Try to use model to summarize text that has > 1024 tokens
Example code:
from transformers import pipeline
summarizer = pipeline('summarization')
text = '=' * 102570 # Happened to be the length of the file I was testing, my actual file produced 25,257 tokens
print(summarizer(text))
Output:
Token indices sequence length is longer than the specified maximum sequence length for this model (1605 > 1024). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "ex.py", line 4, in <module>
    print(summarizer(text, max_length=250))
  File ".../lib/python3.7/site-packages/transformers/pipelines.py", line 1330, in __call__
    inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
  File ".../lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_utils.py", line 1047, in generate
    encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 292, in forward
    embed_pos = self.embed_positions(input_ids)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 763, in forward
    return super().forward(positions)
  File ".../lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File ".../lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Expected behavior
As of last week (week of 4/26/2020) this caused no issue. Today (5/7/2020) I tried to run the exact same code, a new model was downloaded (no change in transformers module, just the model itself), and now it enforces a token limit.
Expected behavior is to summarize document regardless of size.
Environment info
- transformers version: 2.8.0 (also occurs in 2.9.0)
- Platform: Both macOS 10.15.4 and Windows 10
- Python version: 3.7.5 (Mac) and 3.6.3/Anaconda (Windows)
- PyTorch version (GPU?): 1.5.0, no GPU
- Tensorflow version (GPU?): n/a
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
- Model (from associated JSON file downloaded):
{"url": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin", "etag": "\"6eeacfe81d9304a6c5015424912f8df8\""}
- Model config:
{
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 2.0,
  "max_length": 142,
  "max_position_embeddings": 1024,
  "min_length": 56,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "prefix": " ",
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "no_repeat_ngram_size": 3,
      "num_beams": 4
    }
  },
  "vocab_size": 50264
}
EDIT: Tagging @sshleifer as recommended by docs
@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings.
In the meantime, you can use this code on the latest transformers:
from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    # Keep only the first 1024 tokens of each document, as the pipeline used to do
    input_ids = tokenizer.batch_encode_plus(text, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s) for s in summary_ids]
    return summaries

text = '=' * 10257
old_summarization_pipeline([text])  # the function expects a list of documents
Great, thanks for the replacement code. The token limit (whether it's enforced or implied) might be worth mentioning on the pipeline docs.
Agreed! Would you be interested in sending a PR? The SummarizationPipeline docs live in docs/source/main_classes/pipelines.rst, I believe.
Issue still exists when using summarisation pipeline:
WARNING:transformers.tokenization_utils:Token indices sequence length is longer than the specified maximum sequence length for this model (2817 > 1024). Running this sequence through the model will result in indexing errors
IndexError: index out of range in self
I saw the above work-around but when can we expect this to be fixed in summarization pipeline as well?
I am curious why the token limit in the summarization pipeline stops the process for the default model and for BART but not for the T-5 model? When running "t5-large" in the pipeline it will say "Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 512)" but it will still produce a summary. With the default model or "facebook/bart-large-cnn" models it will give a similar message "Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 1024)." but then fail to produce a summary (and give the following "index out of range in self"). Thanks!
Great Q, (prolly belongs on discuss.huggingface.co in the future :))
T5 uses a technique called relative position bucketing, whereas bart stores 1024 positional embeddings and then looks up each position in them.
Note that T5 will likely perform best with sequences <= 512 tokens, but you are correct that it won't error until OOM.
Relevant T5 code: transformers/src/transformers/modeling_t5.py, line 242 in c67d1a0
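To make the difference concrete, here is a minimal pure-Python sketch, not the library code. `absolute_position_lookup` stands in for a learned table with one row per position (sized from `max_position_embeddings` in the config above), and `relative_position_bucket` is a simplified scalar version of T5-style bidirectional bucketing. Both function names and the default bucket settings are assumptions for illustration.

```python
import math

MAX_POSITIONS = 1024  # max_position_embeddings from the BART config above


def absolute_position_lookup(position, table_size=MAX_POSITIONS):
    """Mimic a learned absolute-position table: one row per position."""
    table = list(range(table_size))  # stand-in for the embedding rows
    return table[position]  # IndexError once position >= table_size


def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Simplified scalar sketch of T5-style bidirectional bucketing."""
    n = -relative_position
    half = num_buckets // 2
    offset = half if n < 0 else 0  # one half of the buckets per direction
    n = abs(n)
    max_exact = half // 2
    if n < max_exact:
        return offset + n  # small distances get their own bucket
    # Larger distances share logarithmically sized buckets, capped at the last one.
    scaled = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact) * (half - max_exact)
    )
    return offset + min(scaled, half - 1)
```

Any distance, however large, maps to one of the `num_buckets` buckets, so there is no table to run off the end of, whereas the absolute lookup fails at position 1024.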
@sshleifer what's the typical recommendation for summarization on larger documents? Chunk them and generate summaries or any other tips?
EDIT: Cross-posted here, I think this is a much better place for this.
This is what I use currently but open to better recommendations.
# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = []
            length = 0
    if sent:
        nested.append(sent)
    return nested

# generate summary on text with <= 1024 tokens
def generate_summary(nested_sentences):
    device = 'cuda'
    summaries = []
    for nested in nested_sentences:
        input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
        input_tokenized = input_tokenized.to(device)
        summary_ids = bart_model.to(device).generate(input_tokenized,
                                                     length_penalty=3.0,
                                                     min_length=30,
                                                     max_length=100)
        output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
        summaries.append(output)
    summaries = [sentence for sublist in summaries for sentence in sublist]
    return summaries
Hi!
nest_sentences() has a bug. Whenever a chunk is ready to be saved in 'nested' the current sentence is ignored.
Yes, my bad: one sentence is skipped. It can be fixed as follows. Effects of implementing it in late hours ;)
Good catch @echatzikyriakidis thanks!
# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(sent)
    return nested
Hi @dipanjanS !
Thank you! This is exactly the way I did it also.
I think there is another catch.
What if a single sentence is > 512 tokens in the case of T5 models, or > 1024 in the case of BART (a rare scenario)?
I think there will be no problem because of truncation=True, right? Or is it going to fail? Maybe we need to skip it or split it in half.
Great. I think in those cases 1024 is a hard-coded magic number that could be made configurable and replaced with the max length allowed by that specific model, maybe as a function parameter.
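A minimal sketch of that idea, assuming the sentences have already been split (for example with nltk.sent_tokenize as above). The `max_length` and `length_fn` parameters are hypothetical additions, not part of the earlier snippet: `length_fn` defaults to `len` (characters, like the original), and a token counter can be passed instead.

```python
from typing import Callable, List


def nest_sentences(sentences: List[str],
                   max_length: int = 1024,
                   length_fn: Callable[[str], int] = len) -> List[List[str]]:
    """Group sentences into chunks whose summed length stays within max_length."""
    nested: List[List[str]] = []
    current: List[str] = []
    total = 0
    for sentence in sentences:
        size = length_fn(sentence)
        # Close the current chunk before it would exceed the limit.
        if current and total + size > max_length:
            nested.append(current)
            current, total = [], 0
        current.append(sentence)
        total += size
    if current:
        nested.append(current)
    return nested
```

To count tokens rather than characters, something like `length_fn=lambda s: len(tokenizer.encode(s, add_special_tokens=False))` could be passed, with `max_length` set a little below the model's limit to leave room for special tokens. A sentence that is longer than `max_length` on its own still ends up in a chunk by itself, so it is not dropped, but it is not shortened either.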
Hi @dipanjanS,
This is the way I have done it.
But again, what if a sentence is longer than the model's max input length?
What will happen then?
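With truncation=True the tokenizer should cut the input at the limit rather than fail, so the tail of an over-long sentence would be silently dropped. If losing that text is not acceptable, one option is to split the sentence before chunking. The sketch below is only an illustration: `split_long_sentence` is a hypothetical helper, and it counts whitespace-separated words as a rough proxy for tokens, so `max_words` should be chosen conservatively.

```python
from typing import List


def split_long_sentence(sentence: str, max_words: int) -> List[str]:
    """Break a sentence into consecutive pieces of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = sentence.split()
    # Step through the words in windows of max_words; the last piece may be shorter.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Sentences already under the limit come back as a single piece, so the helper can be applied to every sentence before grouping.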
Hi @dipanjanS,
Exactly, I have tested it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @sshleifer first of all thanks for creating and maintaining this repo!
I'm exploring the pipelines and sadly the replacement code you shared no longer works.
I added truncation=True to the tokenizer.batch_encode_plus method but another error happened: ValueError: expected sequence of length 2 at dim 1 (got 3) in tokenization_utils_base.py
I saw in above discussion you were considering undoing this hard limit on the pipelines, perhaps the limit can be exposed in a configuration file or as a parameter?
Could you please suggest how to overcome the hard limit?
This is my current config:
[tool.poetry.dependencies]
python = "^3.8"
transformers = "^4.2.2"
torch = "^1.7.1"
- No GPU
- OS is Linux
- Model: "sshleifer/distilbart-cnn-12-6"
Thanks!
Hi @ig-perez ,
I realize this reply comes a little late to your question, but maybe it can still help you or someone else out. Here is the code from @sshleifer with some modifications to make it work for the current version.
from typing import List

from transformers import BartForConditionalGeneration, BartTokenizer

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    # truncation=True cuts each document at max_length; padding=True lets documents of different lengths share one tensor
    input_ids = tokenizer.batch_encode_plus(text, truncation=True, padding=True, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=False) for s in summary_ids]
    return summaries

print(old_summarization_pipeline([ARTICLE_TO_SUMMARIZE, ARTICLE_TO_SUMMARIZE_2, ARTICLE_TO_SUMMARIZE_2 * 400]))
I tried it with:
- transformers=4.4.2
- pytorch=1.8.0=py3.8_cuda10.2_cudnn7.6.5_0
Unfortunately, this problem also manifests when deploying BART on SageMaker via sagemaker.huggingface.HuggingFaceModel. When a request with > 1024 tokens is sent, the SageMaker endpoint crashes with an out-of-range CUDA error (we're using GPU instances). What's worse, subsequent requests with smaller inputs fail with the same CUDA error. The only fix is to redeploy the endpoint.
For now, we're using an encode-truncate-decode workaround like below, but there clearly has to be a better way:
# Inputs longer than 1024 tokens cause irrecoverable CUDA errors on
# SageMaker. Make sure that each text is at most 1024 tokens.
inputs = self.tokenizer(texts, max_length=1024, padding="longest",
                        truncation=True)
truncated_texts = [self.tokenizer.decode(i, skip_special_tokens=True, clean_up_tokenization_spaces=False)
                   for i in inputs["input_ids"]]
output = predictor.predict({"inputs": truncated_texts, "parameters": parameters})
summaries = [summary["summary_text"] for summary in output]
@dipanjanS can you post the full code? It is missing a lot of parts:
- nltk missing
- bart_tokenizer missing
- bart_model missing
@dipanjanS Thanks for sharing your take on how to chunk large texts for summarization. Following up on @FurkanGozukara's request: could you possibly provide the parts that are missing?
Thanks in advance for your help.
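For anyone landing here with the same question, below is a hedged sketch of how the missing pieces might be filled in. It is not the original author's code. `load_bart` and `summarize_chunks` are hypothetical helper names, the checkpoint is assumed to be facebook/bart-large-cnn as used earlier in this thread, and the generation settings are copied from generate_summary above.

```python
from typing import List, Sequence

MODEL_NAME = "facebook/bart-large-cnn"  # assumption: same checkpoint as earlier in the thread


def load_bart(device: str = "cpu"):
    """Create the bart_tokenizer / bart_model objects the snippet relies on."""
    # Imported here so the rest of the sketch can be read and tested without the library.
    from transformers import BartForConditionalGeneration, BartTokenizer

    bart_tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
    bart_model = BartForConditionalGeneration.from_pretrained(MODEL_NAME).to(device)
    return bart_tokenizer, bart_model


def summarize_chunks(chunks: Sequence[Sequence[str]], tokenizer, model,
                     device: str = "cpu", max_input_tokens: int = 1024) -> List[str]:
    """Summarize each chunk (a list of sentences) and return one summary per chunk."""
    summaries: List[str] = []
    for chunk in chunks:
        # truncation=True guards against a chunk that still exceeds the model limit.
        batch = tokenizer(" ".join(chunk), truncation=True,
                          max_length=max_input_tokens, return_tensors="pt")
        summary_ids = model.generate(batch["input_ids"].to(device),
                                     attention_mask=batch["attention_mask"].to(device),
                                     length_penalty=3.0, min_length=30, max_length=100)
        summaries.extend(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
    return summaries
```

The nest_sentences function above additionally needs `import nltk` and the sentence tokenizer data, typically fetched once with `nltk.download('punkt')` (the resource name may differ depending on the NLTK version). With that in place, the pieces would be combined roughly as `bart_tokenizer, bart_model = load_bart('cuda')` followed by `summarize_chunks(nest_sentences(document), bart_tokenizer, bart_model, device='cuda')`.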