huggingface/transformers

Bart now enforces maximum sequence length in Summarization Pipeline

pwschaedler opened this issue · 21 comments

๐Ÿ› Bug

Information

Model I am using (Bert, XLNet ...): Bart (bart-large-cnn)

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Based on example code in docs, though.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load default summarization pipeline
  2. Try to use model to summarize text that has > 1024 tokens

Example code:

from transformers import pipeline
summarizer = pipeline('summarization')
text = '=' * 102570    # Happened to be the length of the file I was testing; my actual file produced 25,257 tokens
print(summarizer(text, max_length=250))

Output:

Token indices sequence length is longer than the specified maximum sequence length for this model (1605 > 1024). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "ex.py", line 4, in <module>
    print(summarizer(text, max_length=250))
  File ".../lib/python3.7/site-packages/transformers/pipelines.py", line 1330, in __call__
    inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
  File ".../lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_utils.py", line 1047, in generate
    encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 292, in forward
    embed_pos = self.embed_positions(input_ids)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 763, in forward
    return super().forward(positions)
  File ".../lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File ".../lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Expected behavior

As of last week (week of 4/26/2020) this caused no issue. Today (5/7/2020) I tried to run the exact same code; a new model was downloaded (no change to the transformers module, just the model itself), and now it enforces a token limit.

Expected behavior is to summarize the document regardless of its size.

Environment info

  • transformers version: 2.8.0 (also occurs in 2.9.0)
  • Platform: Both macOS 10.15.4 and Windows 10
  • Python version: 3.7.5 (Mac) and 3.6.3/Anaconda (Windows)
  • PyTorch version (GPU?): 1.5.0, no GPU
  • Tensorflow version (GPU?): n/a
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no
  • Model (from associated JSON file downloaded): {"url": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin", "etag": "\"6eeacfe81d9304a6c5015424912f8df8\""}
  • Model config:
{
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 2.0,
  "max_length": 142,
  "max_position_embeddings": 1024,
  "min_length": 56,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "prefix": " ",
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "no_repeat_ngram_size": 3,
      "num_beams": 4
    }
  },
  "vocab_size": 50264
}

EDIT: Tagging @sshleifer as recommended by the docs

#3857 might also be a culprit

@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings.
In the meantime, you can use this code on the latest transformers:

from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s) for s in summary_ids]
    return summaries
text = '=' * 10257
old_summarization_pipeline([text])  # the function expects a list of documents

Great, thanks for the replacement code. The token limit (whether it's enforced or implied) might be worth mentioning in the pipeline docs.

Agreed! Would you be interested in sending a PR? The SummarizationPipeline docs live in docs/source/main_classes/pipelines.rst I believe.

The issue still exists when using the summarization pipeline:

WARNING:transformers.tokenization_utils:Token indices sequence length is longer than the specified maximum sequence length for this model (2817 > 1024). Running this sequence through the model will result in indexing errors
IndexError: index out of range in self

I saw the work-around above, but when can we expect this to be fixed in the summarization pipeline as well?

I am curious why the token limit in the summarization pipeline stops the process for the default model and for BART but not for the T5 model. When running "t5-large" in the pipeline it says "Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 512)" but it still produces a summary. With the default model or the "facebook/bart-large-cnn" model it gives a similar message, "Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 1024).", but then fails to produce a summary (and gives the "index out of range in self" error). Thanks!

Great question (this probably belongs on discuss.huggingface.co in the future :))

T5 uses a technique called relative position bucketing, whereas BART stores 1024 positional embeddings and then looks up each position in them.
Note that T5 will likely perform best with sequences of <= 512 tokens, but you are correct that it won't error until it runs out of memory.

Relevant T5 code:

def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
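
To make the difference concrete, here is a minimal sketch (illustrative only, not the actual modeling code) of why a fixed table of learned positional embeddings fails on out-of-range positions while a relative-position lookup does not:

import torch
import torch.nn as nn

# BART-style: one learned embedding per absolute position. Any position index
# >= the table size raises "IndexError: index out of range in self", which is
# exactly the error in the traceback above.
max_positions = 1024
pos_embed = nn.Embedding(max_positions, 16)

pos_embed(torch.arange(1000))        # within the table: fine
try:
    pos_embed(torch.arange(1605))    # 1605 > 1024, as in the warning above
except IndexError as e:
    print(e)                         # index out of range in self

# T5-style (schematically): each attention layer looks up a *relative distance*
# bucket (num_buckets=32 by default), so the lookup index never grows with the
# sequence length. Long inputs therefore don't crash; they just exceed the
# lengths the model was trained on and eventually run out of memory.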

@sshleifer what's the typical recommendation for summarization on larger documents? Chunk them and generate summaries or any other tips?

EDIT: Cross-posted here, I think this is a much better place for this.

This is what I use currently but open to better recommendations.

# generate chunks of text / sentences, keeping each chunk under 1024 characters (a rough proxy for the 1024-token limit)
def nest_sentences(document):
  nested = []
  sent = []
  length = 0
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 1024:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = []
      length = 0

  if sent:
    nested.append(sent)
  return nested

# generate a summary for each chunk of <= 1024 tokens
def generate_summary(nested_sentences):
  device = 'cuda'
  summaries = []
  for nested in nested_sentences:
    input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
    input_tokenized = input_tokenized.to(device)
    summary_ids = bart_model.to(device).generate(input_tokenized,
                                      length_penalty=3.0,
                                      min_length=30,
                                      max_length=100)
    output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
    summaries.append(output)
  summaries = [sentence for sublist in summaries for sentence in sublist]
  return summaries

Hi!

nest_sentences() has a bug: whenever a chunk is ready to be saved in 'nested', the current sentence is dropped.

Yes, my bad, one sentence is skipped; it can be fixed as follows. Effects of implementing it in the late hours ;)

Good catch @echatzikyriakidis thanks!

# generate chunks of text / sentences, keeping each chunk under 1024 characters (a rough proxy for the 1024-token limit)
def nest_sentences(document):
  nested = []
  sent = []
  length = 0
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 1024:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = [sentence]
      length = len(sentence)

  if sent:
    nested.append(sent)
  return nested

Hi @dipanjanS !

Thank you! This is exactly the way I did it also.

I think there is another catch.

What if a sentence is > 512 tokens in the case of T5 models, or > 1024 in the case of BART (a rare scenario)?

I think there will be no problem because of truncation=True, right? Or is it going to fail? Maybe we need to skip it or split it in half.

Great. I think in those cases 1024 is a hard-coded magic number that could be made configurable and replaced with the max length allowed by that specific model, maybe as a function parameter. A rough sketch of that idea is below.
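
Something like the following, as a sketch only: the limit comes from tokenizer.model_max_length and chunks are measured in actual tokens rather than characters (both of these are additions here, not part of the snippets above).

# Sketch: chunk by token count, with the limit taken from the tokenizer rather
# than a hard-coded 1024. Assumes NLTK's sentence tokenizer and a loaded
# `tokenizer` (e.g. BartTokenizer or T5Tokenizer).
import nltk

def nest_sentences(document, tokenizer, max_tokens=None):
    # model_max_length is 1024 for facebook/bart-large-cnn and 512 for t5-large
    max_tokens = max_tokens or tokenizer.model_max_length
    nested, current, length = [], [], 0
    for sentence in nltk.sent_tokenize(document):
        n_tokens = len(tokenizer.tokenize(sentence))
        if length + n_tokens <= max_tokens:
            current.append(sentence)
            length += n_tokens
        else:
            if current:
                nested.append(current)
            # a single sentence longer than max_tokens still relies on
            # truncation=True when the chunk is encoded later
            current, length = [sentence], n_tokens
    if current:
        nested.append(current)
    return nested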

Hi @dipanjanS,

This is the way I have done it.

But again, what if a sentence is longer than the model's max input length?

What will happen then?

Hi @dipanjanS,

Exactly, I have tested it.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi @sshleifer first of all thanks for creating and maintaining this repo!

I'm exploring the pipelines and sadly the replacement code you shared no longer works.

I added truncation=True to the tokenizer.batch_encode_plus method, but another error happened: ValueError: expected sequence of length 2 at dim 1 (got 3) in tokenization_utils_base.py

I saw in the discussion above that you were considering undoing this hard limit on the pipelines; perhaps the limit could be exposed in a configuration file or as a parameter?

Could you please suggest how to overcome the hard limit?

This is my current config:

[tool.poetry.dependencies]
python = "^3.8"
transformers = "^4.2.2"
torch = "^1.7.1"
  • No GPU
  • OS is Linux
  • Model: "sshleifer/distilbart-cnn-12-6"

Thanks!

Hi @ig-perez ,
I realize this reply comes a little late to your question, but maybe it can still help you or someone else out. Here is the code from @sshleifer with some modifications to make it work for the current version.

from typing import List
from transformers import BartForConditionalGeneration, BartTokenizer

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, truncation=True, padding=True, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=False) for s in summary_ids]
    return summaries

# ARTICLE_TO_SUMMARIZE and ARTICLE_TO_SUMMARIZE_2 are your input documents
print(old_summarization_pipeline([ARTICLE_TO_SUMMARIZE, ARTICLE_TO_SUMMARIZE_2, ARTICLE_TO_SUMMARIZE_2 * 400]))

I tried it with:

  • transformers=4.4.2
  • pytorch=1.8.0=py3.8_cuda10.2_cudnn7.6.5_0

Unfortunately, this problem also manifests when deploying BART on SageMaker via sagemaker.huggingface.HuggingFaceModel. When a request with > 1024 tokens is sent, the SageMaker endpoint crashes with an out-of-range CUDA error (we're using GPU instances). What's worse, subsequent requests with smaller inputs fail with the same CUDA error. The only fix is to redeploy the endpoint.

For now, we're using an encode-truncate-decode workaround like below, but there clearly has to be a better way:

# Inputs longer than 1024 tokens cause irrecoverable CUDA errors on
# SageMaker. Make sure that each text is at most 1024 tokens.
inputs = self.tokenizer(texts, max_length=1024, padding="longest",
                        truncation=True)
truncated_texts = [self.tokenizer.decode(i, skip_special_tokens=True, clean_up_tokenization_spaces=False)
                   for i in inputs["input_ids"]]
output = predictor.predict({"inputs": truncated_texts, "parameters": parameters})
summaries = [summary["summary_text"] for summary in output]

@dipanjanS can you write out the full code? It is missing a lot of parts:

  • nltk is missing
  • bart_tokenizer is missing
  • bart_model is missing

@dipanjanS Thanks for sharing your take on how to chunk large texts for summarization. I'm following up on @FurkanGozukara's request: could you possibly provide the parts that are missing?
Thanks in advance for your help.
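
For anyone else who lands here, a minimal self-contained sketch of those missing pieces might look like the following; the facebook/bart-large-cnn checkpoint and the nltk.download('punkt') step are assumptions, not necessarily the exact setup used above.

# Sketch of the setup the chunk-and-summarize snippets above assume.
import nltk
from transformers import BartForConditionalGeneration, BartTokenizer

nltk.download('punkt')  # sentence tokenizer behind nltk.sent_tokenize

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

document = "..."  # your long input text

# nest_sentences() and generate_summary() are the functions posted above;
# note that generate_summary() moves the model to 'cuda', so change the
# device there to 'cpu' if you don't have a GPU.
chunks = nest_sentences(document)
summaries = generate_summary(chunks)
print(' '.join(summaries))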