shushanxingzhe/transformers_ner

IndexError: list index out of range

drussellmrichie opened this issue · 2 comments

Thanks for providing this code. I'd love to use it, but am getting the following error when running the trainer.

(py38_test) [richier@reslnapollo02 transformers_ner]$ python bert_crf_trainer.py 
Downloading builder script: 9.52kB [00:00, 8.24MB/s]                                                                                               
Downloading metadata: 3.79kB [00:00, 3.99MB/s]                                                                                                     
Reusing dataset conll2003 (/home/richier/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/63f4ebd1bcb7148b1644497336fd74643d4ce70123334431a3c053b7ee4e96ee)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 495.17it/s]
Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14042
}) Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 3454
})
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertCRF: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertCRF from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertCRF from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertCRF were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'crf.transitions', 'crf.end_transitions', 'classifier.bias', 'crf.start_transitions']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|                                                                                                                        | 0/1 [00:04<?, ?ba/s]
Traceback (most recent call last):
  File "bert_crf_trainer.py", line 59, in <module>
    train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1955, in map
    return self._map_single(
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 520, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 487, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2339, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2220, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1915, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "bert_crf_trainer.py", line 24, in tokenize
    tokenids = tokenizer(tokens, add_special_tokens=False)
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2288, in __call__
    return self.batch_encode_plus(
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2473, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/richier/anaconda3/envs/py38_test/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 418, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range

Any idea what's going on?
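
For what it's worth, the failing frame (tokens_and_encodings[0][0] inside _batch_encode_plus) points at the tokenizer being called with an empty list of tokens. Here's a minimal sketch, not from the repo, that appears to reproduce the same error with the transformers version in the traceback:

    # Calling a fast tokenizer on an empty token list goes through
    # batch_encode_plus with an empty batch and trips the same IndexError.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
    tokenizer([], add_special_tokens=False)  # IndexError: list index out of range

If that is the cause, at least one example in the CoNLL batch presumably has an empty tokens list.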

Unrelatedly, the datasets package needs to be newer than the version pinned in requirements.txt. See this issue: huggingface/datasets#3582.

Y1ran commented

Same problem here. Has anyone else run into this?

In the tokenize(batch) function, replace

    for tokens, label in zip(batch['tokens'], batch['label_ids']):
        tokenids = tokenizer(tokens, add_special_tokens=False)

with

    for tokens, label in zip(batch['tokens'], batch['label_ids']):
        if tokens:
            tokenids = tokenizer(tokens, add_special_tokens=False)
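
Note that with this guard the rest of the loop body for that example presumably has to sit under the same if, otherwise the tokenids from the previous iteration would be reused. An alternative, assuming the empty examples come from the dataset itself, is to drop them before mapping; a sketch using datasets' filter, reusing the train_dataset name from the script (the other split would need the same treatment):

    # Drop examples whose token list is empty before running tokenize().
    train_dataset = train_dataset.filter(lambda example: len(example['tokens']) > 0)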