roeeaharoni/unsupervised-domain-clusters

KeyError: 'token_type_ids'


While running the Domain-Cosine Data Selection step, I got this error when indexing `inputs`:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-30-94419d0057ff> in <module>
     15         print('{} {} {}'.format(domain, split, len(lines)))
     16         if not os.path.exists(save_path):
---> 17             encode_text_file_and_save(file_path, save_path, max_lines_to_encode)
     18         else:
     19             print('already encoded, skipping...')

<ipython-input-29-b770826372b4> in encode_text_file_and_save(file_path, output_path, max_lines_to_encode)
    104     input_features = convert_text_file_to_features(file_path, tokenizer, 
    105                                                    max_length=128,
--> 106                                                    max_lines_to_encode=max_lines_to_encode)
    107     tensor_dataset = features_to_tensor_dataset(input_features)
    108     start = time.time()

<ipython-input-29-b770826372b4> in convert_text_file_to_features(file_path, tokenizer, max_length, pad_token, pad_token_segment_id, mask_padding_with_zero, max_lines_to_encode)
     33             add_special_tokens=True,
     34             max_length=max_length)
---> 35         input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
     36         attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
     37         padding_length = max_length - len(input_ids)

/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in __getitem__(self, item)
    228         """
    229         if isinstance(item, str):
--> 230             return self.data[item]
    231         elif self._encodings is not None:
    232             return self._encodings[item]

KeyError: 'token_type_ids'

This seems like a versioning issue. Try the solution here: lyuqin/HydraNet-WikiSQL#1
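
If you would rather not change the tokenizer call, a minimal defensive sketch (my own workaround, not necessarily the one in the linked issue) is to guard the failing lookup in convert_text_file_to_features and fall back to all-zero segment ids, which is what BERT-style models expect for single-segment input. Here `line` stands in for the text being encoded:

inputs = tokenizer(line,
                   add_special_tokens=True,
                   max_length=max_length)
input_ids = inputs["input_ids"]
# Fall back to zeros if this tokenizer/version does not return token_type_ids.
token_type_ids = inputs.get("token_type_ids", [0] * len(input_ids))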

This problem is caused by the __call__ method of the tokenizer class. Try passing return_token_type_ids=True.

Example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer(
    text=example_text,
    add_special_tokens=True,
    max_length=max_length,
    return_token_type_ids=True,  # explicitly request token_type_ids
)
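
Background: recent transformers versions only include token_type_ids when the model actually uses them, so tokenizers for models without segment embeddings (DistilBERT, for example) drop the key by default. A quick illustration (distilbert-base-uncased is just an example checkpoint, not necessarily the one used in this repo):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("token_type_ids" in tok("hello world"))                              # False
print("token_type_ids" in tok("hello world", return_token_type_ids=True))  # True (all zeros)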

Adding return_token_type_ids=True to the tokenizer call worked for me.