huggingface/transformers

OverflowError: can't convert negative int to unsigned [finetuning XLNet]

ZHAOFEGNSHUN opened this issue · 1 comment

System Info

File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 592, in tokenize
return self._first_module().tokenize(texts, **kwargs)
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 146, in tokenize
self.tokenizer(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2858, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2944, in _call_one
return self.batch_encode_plus(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3135, in batch_encode_plus
return self._batch_encode_plus(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 496, in _batch_encode_plus
self.set_truncation_and_padding(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 451, in set_truncation_and_padding
self._tokenizer.enable_truncation(**target)
OverflowError: can't convert negative int to unsigned
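The traceback bottoms out in `enable_truncation`, which casts `max_length` to an unsigned integer, so any negative value raises this `OverflowError`. A likely mechanism (an assumption, based on how `sentence_transformers.models.Transformer` derives its default sequence length): `XLNetConfig` reports `-1` for `max_position_embeddings` because XLNet has no fixed positional limit, and taking the minimum with the fast tokenizer's "no limit" `model_max_length` sentinel yields `-1`. A minimal sketch of that arithmetic, with assumed values:

```python
# Assumed values: XLNet's fast tokenizer uses a huge "no limit" sentinel,
# while XLNetConfig reports -1 (no fixed positional limit).
tokenizer_model_max_length = int(1e30)
config_max_position_embeddings = -1

# sentence-transformers derives a default max_seq_length roughly like this:
max_seq_length = min(config_max_position_embeddings, tokenizer_model_max_length)
print(max_seq_length)  # -1, later rejected by enable_truncation's unsigned cast
```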

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
import logging
from datetime import datetime
import sys
import os
import gzip
import csv
from transformers import AutoTokenizer
from transformers import XLNetTokenizer
logging.getLogger().setLevel(logging.INFO)


model_name = sys.argv[1] if len(sys.argv) > 1 else "/nfs/XLNet/XLNet/xlnet-base-cased"



train_batch_size = 4
num_epochs = 4
model_save_path = (
    "output/training_stsbenchmark_" + model_name.replace("/", "-") + "-" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
)

word_embedding_model = models.Transformer(model_name)

pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])




train_samples = []
dev_samples = []

with open('/nfs/XLNet/XLNet/al.csv', "r", encoding="utf-8") as csvfile:
    reader = csv.DictReader(csvfile)
    
    for row in reader:
        # STS-style scores range from 0 to 5; normalize to [0, 1]
        score = float(row["score"]) / 5.0
        sentence1 = row["sentence1"]
        sentence2 = row["sentence2"]

        inp_example = InputExample(texts=[sentence1, sentence2], label=score)
        train_samples.append(inp_example)
        dev_samples.append(inp_example)



train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-dev")


warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)  # warm up over 10% of training steps
print("Warmup-steps: {}".format(warmup_steps))
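The 10% warm-up heuristic above works out as follows; the batch count here is hypothetical, standing in for `len(train_dataloader)`:

```python
import math

steps_per_epoch = 1000  # hypothetical len(train_dataloader)
num_epochs = 4
# Warm up over the first 10% of all training steps
warmup_steps = math.ceil(steps_per_epoch * num_epochs * 0.1)
print(warmup_steps)  # 400
```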



model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,  # required for save_best_model to track the best checkpoint
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path=model_save_path,
    save_best_model=True,
)

Expected behavior

I expect the script to train without raising an OverflowError.

Hi @ZHAOFEGNSHUN, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.
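For readers who land on this error: since `enable_truncation` rejects non-positive lengths, a commonly suggested workaround is to set the sequence length explicitly, e.g. `models.Transformer(model_name, max_seq_length=512)`, rather than letting it be derived from the model config. A hypothetical guard helper (not part of either library) sketching the same idea:

```python
def safe_max_length(candidate, fallback=512):
    """Return a usable max_length, replacing non-positive sentinels
    (XLNet configs can report -1) with an explicit fallback."""
    return candidate if candidate > 0 else fallback

print(safe_max_length(-1))   # 512: sentinel replaced by the fallback
print(safe_max_length(384))  # 384: positive lengths pass through unchanged
```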