OverflowError: can't convert negative int to unsigned [finetuning XLNet]
ZHAOFEGNSHUN opened this issue · 1 comment
System Info
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 592, in tokenize
return self._first_module().tokenize(texts, **kwargs)
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 146, in tokenize
self.tokenizer(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2858, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2944, in _call_one
return self.batch_encode_plus(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3135, in batch_encode_plus
return self._batch_encode_plus(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 496, in _batch_encode_plus
self.set_truncation_and_padding(
File "/home/luban/.conda/envs/my/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 451, in set_truncation_and_padding
self._tokenizer.enable_truncation(**target)
OverflowError: can't convert negative int to unsigned
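For context, here is a minimal sketch of how a negative length can reach `enable_truncation`. This is an assumption reconstructed from the traceback, not a confirmed diagnosis: XLNet uses relative attention and reportedly reports `max_position_embeddings == -1` ("no fixed limit"), while its tokenizer's `model_max_length` is a very large sentinel, so taking the minimum of the two yields -1, which the Rust fast tokenizer cannot convert to an unsigned integer.

```python
VERY_LARGE_INTEGER = int(1e30)  # sentinel transformers uses for "no length limit"

def derived_max_seq_length(max_position_embeddings: int, model_max_length: int) -> int:
    # Hypothetical reconstruction of the limit selection: take the smaller of
    # the model's positional limit and the tokenizer's limit. If the model
    # reports -1 (no fixed limit, as XLNet does), the result is -1, and
    # passing -1 as max_length down to the Rust fast tokenizer raises
    # "OverflowError: can't convert negative int to unsigned".
    return min(max_position_embeddings, model_max_length)

print(derived_max_seq_length(-1, VERY_LARGE_INTEGER))   # -> -1
print(derived_max_seq_length(512, VERY_LARGE_INTEGER))  # -> 512
```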
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
import logging
from datetime import datetime
import sys
import os
import gzip
import csv
from transformers import AutoTokenizer
from transformers import XLNetTokenizer
logging.getLogger().setLevel(logging.INFO)
model_name = sys.argv[1] if len(sys.argv) > 1 else "/nfs/XLNet/XLNet/xlnet-base-cased"
train_batch_size = 4
num_epochs = 4
model_save_path = (
    "output/training_stsbenchmark_" + model_name.replace("/", "-") + "-" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
)
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
train_samples = []
dev_samples = []
with open('/nfs/XLNet/XLNet/al.csv', "r", encoding="utf-8") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        score = float(row["score"]) / 5.0
        sentence1 = row["sentence1"]
        sentence2 = row["sentence2"]
        inp_example = InputExample(texts=[sentence1, sentence2], label=score)
        train_samples.append(inp_example)
        dev_samples.append(inp_example)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-dev")
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
print("Warmup-steps: {}".format(warmup_steps))
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path=model_save_path,
    save_best_model=True,
)
Expected behavior
I expect the training script to run without raising this error.
Hi @ZHAOFEGNSHUN, thanks for raising an issue!
This is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.