Tokenizer dataset is very slow
ManuSinghYadav opened this issue · 2 comments
ManuSinghYadav commented
I have a fast tokenizer. However, it's still taking about 20 seconds per example for tokenization, which is too slow.
Here's the code:

```python
base_model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

max_length = 1026

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

train_dataset = dataset_split['train']
eval_dataset = dataset_split['test']

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)
```
Can someone please help me figure out what I'm missing? Thanks.
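(Note for anyone trying to reproduce this: `formatting_func` and `dataset_split` are not defined in the snippet above. A minimal stand-in, purely hypothetical and not part of the original post, could look like this.)

```python
from datasets import Dataset

# Hypothetical stand-ins for the pieces missing from the snippet above.
# The real formatting_func and dataset are not shown in the issue.
def formatting_func(example):
    # Assume each example has a single "text" field containing the prompt.
    return example["text"]

dataset_split = Dataset.from_dict(
    {"text": ["Example prompt one.", "Example prompt two."] * 4}
).train_test_split(test_size=0.25)
```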
Narsil commented
It's impossible to reproduce anything with your script since you're not sharing the data. For what it's worth, the tokenizer itself runs in well under a millisecond:
```python
from transformers import AutoTokenizer
import datetime

base_model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

start = datetime.datetime.now()
tokenizer("This is a test")
print(f"Done in {datetime.datetime.now() - start}")
```

```
Done in 0:00:00.000311
```
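Since the tokenizer call itself returns in well under a millisecond, the per-example cost most likely comes from the `map` call rather than from tokenization. A common remedy (an assumption about the fix, not something confirmed in this thread) is to map in batches and drop the up-front `padding="max_length"`, padding dynamically at training time instead. A minimal sketch, assuming a hypothetical `"text"` column that holds the formatted prompts:

```python
# Sketch only: "text" is a hypothetical column name; in practice this should
# build the same prompts that formatting_func produces, one per example.
def tokenize_batch(batch):
    result = tokenizer(
        batch["text"],              # list of strings when batched=True
        truncation=True,
        max_length=max_length,
        # no padding="max_length" here; pad per batch at training time instead
    )
    result["labels"] = [ids.copy() for ids in result["input_ids"]]
    return result

tokenized_train_dataset = train_dataset.map(tokenize_batch, batched=True)
tokenized_val_dataset = eval_dataset.map(tokenize_batch, batched=True)
```

Padding can then be applied per batch by a collator such as `DataCollatorForLanguageModeling`, which avoids padding every example to 1026 tokens up front; `map(..., num_proc=N)` can also parallelize the preprocessing across processes.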