huggingface/tokenizers

Tokenizer dataset is very slow

ManuSinghYadav opened this issue · 2 comments

I'm using a fast tokenizer, but it's still taking about 20 seconds per example to tokenize my dataset, which is far too slow.

Here's the code:

from transformers import AutoTokenizer

base_model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token
max_length = 1026

def generate_and_tokenize_prompt(prompt):
    # formatting_func (defined elsewhere) builds the prompt string from a dataset row
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

train_dataset = dataset_split['train']
eval_dataset = dataset_split['test']
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

Can someone please help me figure out what I'm missing? Thanks.

It's impossible to reproduce anything from your script since you're not sharing the data. The tokenizer itself is fast, though:

from transformers import AutoTokenizer
import datetime

base_model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token


start = datetime.datetime.now()
tokenizer("This is a test")
print(f"Done in {datetime.datetime.now() - start}")
Done in 0:00:00.000311
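
Since a single tokenizer call takes well under a millisecond, the slowdown is almost certainly coming from formatting_func or from mapping one example at a time, not from the tokenizer itself. Below is a minimal sketch of a batched map, assuming formatting_func takes a single example dict and that dataset_split, tokenizer, and max_length are the objects from your script; the batch-to-rows conversion and the num_proc value are illustrative, not a confirmed fix.

def generate_and_tokenize_batch(batch):
    # Rebuild per-row dicts from the columnar batch so the existing
    # formatting_func (which expects one example) can be reused as-is.
    rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
    texts = [formatting_func(row) for row in rows]
    result = tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    # With batched input, input_ids is a list of lists; copy each for labels.
    result["labels"] = [ids.copy() for ids in result["input_ids"]]
    return result

tokenized_train_dataset = dataset_split["train"].map(
    generate_and_tokenize_batch,
    batched=True,   # encode many rows per tokenizer call
    num_proc=4,     # optional: parallelize across CPU processes
)
tokenized_val_dataset = dataset_split["test"].map(
    generate_and_tokenize_batch,
    batched=True,
    num_proc=4,
)

If it's still slow after that, time formatting_func on its own; also note that padding="max_length" pads every example to 1026 tokens up front, work that a data collator could instead do per batch.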