Why do we need to tokenize `file_id`?
macabdul9 opened this issue · 1 comment
macabdul9 commented
Here:

```python
# record the id of the sample as token ids
batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids
```
In the data preparation for pseudo-labelling:

```python
def prepare_dataset(batch):
    # process audio
    sample = batch[audio_column_name]
    inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
    # process audio length
    batch[model_input_name] = inputs.get(model_input_name)[0]
    # process targets
    input_str = batch[text_column_name]
    batch["labels"] = tokenizer(input_str, max_length=max_label_length, truncation=True).input_ids
    # record the id of the sample as token ids
    batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids
    return batch
```
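One plausible motivation for this pattern: a data collator pads and stacks lists of integer token ids into tensors, but it cannot do that with raw strings, so encoding the string id as token ids lets it travel through the batch alongside the labels and be decoded back afterwards. Below is a minimal, self-contained sketch of that round-trip idea using a toy character-level codec (the `encode_id`/`decode_id`/`pad_batch` helpers and the `-100` pad value are illustrative stand-ins, not the actual tokenizer or collator from the script):

```python
# Sketch: encode a string id as integer "token ids" so it can be padded and
# stacked like the other batch tensors, then decode it back after inference.

def encode_id(file_id: str) -> list[int]:
    """Toy 'tokenization': map each character to its code point."""
    return [ord(c) for c in file_id]

def decode_id(token_ids: list[int]) -> str:
    """Invert encode_id, dropping any padding values (-100)."""
    return "".join(chr(t) for t in token_ids if t != -100)

def pad_batch(ids_batch: list[list[int]], pad_value: int = -100) -> list[list[int]]:
    """Right-pad every sequence to equal length so the batch could be stacked."""
    max_len = max(len(ids) for ids in ids_batch)
    return [ids + [pad_value] * (max_len - len(ids)) for ids in ids_batch]

# Two ids of different lengths survive padding and decode back unchanged.
batch_ids = [encode_id("common_voice_en_001"), encode_id("clip_42")]
padded = pad_batch(batch_ids)
recovered = [decode_id(ids) for ids in padded]
```

With a real tokenizer the same round-trip would use `tokenizer(...).input_ids` and `tokenizer.batch_decode(...)` instead of these toy helpers.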
sanchit-gandhi commented
Fixed in #101!