Why do we need to tokenize `file_id`?
macabdul9 opened this issue · 1 comment
macabdul9 commented
Here:

```python
# record the id of the sample as token ids
batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids
```
In the data preparation for pseudo-labelling:

```python
def prepare_dataset(batch):
    # process audio
    sample = batch[audio_column_name]
    inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
    # process audio length
    batch[model_input_name] = inputs.get(model_input_name)[0]
    # process targets
    input_str = batch[text_column_name]
    batch["labels"] = tokenizer(input_str, max_length=max_label_length, truncation=True).input_ids
    # record the id of the sample as token ids
    batch["file_id"] = tokenizer(batch[id_column_name], add_special_tokens=False).input_ids
    return batch
```
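One plausible motivation for this pattern: a data collator pads and stacks lists of integer token ids into tensors, but it cannot do that with raw strings, so encoding the string id as token ids lets it travel through the batch alongside the labels and be decoded back afterwards. Below is a minimal, self-contained sketch of that round-trip idea using a toy character-level codec (the `encode_id`/`decode_id`/`pad_batch` helpers and the `-100` pad value are illustrative stand-ins, not the actual tokenizer or collator from the script):

```python
# Sketch: encode a string id as integer "token ids" so it can be padded and
# stacked like the other batch tensors, then decode it back after inference.

def encode_id(file_id: str) -> list[int]:
    """Toy 'tokenization': map each character to its code point."""
    return [ord(c) for c in file_id]

def decode_id(token_ids: list[int]) -> str:
    """Invert encode_id, dropping any padding values (-100)."""
    return "".join(chr(t) for t in token_ids if t != -100)

def pad_batch(ids_batch: list[list[int]], pad_value: int = -100) -> list[list[int]]:
    """Right-pad every sequence to equal length so the batch could be stacked."""
    max_len = max(len(ids) for ids in ids_batch)
    return [ids + [pad_value] * (max_len - len(ids)) for ids in ids_batch]

# Two ids of different lengths survive padding and decode back unchanged.
batch_ids = [encode_id("common_voice_en_001"), encode_id("clip_42")]
padded = pad_batch(batch_ids)
recovered = [decode_id(ids) for ids in padded]
```

With a real tokenizer the same round-trip would use `tokenizer(...).input_ids` and `tokenizer.batch_decode(...)` instead of these toy helpers.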
sanchit-gandhi commented
Fixed in #101!