Using [text,labels] instead of just [text] in Datasets
imthebilliejoe opened this issue · 1 comment
Hi, I'd like to start with a big thanks for your amazing work. I would like to use your library to fine-tune GPT-Neo on a Text2Text task instead of TextGeneration. I'm trying to adapt your script run_clm.py to handle a dataset with a [text, label] structure rather than just [text].
So I'm now trying to create a train_dataset built from these two new tokenized datasets, created like this:
```python
def tokenize_function_text(examples):
    return tokenizer(examples["text"])

tokenized_datasets_text = datasets.map(
    tokenize_function_text,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

def tokenize_function_label(examples):
    return tokenizer(examples["label"])

tokenized_datasets_label = datasets.map(
    tokenize_function_label,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)
```
But I'm really struggling to merge them into a single train_dataset object that I can pass to the Trainer. Do you have any tips or suggestions?
thank you very much
In case someone is trying to do the same: I solved the issue by adapting the code from this article:
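For reference, here is a minimal sketch of the usual approach for causal LMs; this illustrates the general technique, not necessarily the article's exact code. Since GPT-Neo is a decoder-only model, there is no separate encoder input: a common pattern is to tokenize text and label together in a single map call, concatenate them into one sequence, and mask the prompt tokens in `labels` with -100 so the loss is computed only on the target. It reuses `datasets`, `tokenizer`, `column_names`, and `data_args` from the run_clm.py snippets above.

```python
def tokenize_function(examples):
    # Build input_ids = prompt + target, and labels with the prompt masked out.
    input_ids, labels = [], []
    for text, label in zip(examples["text"], examples["label"]):
        text_ids = tokenizer(text)["input_ids"]
        # End the target with EOS so the model learns where to stop.
        label_ids = tokenizer(label)["input_ids"] + [tokenizer.eos_token_id]
        input_ids.append(text_ids + label_ids)
        # -100 is ignored by the cross-entropy loss in transformers.
        labels.append([-100] * len(text_ids) + label_ids)
    return {"input_ids": input_ids, "labels": labels}

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)
train_dataset = tokenized_datasets["train"]
```

Because the examples now have different lengths, you would also replace the `default_data_collator` that run_clm.py passes to the Trainer with one that pads, e.g. `transformers.DataCollatorForSeq2Seq(tokenizer, model=model)`, which pads `input_ids` with the pad token and `labels` with -100 (GPT-Neo has no pad token by default, so set `tokenizer.pad_token = tokenizer.eos_token` first).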