Using [text,labels] instead of just [text] in Datasets
imthebilliejoe opened this issue · 1 comment
Hi, I'd like to start with a big thanks for your amazing work. I would like to use your library to fine-tune GPT-Neo on a Text2Text task instead of TextGeneration. I'm trying to adapt your script run_clm.py to handle a dataset with a [text, label] structure rather than just [text].
So I'm now trying to create a train_dataset built from these two new tokenized datasets, created like this:
```python
def tokenize_function_text(examples):
    return tokenizer(examples["text"])

tokenized_datasets_text = datasets.map(
    tokenize_function_text,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)

def tokenize_function_label(examples):
    return tokenizer(examples["label"])

tokenized_datasets_label = datasets.map(
    tokenize_function_label,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)
```
But I'm really struggling to merge them into a single train_dataset object that I can pass to the Trainer. Do you have any tips or suggestions?
thank you very much
In case someone is trying to do the same: I solved the issue by adapting the code from this article:
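For reference, here is a minimal sketch of the usual approach for causal LMs; this illustrates the general technique, not necessarily the article's exact code. Since GPT-Neo is a decoder-only model, there is no separate encoder input: a common pattern is to tokenize text and label together in a single map call, concatenate them into one sequence, and mask the prompt tokens in `labels` with -100 so the loss is computed only on the target. It reuses `datasets`, `tokenizer`, `column_names`, and `data_args` from the run_clm.py snippets above.

```python
def tokenize_function(examples):
    # Build input_ids = prompt + target, and labels with the prompt masked out.
    input_ids, labels = [], []
    for text, label in zip(examples["text"], examples["label"]):
        text_ids = tokenizer(text)["input_ids"]
        # End the target with EOS so the model learns where to stop.
        label_ids = tokenizer(label)["input_ids"] + [tokenizer.eos_token_id]
        input_ids.append(text_ids + label_ids)
        # -100 is ignored by the cross-entropy loss in transformers.
        labels.append([-100] * len(text_ids) + label_ids)
    return {"input_ids": input_ids, "labels": labels}

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
)
train_dataset = tokenized_datasets["train"]
```

Because the examples now have different lengths, you would also replace the `default_data_collator` that run_clm.py passes to the Trainer with one that pads, e.g. `transformers.DataCollatorForSeq2Seq(tokenizer, model=model)`, which pads `input_ids` with the pad token and `labels` with -100 (GPT-Neo has no pad token by default, so set `tokenizer.pad_token = tokenizer.eos_token` first).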