Xirider/finetune-gpt2xl

Multiple entries csv

kikirizki opened this issue · 1 comment

Hi, I come from Upwork. Is this what you are looking for: splitting the dataset into a multi-row CSV?


```python
import csv

start_token = "|<start of text>|"
end_token = "|<end of text>|"

# Split the training text into one entry per start/end-token pair
with open('train.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[0:len(all_text) - 1]  # drop the trailing empty split
with open('train.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})


# Repeat the conversion for the validation split
with open('validation.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[0:len(all_text) - 1]  # drop the trailing empty split
with open('validation.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})

print("created train.csv and validation.csv > files")```

Yes, that looks correct, if you want the model to treat each segment (the text you delimited with the start and end tokens) in the original text file as a separate document. That way the model will generate text similar to your examples, from the start token to the end token.
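
For context, a minimal sketch of how such a CSV is typically consumed downstream (assuming the Hugging Face `datasets` library; the exact loading code used for training may differ), where each row under the `text` column becomes one training example:

```python
from datasets import load_dataset

# Each CSV row is loaded as a separate document/example.
dataset = load_dataset('csv', data_files={'train': 'train.csv',
                                          'validation': 'validation.csv'})
print(dataset['train'].num_rows)          # number of documents
print(dataset['train'][0]['text'][:80])   # preview of the first document
```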