Xirider/finetune-gpt2xl

Multiple entries csv

kikirizki opened this issue · 1 comment

Hi, I come from Upwork. Is this what you are looking for: splitting the dataset into a multi-row CSV?


```python
import csv

start_token = "|<start of text>|"
end_token = "|<end of text>|"

# Split the training text into one entry per start/end-token pair
with open('train.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[0:len(all_text) - 1]  # drop the trailing empty split
with open('train.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})


# Repeat the conversion for the validation split
with open('validation.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[0:len(all_text) - 1]  # drop the trailing empty split
with open('validation.csv', mode='w', encoding='utf-8') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})

print("created train.csv and validation.csv > files")```

Yes, that looks correct, if you want the model to treat each segment (the text you delimited with the start and end tokens) in the original text file as a separate document. That way the model will generate text similar to your examples, from the start token to the end token.
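
For context, a minimal sketch of how such a CSV is typically consumed downstream (assuming the Hugging Face `datasets` library; the exact loading code used for training may differ), where each row under the `text` column becomes one training example:

```python
from datasets import load_dataset

# Each CSV row is loaded as a separate document/example.
dataset = load_dataset('csv', data_files={'train': 'train.csv',
                                          'validation': 'validation.csv'})
print(dataset['train'].num_rows)          # number of documents
print(dataset['train'][0]['text'][:80])   # preview of the first document
```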