Multiple entries csv
kikirizki opened this issue · 1 comments
kikirizki commented
Hi, I come from Upwork. Is this what you are looking for: splitting the dataset into a multi-row CSV, one entry per row?
```python
import csv

start_token = "|<start of text>|"
end_token = "|<end of text>|"

# Split train.txt into one entry per start/end token pair.
with open('train.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    # Drop the trailing piece left after the final end token.
    all_text = all_text[:-1]

# newline='' lets the csv module handle line endings itself.
with open('train.csv', mode='w', encoding='utf-8', newline='') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        # One CSV row per entry.
        writer.writerow({'text': row})

# Same steps for the validation split.
with open('validation.txt', encoding='utf-8') as txtfile:
    all_text = txtfile.read().replace(start_token, "").split(end_token)
    all_text = all_text[:-1]

with open('validation.csv', mode='w', encoding='utf-8', newline='') as csv_file:
    fieldnames = ['text']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in all_text:
        writer.writerow({'text': row})

print("created train.csv and validation.csv files")
```
Xirider commented
Yes, that looks correct if you want the model to treat each entry in the original text file (the text you delimited with the start and end tokens) as a separate document. This way the model will generate text similar to your examples, from the start token to the end token.
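
To check that each start/end-delimited entry really ends up as its own CSV row, a small round-trip sketch like the one below may help. It is only an illustration under a few assumptions: `split_documents` and `to_csv` are hypothetical helper names, the sample text is made up, and unlike the script above it strips surrounding whitespace and skips empty pieces instead of discarding the last piece.

```python
import csv
import io

START_TOKEN = "|<start of text>|"
END_TOKEN = "|<end of text>|"


def split_documents(raw_text):
    """Return one string per start/end-delimited entry (hypothetical helper)."""
    pieces = raw_text.replace(START_TOKEN, "").split(END_TOKEN)
    # Skip whitespace-only pieces, such as whatever trails the final end token.
    return [p.strip() for p in pieces if p.strip()]


def to_csv(documents):
    """Serialize the entries to CSV text with a single 'text' column."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["text"])
    writer.writeheader()
    for doc in documents:
        writer.writerow({"text": doc})
    return buffer.getvalue()


# Made-up sample: two entries, the first spanning two lines.
sample = (
    "|<start of text>|first example\nwith two lines|<end of text>|\n"
    "|<start of text>|second example|<end of text>|\n"
)

docs = split_documents(sample)
csv_text = to_csv(docs)

# Read the CSV back; multi-line entries are quoted and stay in one row.
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(len(docs), len(rows))
```

If the two printed counts match, every entry became exactly one row, including entries that contain line breaks of their own.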