Custom Text Dataset

Question

moodhiaj opened this issue 2 years ago · 6 comments

I am trying to work with my own data in a .txt file in which the source and target sentences are separated by a tab. The problem is that I am not able to use Field with it, and this has caused many issues in my code.
Could anyone please help me use my data with Field?

Answer 1 · 2022-03-29T02:53:50.000Z

I also want to ask this question!

Answer 2 · 2022-03-29T19:56:51.000Z

If anyone is looking for the answer, here is what I did and what worked for me:

```python
import torch
# torchtext.legacy is only available in torchtext 0.9 to 0.11
from torchtext.legacy.data import BucketIterator, Field, TabularDataset


def tokenize(text):
    # Split each sentence on single spaces
    return text.split(' ')


SRC = Field(tokenize=tokenize)
TRG = Field(tokenize=tokenize)

# Map the CSV column headers to the attribute names used on each example
fields = {'Source': ('src', SRC), 'Target': ('trg', TRG)}

train_data, valid_data, test_data = TabularDataset.splits(
    path='',
    train='My_train_Set.csv',
    validation='My_Validation_Set.csv',
    test='My_test_set.csv',
    format='csv',
    fields=fields)

# Build both vocabularies from the training set; drop tokens seen fewer than twice
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Batch examples of similar source length together to reduce padding
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device)
```
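
To make the effect of `min_freq=2` concrete: tokens that occur fewer than two times in the training data are left out of the vocabulary and are then treated as unknown. The following is a rough, pure-Python illustration of that filtering idea, not torchtext's implementation. The names `build_vocab_sketch` and `encode`, the toy sentences, and the ordering of the indices are assumptions made for this example.

```python
from collections import Counter


def build_vocab_sketch(tokenized_sentences, min_freq=1, specials=("<unk>", "<pad>")):
    """Hypothetical helper: map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically (an assumption)
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    return {tok: index for index, tok in enumerate(itos)}


def encode(tokens, stoi, unk="<unk>"):
    """Replace tokens by their indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]


sentences = [s.split(' ') for s in ["the cat sat", "the dog sat", "a cat ran"]]
stoi = build_vocab_sketch(sentences, min_freq=2)

# "dog" and "ran" each occur once, so both fall back to index 0 (<unk>)
print(encode("the dog ran".split(' '), stoi))  # [4, 0, 0]
```

With small datasets, a lower `min_freq` keeps more rare words in the vocabulary at the cost of a larger embedding table.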

Answer 3 · 2022-03-30T00:34:01.000Z

Thank you very much. I would also like to know how to store the source and target sentences in a CSV file. They are paired sentences.

Answer 4 · 2022-03-30T10:48:21.000Z

I don't know how your data is structured, but mine was originally in Excel files, so I didn't have any problems converting it to CSV.

Answer 5 · 2022-04-01T01:26:44.000Z

Can you tell me how to put my own data into CSV format?
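
For paired sentences that start out in a tab-separated text file, as in the original question, one possible way to produce a CSV file is Python's standard `csv` module. This is a minimal sketch and not something posted in the thread: the function name `tsv_to_csv` and the input file name in the usage comment are hypothetical, and the `Source`/`Target` header names are chosen only to match the `fields` dictionary in Answer 2.

```python
import csv


def tsv_to_csv(tsv_path, csv_path, header=("Source", "Target")):
    """Convert 'source<TAB>target' lines into a two-column CSV with a header row.

    Returns the number of sentence pairs written.
    """
    pairs_written = 0
    with open(tsv_path, encoding="utf-8") as src_file, \
         open(csv_path, "w", encoding="utf-8", newline="") as dst_file:
        # csv.writer quotes fields that contain commas or quotation marks
        writer = csv.writer(dst_file)
        writer.writerow(header)
        for line in src_file:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                continue  # skip lines that are not exactly one source/target pair
            writer.writerow(parts)
            pairs_written += 1
    return pairs_written


# Hypothetical usage:
# tsv_to_csv("my_pairs.txt", "My_train_Set.csv")
```

Writing through `csv.writer` rather than joining the two sentences with a comma matters here, because sentences often contain commas themselves and would otherwise be split into extra columns when the file is read back.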

Answer 6 · 2022-06-30T03:12:30.000Z

Thanks for this great solution.
Using a model with a custom dataset is always a tedious and irritating problem.