bentrevett/pytorch-seq2seq

Custom Text Dataset

moodhiaj opened this issue · 6 comments

I am trying to work with my own data in a .txt file, where the source and target sentences are separated by a tab. The problem is that I'm not able to use Field, and this has caused many issues in the code for me.
Can anyone help me use my data with Field?

I also want to ask this question!

If anyone is looking for the answer, here is what I did, and it worked for me:
```python
import torch
# torchtext.legacy is available in torchtext 0.9 to 0.11
from torchtext.legacy.data import Field, BucketIterator, TabularDataset

# Simple whitespace tokenizer
tokenize = lambda x: x.split(' ')

SRC = Field(tokenize=tokenize)
TRG = Field(tokenize=tokenize)

# Keys are the CSV header names; values are (batch attribute name, Field)
fields = {'Source': ('src', SRC), 'Target': ('trg', TRG)}

train_data, valid_data, test_data = TabularDataset.splits(
    path='',
    train='My_train_Set.csv',
    test='My_test_set.csv',
    validation='My_Validation_Set.csv',
    format='csv',
    fields=fields)

# Build the vocabularies from the training set only
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device)
```

I don't know how your data is structured, but mine was originally in Excel files, so I had no trouble converting them to CSV.

Can you tell me how to put my own data into CSV format?
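One possible way to get such a CSV from a tab-separated text file like the one described in the original question is sketched below, using only Python's standard `csv` module. The function name `tab_txt_to_csv`, the file paths, and the choice to skip lines that do not contain exactly one tab are assumptions made for illustration. The `Source` and `Target` header names are chosen to match the keys of the `fields` dictionary in the answer above.

```python
import csv


def tab_txt_to_csv(txt_path, csv_path, header=("Source", "Target")):
    """Convert a tab-separated text file into a CSV file with a header row.

    Hypothetical helper: each input line is expected to look like
    "source sentence<TAB>target sentence". Returns the number of pairs written.
    """
    rows_written = 0
    with open(txt_path, encoding="utf-8") as fin, \
         open(csv_path, "w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout)  # quotes fields that contain commas
        writer.writerow(header)
        for line in fin:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                continue  # assumption: silently drop malformed lines
            writer.writerow(parts)
            rows_written += 1
    return rows_written
```

For example, `tab_txt_to_csv("my_train_set.txt", "My_train_Set.csv")` would be called once per split (the `.txt` name here is a placeholder). The legacy `TabularDataset` also accepts `format='tsv'`, so it may be possible to skip the conversion entirely if the text file is given a header line with the same column names.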

Thanks for this great solution.
Using a model with a custom dataset is always a tedious and irritating problem.