This repository contains the code for an example text dataloader for sentence similarity training.
The dataset should contain at least three columns. Two of them hold the sentences to compare, and the third holds the similarity score for each sentence pair, which serves as the training target.
Load these columns into memory, prepare a SimpleTokenizer object, and pass them to the TrainDataLoader class. You can then iterate over the dataset in processed batches.
dl = TrainDataLoader(
    sentences1,
    sentences2,
    scores,
    tokenizer=prepared_tokenizer,
    batch_size=16,
    shuffle=True,
)
for batch in dl:
    # do something with the batch of data
    pass
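The sentences1, sentences2, and scores inputs used above can come from any source. As a minimal sketch, assuming the dataset is a CSV file with a header row and the hypothetical column names sentence1, sentence2, and score (none of these names come from this repository), the three columns could be loaded into plain Python lists with the standard csv module:

```python
import csv


def load_columns(path, col1="sentence1", col2="sentence2", score_col="score"):
    """Read two sentence columns and one score column from a CSV file.

    The default column names are assumptions for illustration only;
    pass the names your dataset actually uses.
    """
    sentences1, sentences2, scores = [], [], []
    with open(path, newline="", encoding="utf-8") as f:
        # DictReader uses the header row to map column names to values
        for row in csv.DictReader(f):
            sentences1.append(row[col1])
            sentences2.append(row[col2])
            # Scores are stored as text in the CSV, so convert to float
            scores.append(float(row[score_col]))
    return sentences1, sentences2, scores
```

Calling load_columns with the path to your file returns the three lists in the order TrainDataLoader expects them in the example above. Any extra columns in the file are ignored.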
To start training, run:
python run_training.py
To run the tests:
python -m pytest