PyTorch reimplementation of the Quick-Thoughts paper: https://arxiv.org/pdf/1803.02893.pdf. I've refactored the original PyTorch implementation a bit: added a `Learner` class for simpler model creation, and support for the SentencePiece tokenizer.
- download the data and clean it (lowercase, add a space between punctuation)
- change the path to the training data file in the conf (`conf/conf_dev.py`, key `data_path`)
- if you want to evaluate during training on some downstream task (like classification), do the following:
  - download the downstream dataset and clean it like the training data
  - change the script in `src_custom/eval.py`, especially the `load_encode_data` function, so that it knows how to load your dataset. You have to give it a name - add the name of your dataset to your conf (`conf/conf_dev.py`) under the key `downstream_eval_datasets` (you can add multiple downstream tasks, just make sure you add them to the conf and that `load_encode_data` knows how to read in the data). During training, downstream performance is saved/displayed
- if you want to use the SentencePiece tokenizer, add the path to the model in the conf file
- train the model (example script `train_dev.py`, or `train_dev_from_checkpoint.py` if you want to resume training from a checkpoint)
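The cleaning step above (lowercase, space between punctuation) could be sketched like this; the exact regex and function name are assumptions for illustration, not the repo's actual preprocessing code:

```python
import re


def clean_text(text: str) -> str:
    """Lowercase text and put spaces around punctuation (illustrative sketch)."""
    text = text.lower()
    # insert spaces around common punctuation marks (assumed set of marks)
    text = re.sub(r"([.,!?;:()\"'])", r" \1 ", text)
    # collapse repeated whitespace introduced by the substitution
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Hello, World!"))  # -> hello , world !
```

Apply the same cleaning to any downstream dataset so its tokenization matches the training data.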
If you want to use the model, see the script `get_vectors_dev.py` as an example.
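For reference, a hypothetical `conf/conf_dev.py` combining the keys mentioned above might look like this. All values are placeholders; only the key names `data_path` and `downstream_eval_datasets` come from this README, while the SentencePiece key name is an assumption - check the actual conf file in the repo:

```python
# conf/conf_dev.py (illustrative sketch - values are placeholders)

# path to the cleaned training data file
data_path = "data/train_cleaned.txt"

# downstream datasets evaluated during training;
# load_encode_data in src_custom/eval.py must know how to load each name
downstream_eval_datasets = ["my_classification_task"]

# optional: path to a trained SentencePiece model
# (key name is an assumption, not taken from this README)
sentencepiece_model_path = "models/sp.model"
```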