pip install -r requirements.txt
- Download the Bert-base-uncased pretrained weights from here, or see a list of Bert model weight download links here
- Download the corresponding vocabulary here. Note that the downloaded tar also contains the TensorFlow pretrained model weights, but we only need the file
vocab.txt
- Put the pretrained model file, the config json downloaded in the first step, and the vocabulary in the
models/pytorch-bert-uncased
directory.
- Download the imdb dataset here and put it in
data/imdb
- Download the bdek dataset (i.e. the Amazon reviews dataset) here and put it in
data/bdek
- Run
sh train_script.sh
in a shell.
- Open train_script.sh to see the commands for the different tasks.
- The entry point of the program is
train.py
- Files like
trainers.py, evaluators.py, model.py, dataset.py
, etc., define classes for the corresponding components of the program, and are imported into train.py through the xx_factory at the bottom of each file.
- Developers should add new classes to these files to implement new features instead of editing the existing ones.
- Several command line args affect which module is chosen from the factories; see the code for details.
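
The factory layout described above can be sketched roughly as follows. This is only an illustration of the pattern, not the repository's actual code: the class names (BaseTrainer, MyNewTrainer), the trainer_factory function, the _TRAINERS registry, and the --trainer and --lr flags are all hypothetical. Check train.py and the factories themselves for the real names and arguments.

```python
import argparse


class BaseTrainer:
    """Hypothetical existing trainer; stands in for a class in trainers.py."""

    def __init__(self, lr):
        self.lr = lr

    def describe(self):
        return f"{type(self).__name__}(lr={self.lr})"


class MyNewTrainer(BaseTrainer):
    """A new feature is added as a new class; BaseTrainer is left untouched."""


# Hypothetical registry mapping a command line value to a class.
_TRAINERS = {
    "base": BaseTrainer,
    "my_new": MyNewTrainer,
}


def trainer_factory(name, **kwargs):
    """Hypothetical factory, as might sit at the bottom of trainers.py."""
    try:
        cls = _TRAINERS[name]
    except KeyError:
        raise ValueError(
            f"unknown trainer {name!r}; choose from {sorted(_TRAINERS)}"
        ) from None
    return cls(**kwargs)


def parse_args(argv=None):
    """Parse the (hypothetical) flags that select and configure a trainer."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--trainer", default="base", choices=sorted(_TRAINERS))
    parser.add_argument("--lr", type=float, default=2e-5)
    return parser.parse_args(argv)


def main(argv=None):
    """Stand-in for train.py: pick a component via its factory."""
    args = parse_args(argv)
    return trainer_factory(args.trainer, lr=args.lr)


# Example: select the new class from the command line without editing old code.
trainer = main(["--trainer", "my_new"])
print(trainer.describe())
```

With this shape, adding a feature means writing one new class and registering it in the factory, so existing classes and the commands in train_script.sh that rely on them keep working.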