This is our in-house implementation of the Transformer model from "Attention Is All You Need".
Requirements:
- Python 3.5+
- PyTorch 0.3.1
- tqdm
- tensorboardX
See the help message of ./data/build_dictionary.py for how to build the vocabulary.
The vocabulary is stored in JSON format.
We highly recommend not limiting the number of words when building the vocabulary; control the vocabulary size through the configuration files at training time instead.
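For illustration, a JSON vocabulary can be inspected directly. This is only a minimal sketch: the file name and the exact JSON layout (a flat token-to-ID mapping is assumed here) depend on build_dictionary.py, so check its help message for the actual format.

```python
import json

# Hypothetical file name and layout: a flat {token: id} mapping is
# assumed; build_dictionary.py's help describes the actual format.
with open("vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

print("vocabulary size:", len(vocab))
```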
See the examples in the ./configs folder. You can reproduce our Chinese-to-English baseline by directly using those configuration files:
- loss_schedule_config.yaml uses the validation loss as the learning-rate scheduling criterion.
- noam_schedule_config.yaml uses the "noam" scheduling method from Google's paper.
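For reference, the "noam" rule warms the learning rate up linearly for a number of steps and then decays it with the inverse square root of the step count. A minimal sketch using the paper's formula (the default d_model and warmup values below are the paper's, not necessarily this repo's):

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from "Attention Is All You Need":
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step == 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```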
See the training script ./scripts/train.sh.
See the translation script ./scripts/translation.sh.
| Decay Method | Use Bucket | MT03 (dev) | MT04 | MT05 | MT06 |
|---|---|---|---|---|---|
| Loss | TRUE | 40.22 | 41.61 | 37.17 | 35.39 |
| Loss | FALSE | 41.48 | 42.31 | 39.43 | 36.85 |
| Noam | TRUE | 40.50 | 41.90 | 38.19 | 36.12 |
| Noam | FALSE | 41.80 | 42.52 | 39.05 | 36.90 |
- What is `shard_size`?

`shard_size` is a trick borrowed from OpenNMT-py that lets a large model run under memory-limited conditions by splitting the loss computation into shards. For example, you can run the WMT17 EN-DE task on an 8GB GTX 1080 card with a batch size of 64 by setting `shard_size=10` (see the sketch below).
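A minimal sketch of what this trick looks like (hypothetical function and argument names, not necessarily this repo's actual API): the decoder output is detached, the loss is computed and backpropagated shard by shard over the time dimension so the full `(batch * time, vocab)` logit tensor never materializes at once, and the accumulated gradient is then pushed through the rest of the network in a single backward pass.

```python
import torch.nn.functional as F

def sharded_loss_backward(generator, dec_out, target, shard_size):
    """Hypothetical helper in the spirit of OpenNMT-py's shard trick.

    generator: output projection module mapping hidden states to logits
    dec_out:   decoder states, shape (batch, time, hidden), requires grad
    target:    gold token ids, shape (batch, time)
    """
    # Detach so each shard's backward stops at the decoder states;
    # their gradient is accumulated manually in .grad.
    dec_out_detached = dec_out.detach()
    dec_out_detached.requires_grad_(True)

    total_loss = 0.0
    for start in range(0, dec_out.size(1), shard_size):
        sl = slice(start, start + shard_size)
        logits = generator(dec_out_detached[:, sl])  # small logits tensor
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            target[:, sl].reshape(-1),
            reduction="sum",
        )
        loss.backward()  # frees this shard's graph immediately
        total_loss += loss.item()

    # One backward pass through the encoder/decoder with the
    # gradient accumulated over all shards.
    dec_out.backward(dec_out_detached.grad)
    return total_loss
```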
- What is `use_bucket`?

When bucketing is enabled, parallel sentences are partially sorted according to the length of the target sentence, so each batch contains sentences of similar length and wastes less computation on padding. Setting this option to true brings a considerable training speed-up, but at the cost of a small drop in translation quality (compare the TRUE and FALSE rows in the table above; a sketch of the idea follows this item).
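A minimal sketch of the bucketing idea (hypothetical helper, not necessarily this repo's actual API): sorting happens within a bounded buffer rather than globally, so batches are length-homogeneous while the overall data order stays partially random.

```python
import random

def bucketed_batches(pairs, batch_size, buffer_size=2000):
    """Hypothetical sketch: `pairs` is a list of (source, target) token
    lists. Each buffer of `buffer_size` pairs is sorted by target
    length, cut into batches, and the batches are shuffled so training
    does not see strictly increasing sentence lengths."""
    for i in range(0, len(pairs), buffer_size):
        buf = sorted(pairs[i:i + buffer_size], key=lambda p: len(p[1]))
        batches = [buf[j:j + batch_size] for j in range(0, len(buf), batch_size)]
        random.shuffle(batches)
        yield from batches
```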
- This code borrows heavily from OpenNMT/OpenNMT-py and has been simplified for research use.