ictnlp/ITST

Can you provide the pre-processing script and more details?


I'd like to replicate the results of the “de2en transformer-base t2t” experiment, but I can't reproduce the same numbers.
Data: train: WMT15, valid: newstest2013, test: newstest2015

I trained the model on 2 GPUs and got the following result:

[screenshot of my evaluation results]

After 92 epochs, with threshold 0.8, my best BLEU was 30.67, which is well below the reported 32.00. The gap is the same with SacreBLEU.
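For reference, I score the output roughly like this (the file names are placeholders for my detruecased/detokenized hypothesis and the corresponding reference):

```bash
# hyp.detok.en is a placeholder for my detruecased + detokenized output
sacrebleu -t wmt15 -l de-en -i hyp.detok.en

# tokenized BLEU with the Moses script, against the tokenized reference
perl mosesdecoder/scripts/generic/multi-bleu.perl ref.tok.en < hyp.tok.en
```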

I used the code you provided, and the relevant parameters were set in accordance with the paper.
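Roughly, my training launch looked like the sketch below (paths are placeholders; I have left out the ITST-specific `--user-dir`/`--arch`/`--criterion` flags here, and the remaining values are just the usual fairseq transformer-base settings, so treat this as a sketch rather than my exact command):

```bash
# data-bin/wmt15.de-en is a placeholder for my binarized data directory
fairseq-train data-bin/wmt15.de-en \
    --source-lang de --target-lang en \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.1 --weight-decay 0.0001 --label-smoothing 0.1 \
    --max-tokens 8192 \
    --save-dir checkpoints/de2en_itst
```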

I suspect the gap comes from a difference in data preprocessing, so could you provide the pre-processing script?
What other reasons might have led to this result?
These are my pre-processing steps:
```bash
src=de
tgt=en

SCRIPTS=/mosesdecoder/scripts
TOKENIZER=${SCRIPTS}/tokenizer/tokenizer.perl
DETOKENIZER=${SCRIPTS}/tokenizer/detokenizer.perl
LC=${SCRIPTS}/tokenizer/lowercase.perl
TRAIN_TC=${SCRIPTS}/recaser/train-truecaser.perl
TC=${SCRIPTS}/recaser/truecase.perl
DETC=${SCRIPTS}/recaser/detruecase.perl
NORM_PUNC=${SCRIPTS}/tokenizer/normalize-punctuation.perl
CLEAN=${SCRIPTS}/training/clean-corpus-n.perl
BPEROOT=/subword_nmt/subword_nmt/
MULTI_BLEU=${SCRIPTS}/generic/multi-bleu.perl
MTEVAL_V14=${SCRIPTS}/generic/mteval-v14.pl

data_dir=/data/wmt15/
model_dir=/data/wmt15/model
tool_dir=/translate/tools

echo "标点符号的标准化..."
perl ${NORM_PUNC} -l ${src} < ${data_dir}/train.${src} > ${data_dir}/norm.${src}
perl ${NORM_PUNC} -l ${tgt} < ${data_dir}/train.${tgt} > ${data_dir}/norm.${tgt}

echo "双语文件进行tokenize处理..."
${TOKENIZER} -l ${src} < ${data_dir}/norm.${src} > ${data_dir}/norm.tok.${src}
${TOKENIZER} -l ${tgt} < ${data_dir}/norm.${tgt} > ${data_dir}/norm.tok.${tgt}

echo "进行大小写转换处理..."
#${TRAIN_TC} --model ${model_dir}/truecase-model.${src} --corpus ${data_dir}/norm.tok.${src}
${TC} --model ${model_dir}/truecase-model.${src} < ${data_dir}/norm.tok.${src} > ${data_dir}/norm.tok.true.${src}
#${TRAIN_TC} --model ${model_dir}/truecase-model.${tgt} --corpus ${data_dir}/norm.tok.${tgt}
${TC} --model ${model_dir}/truecase-model.${tgt} < ${data_dir}/norm.tok.${tgt} > ${data_dir}/norm.tok.true.${tgt}
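# NOTE: the train-truecaser calls above are commented out, so
# truecase-model.${src} and truecase-model.${tgt} must already exist in
# ${model_dir}; the commented lines show how they were trained on the
# tokenized training data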

echo "双语文件(norm.tok.true.src, norm.tok.true.tgt)进行子词处理..."
python ${BPEROOT}/learn_joint_bpe_and_vocab.py --input ${data_dir}/norm.tok.true.${src} -s 32000 -o ${model_dir}/bpecode.${src} --write-vocabulary ${model_dir}/voc.${src}
python ${BPEROOT}/apply_bpe.py -c ${model_dir}/bpecode.${src} --vocabulary ${model_dir}/voc.${src} < ${data_dir}/norm.tok.true.${src} > ${data_dir}/norm.tok.true.bpe.${src}
python ${BPEROOT}/learn_joint_bpe_and_vocab.py --input ${data_dir}/norm.tok.true.${tgt} -s 32000 -o ${model_dir}/bpecode.${tgt} --write-vocabulary ${model_dir}/voc.${tgt}
python ${BPEROOT}/apply_bpe.py -c ${model_dir}/bpecode.${tgt} --vocabulary ${model_dir}/voc.${tgt} < ${data_dir}/norm.tok.true.${tgt} > ${data_dir}/norm.tok.true.bpe.${tgt}
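# NOTE: I learn a separate BPE code per language here. A joint code over both
# sides (which may be what the paper used) would instead look like:
# python ${BPEROOT}/learn_joint_bpe_and_vocab.py \
#     --input ${data_dir}/norm.tok.true.${src} ${data_dir}/norm.tok.true.${tgt} \
#     -s 32000 -o ${model_dir}/bpecode.joint \
#     --write-vocabulary ${model_dir}/voc.${src} ${model_dir}/voc.${tgt}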

echo "处理后的双语文件(norm.tok.true.bpe.src, norm.tok.true.bpe.tgt)进行过滤..."
mv ${data_dir}/norm.tok.true.bpe.${src} ${data_dir}/toclean.${src}
mv ${data_dir}/norm.tok.true.bpe.${tgt} ${data_dir}/toclean.${tgt}
${CLEAN} ${data_dir}/toclean ${src} ${tgt} ${data_dir}/clean 1 256
```
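After this, I binarize the data for fairseq roughly as follows (newstest2013 and newstest2015 are run through the same tokenize/truecase/BPE steps; the valid/test prefixes below are placeholders, and whether `--joined-dictionary` should be used is one of the details I am unsure about):

```bash
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref ${data_dir}/clean \
    --validpref ${data_dir}/valid.bpe \
    --testpref ${data_dir}/test.bpe \
    --destdir data-bin/wmt15.de-en \
    --workers 8
```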