Could you share the post-processing script or the post-processed train data for UniSar?
cabisarri opened this issue · 4 comments
cabisarri commented
Could you share the post-processing script or the post-processed train data for UniSar?
SivilTaram commented
Hi @DreamerDeo, could you help on this?
longxudou commented
Hi @cabisarri, thanks for your interest in our work.
The train data follows the same format as the dev data.
Because preprocessing takes a long time and most users adopt UniSar directly for inference, the current code only preprocesses the dev set. However, you can simply modify the following lines to also support the train set, and then retrain the model yourself.
Change the fairseq-preprocess function here:
cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/train.src \
        --outputs {generate_path}/train.bpe.src \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/train.tgt \
        --outputs {generate_path}/train.bpe.tgt \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/dev.src \
        --outputs {generate_path}/dev.bpe.src \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/dev.tgt \
        --outputs {generate_path}/dev.bpe.tgt \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f'fairseq-preprocess --source-lang "src" --target-lang "tgt" \
        --trainpref {generate_path}/train.bpe \
        --validpref {generate_path}/dev.bpe \
        --destdir {generate_path}/bin \
        --workers 2 \
        --srcdict ./BART-large/dict.src.txt \
        --tgtdict ./BART-large/dict.tgt.txt '
subprocess.Popen(
    cmd, universal_newlines=True, shell=True,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
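The four BPE-encoding commands above differ only in the split (train/dev) and the side (src/tgt), so they could also be generated in a loop. The following is only a rough sketch, not the repository's code: the helper name `build_bpe_commands` and the constant `BART_DIR` are hypothetical, while the flags and paths are copied from the snippet above. It only builds the command strings; running them (for example with the repository's `run_command`) is left to the caller.

```python
import shlex

# Directory holding encoder.json and vocab.bpe, as used in the snippet above.
BART_DIR = "./BART-large"


def build_bpe_commands(generate_path, splits=("train", "dev"), sides=("src", "tgt")):
    """Hypothetical helper: one multiprocessing_bpe_encoder command per (split, side)."""
    commands = []
    for split in splits:
        for side in sides:
            args = [
                "python", "-m", "multiprocessing_bpe_encoder",
                "--encoder-json", f"{BART_DIR}/encoder.json",
                "--vocab-bpe", f"{BART_DIR}/vocab.bpe",
                "--inputs", f"{generate_path}/{split}.{side}",
                "--outputs", f"{generate_path}/{split}.bpe.{side}",
                "--workers", "1",
                "--keep-empty",
            ]
            # Quote each argument so paths containing spaces survive the shell.
            commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for command in build_bpe_commands("path/to/generated"):
        print(command)
```

Passing `splits=("dev",)` would reproduce the dev-only behaviour described above.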
cabisarri commented
Thanks a lot for the details. I will give it a try.
SivilTaram commented
Closed since there is no more activity.