microsoft/ContextualSP

Could you share the post-processing script or the post-processed train data for UniSar?

cabisarri opened this issue · 4 comments

Could you share the post-processing script or the post-processed train data for UniSar?

Hi @DreamerDeo, could you help on this?

Hi @cabisarri, thanks for your interest in our work.

The train data follows the same format as the dev data.
Since preprocessing takes a long time and most users adopt UniSar directly for inference, the current code only preprocesses the dev set. However, you can simply modify the following lines to support the train set and then retrain the model yourself.

# for session in ['train', 'dev']:

# for session in ["train", "dev"]:

Then change the fairseq-preprocess step in def running_process(generate_path) as follows:

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/train.src \
        --outputs {generate_path}/train.bpe.src \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/train.tgt \
        --outputs {generate_path}/train.bpe.tgt \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/dev.src \
        --outputs {generate_path}/dev.bpe.src \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/dev.tgt \
        --outputs {generate_path}/dev.bpe.tgt \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f'fairseq-preprocess --source-lang "src" --target-lang "tgt" \
    --trainpref {generate_path}/train.bpe \
    --validpref {generate_path}/dev.bpe \
    --destdir {generate_path}/bin \
    --workers 2 \
    --srcdict ./BART-large/dict.src.txt \
    --tgtdict ./BART-large/dict.tgt.txt '

subprocess.Popen(
    cmd, universal_newlines=True, shell=True,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
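
The four BPE-encoding commands above differ only in the split (train/dev) and side (src/tgt), so they could also be generated in a loop. The following is a minimal sketch, not the repository's code: the helper name build_bpe_commands and its parameters are hypothetical, and it assumes the same ./BART-large files and multiprocessing_bpe_encoder module used in the snippet above. It only builds the commands and does not run them.

```python
import shlex


def build_bpe_commands(generate_path, splits=("train", "dev"), sides=("src", "tgt"),
                       bart_dir="./BART-large", workers=1):
    """Return one BPE-encoder command (as an argv list) per split/side pair.

    Hypothetical helper: it mirrors the flags shown in the snippet above
    and does not execute anything.
    """
    commands = []
    for split in splits:
        for side in sides:
            commands.append([
                "python", "-m", "multiprocessing_bpe_encoder",
                "--encoder-json", f"{bart_dir}/encoder.json",
                "--vocab-bpe", f"{bart_dir}/vocab.bpe",
                "--inputs", f"{generate_path}/{split}.{side}",
                "--outputs", f"{generate_path}/{split}.bpe.{side}",
                "--workers", str(workers),
                "--keep-empty",
            ])
    return commands


if __name__ == "__main__":
    # Print the commands for inspection; "path/to/generated" is a placeholder.
    for argv in build_bpe_commands("path/to/generated"):
        print(shlex.join(argv))
```

Each argv list could then be passed to subprocess.run(argv, check=True), which raises an error if a step fails instead of silently discarding stderr.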

Thanks a lot for the details. I will give it a try.

Closed since there is no more activity.