neulab/awesome-align

How to test fine tuned model on parallel data?

MusfiqDehan opened this issue · 3 comments

Hi, I have collected about 2.75 million Bengali-English parallel sentences. I have already trained on 1,000 sentences as a trial run, but I do not understand how to test my trained model. How can I run the demo with my trained model? How can I load it?
I would be grateful if anyone could help me with this issue.
Thanks in advance for the help.

Hi, you can use the following commands to generate aligned word pairs in $OUTPUT_WORD_FILE and check whether they make sense:

DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=/path/to/your/model
OUTPUT_FILE=/path/to/output/file
OUTPUT_WORD_FILE=/path/to/output/word/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --output_word_file=$OUTPUT_WORD_FILE \
    --batch_size 32

If you have the reference alignments (see an example in https://github.com/neulab/awesome-align/blob/master/examples/roen.gold), you can compute the alignment error rates using:

python tools/aer.py $GROUND_TRUTH_FILE $OUTPUT_FILE

Add --oneRef to the command if your references are one-indexed.
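For intuition about what the score means, here is a minimal sketch of the standard alignment error rate definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links and P the possible gold links (with S a subset of P). This is an illustration only, not the code in tools/aer.py. The function name `alignment_error_rate` is hypothetical, and the sketch does not parse the gold file format.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Compute AER from sets of (source_index, target_index) links.

    `possible` defaults to `sure` when the gold data has no separate
    "possible" links; sure links are always counted as possible too.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = sure | set(possible or ())
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        raise ValueError("need at least one predicted or sure link")
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# One of the two predicted links matches the gold links.
print(alignment_error_rate({(0, 0), (1, 2)}, {(0, 0), (1, 1)}))
```

Lower is better: a prediction that contains every sure link and nothing outside the possible links scores 0.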

Thank you so much for your response.

After training finished, I saw that a large file named pytorch_model.bin, containing all the trained weights, had been saved in the Outputs/ folder. So I am guessing this is the final fine-tuned model. Now I want to load this model and test it on real-world parallel sentences to check how they are aligned. I just cannot figure out how to write the corresponding code. I badly need this piece of code because I intend to perform a PoS tagging task using this aligner.

In simple words, I need to be able to input 2 parallel sentences and find the corresponding alignment.

You can first prepare the parallel sentences in a file (e.g. a file named test.src-tgt) with the same format as https://github.com/neulab/awesome-align/blob/master/examples/enfr.src-tgt.
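If your sentence pairs live in Python rather than in a file, a small helper along these lines could write them out. It is a sketch that assumes the format used in the linked example: one pair per line, whitespace-tokenized, with the source and target separated by ` ||| `. The names `format_pair` and `write_parallel_file` are hypothetical and not part of awesome-align.

```python
from pathlib import Path

# Assumed separator, taken from the layout of examples/enfr.src-tgt.
SEPARATOR = " ||| "


def format_pair(src, tgt):
    """Return one 'source ||| target' line with normalized whitespace."""
    src = " ".join(src.split())
    tgt = " ".join(tgt.split())
    if not src or not tgt:
        raise ValueError("both sides of a pair must be non-empty")
    return f"{src}{SEPARATOR}{tgt}"


def write_parallel_file(pairs, path):
    """Write (source, target) pairs to `path`, one pair per line."""
    lines = [format_pair(src, tgt) for src, tgt in pairs]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    write_parallel_file([("আমি ভাত খাই", "I eat rice")], "test.src-tgt")
```

The sentences should already be tokenized the way you want them aligned, since word indices in the output refer to whitespace-separated tokens.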

Then, if your pytorch_model.bin is saved in the directory Outputs, you can just set MODEL_NAME_OR_PATH in the previous command to Outputs.

Here's an example command:

DATA_FILE=test.src-tgt
MODEL_NAME_OR_PATH=Outputs
OUTPUT_FILE=output.src-tgt
OUTPUT_WORD_FILE=output.words.src-tgt

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --output_word_file=$OUTPUT_WORD_FILE \
    --batch_size 32

You can then see the output pairs in the i-j format in the file output.src-tgt. A pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word (also zero-indexed) of the target sentence.

You can also see the aligned word pairs in the src_word<sep>tgt_word format in output.words.src-tgt.
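To use the i-j output in your own code, for example to look up which target word each source word maps to, something like the following sketch may help. It assumes each output line is a space-separated list of zero-indexed i-j pairs and that the sentences are whitespace-tokenized. The function names are hypothetical and the sample alignment is made up for illustration.

```python
def parse_alignment_line(line):
    """Turn a line such as '0-0 1-2 2-1' into a list of (i, j) tuples."""
    pairs = []
    for token in line.split():
        i, j = token.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(src_sentence, tgt_sentence, pairs):
    """Map index pairs back to (source_word, target_word) tuples."""
    src = src_sentence.split()
    tgt = tgt_sentence.split()
    return [(src[i], tgt[j]) for i, j in pairs]


# Illustrative alignment, not real model output.
pairs = parse_alignment_line("0-0 1-2 2-1")
print(aligned_words("আমি ভাত খাই", "I eat rice", pairs))
```

Line k of the output file corresponds to line k of the input file, so you can read the two files in parallel and call these helpers once per line.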