How to test fine tuned model on parallel data?
MusfiqDehan opened this issue · 3 comments
Hi, I have collected about 2.75 million Bengali-English parallel sentence pairs. I have already trained on 1,000 sentences as a trial run, but I do not understand how to test my trained model. How can I run the demo on my trained model? How can I load my trained model?
It would be a great help for me if anyone can help me with this issue.
Thanks in advance for the help.
Hi, you can use the following commands to generate aligned word pairs in $OUTPUT_WORD_FILE and check whether they make sense:
DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=/path/to/your/model
OUTPUT_FILE=/path/to/output/file
OUTPUT_WORD_FILE=/path/to/output/word/file
CUDA_VISIBLE_DEVICES=0 awesome-align \
--output_file=$OUTPUT_FILE \
--model_name_or_path=$MODEL_NAME_OR_PATH \
--data_file=$DATA_FILE \
--extraction 'softmax' \
--output_word_file=$OUTPUT_WORD_FILE \
--batch_size 32
If you have the reference alignments (see an example in https://github.com/neulab/awesome-align/blob/master/examples/roen.gold), you can compute the alignment error rate using:
python tools/aer.py $GROUND_TRUTH_FILE $OUTPUT_FILE
Add --oneRef to the command if your references are one-indexed.
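For readers who want to see what the score measures, here is a minimal Python sketch of the standard alignment error rate, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted links, S the sure reference links, and P the possible reference links (which include S). It is not the repository's tools/aer.py. The function names, the `one_indexed` parameter, and the convention that possible links are written as `ipj` instead of `i-j` are assumptions made for illustration; check your own reference file before relying on them.

```python
def parse_links(line, offset=0):
    """Parse one alignment line into (sure, possible) sets of (i, j) pairs.

    Assumption: sure links look like '3-4' and possible links like '3p4'.
    `offset` is subtracted from every index (use 1 for one-indexed files).
    """
    sure, possible = set(), set()
    for item in line.split():
        if "p" in item:
            i, j = item.split("p")
            possible.add((int(i) - offset, int(j) - offset))
        else:
            i, j = item.split("-")
            sure.add((int(i) - offset, int(j) - offset))
    # By definition the possible set contains every sure link.
    return sure, possible | sure


def alignment_error_rate(gold_lines, pred_lines, one_indexed=False):
    """Corpus-level AER over parallel lists of reference and predicted lines."""
    offset = 1 if one_indexed else 0
    hits_sure = hits_possible = n_pred = n_sure = 0
    for gold, pred in zip(gold_lines, pred_lines):
        sure, possible = parse_links(gold, offset)
        predicted, _ = parse_links(pred)  # predictions are zero-indexed i-j pairs
        hits_sure += len(predicted & sure)
        hits_possible += len(predicted & possible)
        n_pred += len(predicted)
        n_sure += len(sure)
    if n_pred + n_sure == 0:
        return 0.0  # nothing to score
    return 1.0 - (hits_sure + hits_possible) / (n_pred + n_sure)
```

A lower value is better; a prediction that matches the sure links exactly scores 0.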
Thank you so much for your response.
After training completed, I saw that a large file named pytorch_model.bin, containing all the trained weights, was saved in the Outputs/ folder. So I am guessing this is the final fine-tuned model. Now I want to load this model and test it on real-world parallel sentences to check how they are aligned. I cannot figure out how to write the corresponding code. I badly need this piece of code, as I intend to perform a PoS tagging task using this aligner.
In simple words, I need to be able to input 2 parallel sentences and find the corresponding alignment.
You can first prepare the parallel sentences in a file (e.g. a file named test.src-tgt) with the same format as https://github.com/neulab/awesome-align/blob/master/examples/enfr.src-tgt.
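If it helps, the input file can be produced with a few lines of Python. The sketch below assumes the format used by the linked example: one sentence pair per line, with the tokenized source and target separated by ` ||| `. The function name `write_parallel_file` is hypothetical, and splitting on whitespace is only a stand-in for whatever tokenization you used during training.

```python
from pathlib import Path


def write_parallel_file(pairs, path):
    """Write (source, target) sentence pairs as 'source ||| target' lines.

    Assumption: whitespace splitting is an acceptable tokenization here.
    Returns the list of lines that were written.
    """
    lines = []
    for src, tgt in pairs:
        # Normalize whitespace so tokens are separated by single spaces.
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        if not src_tokens or not tgt_tokens:
            raise ValueError("both sides of a sentence pair must be non-empty")
        lines.append(" ".join(src_tokens) + " ||| " + " ".join(tgt_tokens))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, `write_parallel_file([("আমি ভাত খাই", "I eat rice")], "test.src-tgt")` writes a one-line file that can be passed as the data file.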
Then, if your pytorch_model.bin is saved in the directory Outputs, you can just set MODEL_NAME_OR_PATH in the previous command to Outputs.
Here's an example command:
DATA_FILE=test.src-tgt
MODEL_NAME_OR_PATH=Outputs
OUTPUT_FILE=output.src-tgt
OUTPUT_WORD_FILE=output.words.src-tgt
CUDA_VISIBLE_DEVICES=0 awesome-align \
--output_file=$OUTPUT_FILE \
--model_name_or_path=$MODEL_NAME_OR_PATH \
--data_file=$DATA_FILE \
--extraction 'softmax' \
--output_word_file=$OUTPUT_WORD_FILE \
--batch_size 32
You can then see the output pairs in the i-j format in the file output.src-tgt. A pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word of the target sentence.
You can also see the aligned word pairs in the src_word<sep>tgt_word format in output.words.src-tgt.
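To use the alignments in downstream code (for example, to project PoS tags from one side to the other), the i-j pairs can be mapped back onto the words. The sketch below is one possible way to do that, assuming an input line in the `source ||| target` format and the matching line of zero-indexed i-j pairs from the output file; the function names are hypothetical.

```python
def parse_pairs(alignment_line):
    """Turn a line such as '0-0 1-2' into a sorted list of (i, j) index pairs."""
    pairs = []
    for item in alignment_line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return sorted(pairs)


def aligned_words(sentence_line, alignment_line):
    """Return (source_word, target_word) tuples for one sentence pair.

    Assumption: `sentence_line` is 'source ||| target' with whitespace-separated
    tokens, and `alignment_line` is the corresponding line of the output file.
    """
    src_text, tgt_text = sentence_line.strip().split(" ||| ")
    src_tokens = src_text.split()
    tgt_tokens = tgt_text.split()
    return [(src_tokens[i], tgt_tokens[j]) for i, j in parse_pairs(alignment_line)]
```

Reading test.src-tgt and output.src-tgt line by line and passing each pair of lines to `aligned_words` gives the word-level alignment for every sentence.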