SemMT

SemMT is an automatic testing approach for machine translation systems based on semantic similarity checking. It applies round-trip translation and measures the semantic similarity between the original and the back-translated sentences.

The key insight is that the semantics of logical relations and quantifiers in sentences can be captured by regular expressions (or deterministic finite automata), to which efficient semantic equivalence/similarity checking algorithms can be applied.

Workflow of SemMT

Step 1. Round-trip Translation

Round-trip translation (RTT) translates a given text or sentence into an intermediate language (the forward translation), and then translates the result back into the source language (the back translation). The benefit of adopting RTT in our methodology is that the semantics of the source and back-translated sentences can be uniformly measured and compared in the same language.
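The round-trip step can be sketched as below. This is a minimal illustration, not the code used in this repository: the `Translator` signature, the `round_trip` helper, the default language codes, and the lookup-table translator `toy_translate` are all hypothetical stand-ins for a real machine translation system.

```python
from typing import Callable, Dict, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> translation.
Translator = Callable[[str, str, str], str]


def round_trip(text: str, translate: Translator,
               src: str = "en", pivot: str = "zh") -> str:
    """Translate src -> pivot (forward), then pivot -> src (back)."""
    forward = translate(text, src, pivot)
    return translate(forward, pivot, src)


# Toy stand-in for a real MT system, used only to make the sketch runnable.
TOY_TABLE: Dict[Tuple[str, str, str], str] = {
    ("lines with a digit", "en", "zh"): "带有数字的行",
    ("带有数字的行", "zh", "en"): "lines containing a number",
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    return TOY_TABLE[(text, src, tgt)]


source = "lines with a digit"
back = round_trip(source, toy_translate)
print(source, "->", back)
```

Because both `source` and `back` are in the source language, they can be passed to the same downstream transformation and compared directly.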

Step 2. Regex Transformation

This step abstracts and transforms the source and back-translated sentences into regular expressions using an NL2RE model.

Step 3. Similarity Calculation

This step calculates the semantic similarity between the regular expressions based on three regex-related metrics.

We proposed three metrics to measure semantic similarities:

- Regex-based Similarity (SemMT-R): computes the Levenshtein distance between two regular expressions
- DFA-based Similarity (SemMT-D): computes the Jaccard similarity between the regular languages of two regular expressions
- Hybrid Similarity (SemMT-H): a hybrid metric that draws on the advantages of both by combining SemMT-R and SemMT-D with customized weights
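The three metrics can be illustrated with the sketch below. It rests on several assumptions that are not taken from this repository: the regexes are written in Python `re` syntax, the Levenshtein distance is normalized by the longer regex length to obtain a similarity, the hybrid metric is a linear combination with a hypothetical `weight` parameter, and the Jaccard similarity is only approximated by enumerating strings up to `max_len` over a small alphabet rather than computed on DFAs.

```python
import re
from itertools import product
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def regex_similarity(r1: str, r2: str) -> float:
    """SemMT-R style: edit distance on the regex text (normalization assumed)."""
    longest = max(len(r1), len(r2))
    return 1.0 if longest == 0 else 1.0 - levenshtein(r1, r2) / longest


def bounded_language(regex: str, alphabet: str, max_len: int) -> Set[str]:
    """All strings over `alphabet` up to `max_len` that the regex fully matches."""
    pattern = re.compile(regex)
    return {"".join(chars)
            for n in range(max_len + 1)
            for chars in product(alphabet, repeat=n)
            if pattern.fullmatch("".join(chars))}


def dfa_similarity(r1: str, r2: str, alphabet: str = "ab1", max_len: int = 4) -> float:
    """SemMT-D style: Jaccard similarity, approximated on a bounded language."""
    l1 = bounded_language(r1, alphabet, max_len)
    l2 = bounded_language(r2, alphabet, max_len)
    union = l1 | l2
    return 1.0 if not union else len(l1 & l2) / len(union)


def hybrid_similarity(r1: str, r2: str, weight: float = 0.5) -> float:
    """SemMT-H style: weighted combination (linear form assumed)."""
    return weight * regex_similarity(r1, r2) + (1 - weight) * dfa_similarity(r1, r2)


print(regex_similarity("a+", "aa*"), dfa_similarity("a+", "aa*"))
```

The pair `a+` and `aa*` shows why the two views complement each other: the regex texts differ, so the text-based score is low, while both accept the same language, so the language-based score is 1.0.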

Step 4. Mistranslation Detection

This step detects mistranslations according to customized thresholds and reports them.
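A minimal sketch of threshold-based detection follows. It assumes that a pair is reported when its similarity score falls below the threshold; the function name `detect_mistranslations`, the tuple layout, and the threshold value `0.9` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# (source sentence, back-translated sentence, similarity score in [0, 1])
ScoredPair = Tuple[str, str, float]


def detect_mistranslations(pairs: Iterable[ScoredPair],
                           threshold: float = 0.9) -> List[ScoredPair]:
    """Report every pair whose similarity is below the chosen threshold."""
    return [(src, back, score) for src, back, score in pairs if score < threshold]


scored = [
    ("lines with a digit", "lines with a digit", 1.0),
    ("lines with a digit", "lines without a digit", 0.2),
]
for src, back, score in detect_mistranslations(scored):
    print(f"suspicious: {src!r} -> {back!r} (similarity {score})")
```

Raising the threshold reports more pairs and lowering it reports fewer, so the value would normally be tuned against labeled data.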

Reproduction

Prerequisite

Python 3.6+

pip3 install -r requirements.txt

Training Data Preparation

The training data of our regular expression transformer can be found at NL-RX-Synth-Augmented.txt

The original data can be found at NL-RX-Synth.txt

RQ1. Effectiveness of SemMT

To evaluate the effectiveness of SemMT, we randomly sampled 500 sentences from the NL-RX-Synth dataset, applied round-trip translation, and collected the translation results. We then transformed both the original sentences and the round-trip translation results into regular expressions using the trained transformation model, which is described earlier in the experiment setup.

Dataset with labels can be found here
An executable demo can be found here. It includes:
- A snippet showing how the similarity metrics are calculated over a pair of sentences
- Visualization of Accuracy, Precision, Recall, and F-score (also Sensitivity and Specificity)
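The measures listed above follow from the confusion matrix of labels against detections. The sketch below assumes boolean lists in which `True` means "mistranslation"; the function name `classification_metrics` is hypothetical, and the demo's own implementation may differ.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(labels: Sequence[bool],
                           predictions: Sequence[bool]) -> Dict[str, float]:
    tp = sum(1 for l, p in zip(labels, predictions) if l and p)
    tn = sum(1 for l, p in zip(labels, predictions) if not l and not p)
    fp = sum(1 for l, p in zip(labels, predictions) if not l and p)
    fn = sum(1 for l, p in zip(labels, predictions) if l and not p)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # recall is the same quantity as sensitivity
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_score": _ratio(2 * tp, 2 * tp + fp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


labels = [True, True, True, False, False]
predictions = [True, True, False, True, False]
print(classification_metrics(labels, predictions))
```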

RQ2. Comparison with Existing Works

The 200 initial seeds used in RQ2 are in Seeds.

For the experimental result of each work, please see the following folder:

An executable demo can be found here. It includes:

- Reading in the bug report
- Visualization of comparisons
- Presentation of the best results

RQ3. Can SemMT find bugs that are not detected by other metrics?

Dataset with labels can be found here