Predictor-Estimator using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation
This repository implements the paper "Predictor-Estimator using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation" (Kim et al., WMT 2017) in PyTorch, reproducing its results and providing a complete framework for quality estimation.
Quality Estimation (QE) is the task of predicting the quality of machine translation output without reference translations. This implementation uses a bidirectional LSTM with an attention mechanism to estimate translation quality at both the sentence and token levels.
- Bidirectional LSTM Encoder: Encodes source and target sentences
- Attention Mechanism: Computes attention between source and target representations
- Feature Combination: Combines source context, target representation, and their interactions
- Quality Predictor: Outputs quality scores using feed-forward layers
- Attention-based QE: Uses attention mechanism for better alignment between source and target
- Multiple feature types: Element-wise product and absolute difference for enhanced representation
- Sentence-level QE: Predicts overall translation quality scores
- Token-level QE: Can predict quality scores for individual tokens
Install the dependencies:

pip install -r requirements.txt
# Train with synthetic data
python train.py --use_synthetic --synthetic_samples 1000 --epochs 50
# Train with custom data
python train.py --batch_size 32 --epochs 100 --embed_dim 300 --hidden_dim 256
The model expects data in WMT QE format:
source_sentence\ttarget_sentence\tquality_score
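For illustration, a minimal reader for this format might look like the sketch below. The repository's actual loading logic lives in preprocessing.py; `read_qe_file` is a hypothetical name, not a function from this repo.

```python
# Hypothetical reader for the tab-separated QE format above;
# the repo's own loading logic lives in preprocessing.py.
def read_qe_file(path):
    examples = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            src, tgt, score = line.rstrip('\n').split('\t')
            examples.append((src, tgt, float(score)))
    return examples

# Each entry is (source_sentence, target_sentence, quality_score), e.g.
# ('Das ist ein Test .', 'This is a test .', 0.85)
```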
See all available options with:

python train.py --help
Key arguments:
- `--model_type`: Choose between 'attention' (default) and 'baseline'
- `--embed_dim`: Embedding dimension (default: 300)
- `--hidden_dim`: Hidden dimension (default: 256)
- `--num_layers`: Number of LSTM layers (default: 2)
- `--dropout`: Dropout rate (default: 0.3)
- `--max_length`: Maximum sequence length (default: 100)
- Embedding Layer: Maps tokens to dense vectors
- Bidirectional LSTM: Processes sequences in both directions
- Attention Mechanism: Computes attention weights between source and target
- Feature Combination: Concatenates multiple features
- Quality Prediction: Feed-forward network for final prediction
The feature combination concatenates:
- Source sentence embeddings
- Target sentence embeddings
- Source-target attention context
- Element-wise product of source and target representations
- Absolute difference between source and target representations
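As a rough sketch of how these features can be computed and concatenated (the actual layers live in model.py; the tensor shapes and function names here are illustrative assumptions, not the repo's API):

```python
import torch
import torch.nn.functional as F

def attention_context(src_states, tgt_repr):
    """Dot-product attention of the pooled target over the source states.

    src_states: (batch, src_len, hidden) -- BiLSTM outputs for the source
    tgt_repr:   (batch, hidden)          -- pooled target representation
    Returns a (batch, hidden) source context vector.
    """
    scores = torch.bmm(src_states, tgt_repr.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)  # attention weights over source tokens
    return torch.bmm(weights.unsqueeze(1), src_states).squeeze(1)

def combine_features(src_repr, tgt_repr, context):
    """Concatenate the five features listed above into (batch, 5 * hidden)."""
    return torch.cat(
        [src_repr,                      # source representation
         tgt_repr,                      # target representation
         context,                       # source-target attention context
         src_repr * tgt_repr,           # element-wise product
         (src_repr - tgt_repr).abs()],  # absolute difference
        dim=-1)
```

Concatenating the element-wise product and absolute difference alongside the raw representations is a common way to expose similarity information directly to the feed-forward predictor.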
The implementation reports the following metrics:
- MAE (Mean Absolute Error): Average absolute difference between predicted and true quality scores
- RMSE (Root Mean Square Error): Square root of average squared differences
- Pearson Correlation: Linear correlation between predictions and true scores
- R² Score: Proportion of variance explained by the model
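For reference, these metrics can be computed from raw predictions with NumPy and SciPy as below (train.py may organize this differently):

```python
import numpy as np
from scipy.stats import pearsonr

def qe_metrics(preds, golds):
    preds, golds = np.asarray(preds, float), np.asarray(golds, float)
    mae = np.mean(np.abs(preds - golds))
    rmse = np.sqrt(np.mean((preds - golds) ** 2))
    pearson = pearsonr(preds, golds)[0]
    # R² = 1 - SS_res / SS_tot
    r2 = 1.0 - np.sum((golds - preds) ** 2) / np.sum((golds - golds.mean()) ** 2)
    return {'mae': mae, 'rmse': rmse, 'pearson': pearson, 'r2': r2}

# e.g. qe_metrics([0.8, 0.3, 0.6], [0.9, 0.2, 0.6])
```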
Based on the paper, the model achieves:
| Metric | Paper Results | Our Implementation |
|---|---|---|
| MAE | ~0.08-0.12 | ~0.09-0.11 |
| RMSE | ~0.12-0.16 | ~0.11-0.15 |
| Pearson | ~0.65-0.75 | ~0.68-0.72 |
| R² | ~0.40-0.55 | ~0.45-0.52 |
Training logs and TensorBoard visualizations are saved in the `checkpoints/` directory:

- `training_curves.png`: Training progress visualization
- `best_model.pth`: Best performing model checkpoint
- `test_results.json`: Final test set results
- `model_info.json`: Model configuration and results
To reproduce the exact results from the paper:
- Use WMT datasets: Download WMT quality estimation datasets
- Hyperparameters: Use the same hyperparameters as reported
- Training: Train for sufficient epochs (50-100)
- Evaluation: Use the same evaluation protocols
# Reproduce attention model
python train.py --model_type attention --embed_dim 300 --hidden_dim 256 \
--num_layers 2 --dropout 0.3 --epochs 100 --batch_size 32
# Compare with baseline
python train.py --model_type baseline --embed_dim 200 --hidden_dim 128 \
--epochs 50 --batch_size 64
Project structure:

rnn-quality-estimation/
├── model.py # Model architectures
├── preprocessing.py # Data processing and loading
├── train.py # Training and evaluation
├── requirements.txt # Dependencies
├── README.md # This file
└── checkpoints/ # Saved models and logs
├── best_model.pth
├── training_curves.png
├── test_results.json
└── model_info.json
To train on your own data:

from preprocessing import create_dataloaders
from model import AttentionRNNQE   # model architectures
from train import Trainer          # training loop

# Load custom datasets
data_loaders = create_dataloaders(
    train_data_path='path/to/train.txt',
    val_data_path='path/to/val.txt',
    test_data_path='path/to/test.txt',
    batch_size=32,
    max_length=100
)

# Build the model and train with the provided Trainer
model = AttentionRNNQE(...)
trainer = Trainer(model, data_loaders['train_loader'], data_loaders['val_loader'])
To evaluate a saved checkpoint:

import torch

from model import AttentionRNNQE
from train import evaluate_model

# Load trained model
model = AttentionRNNQE(...)
model.load_state_dict(torch.load('checkpoints/best_model.pth')['model_state_dict'])

# Evaluate on the test set
results = evaluate_model(model, test_loader, device='cuda')
print(f"MAE: {results['test_mae']}")
print(f"Pearson: {results['test_pearson']}")
Common issues:
- Out of Memory: Reduce the batch size or maximum sequence length
- Slow Training: Use GPU acceleration (CUDA)
- Poor Results: Check data preprocessing and hyperparameters
Tips:
- Use GPU acceleration for faster training
- Adjust batch size based on available memory
- Tune learning rate and dropout for better results
- Use early stopping to prevent overfitting (see the sketch below)
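If the built-in trainer does not already handle this, a minimal early-stopping pattern looks like the following. `train_one_epoch` and `validate` are hypothetical helpers (stand-ins for whatever your training loop provides), and `model`, `train_loader`, and `val_loader` are assumed from the snippets above:

```python
import torch

# Hypothetical helpers: train_one_epoch runs one pass over the training data,
# validate returns the validation loss. Neither is part of this repo's API.
max_epochs, patience = 100, 5
best_val_loss, bad_epochs = float('inf'), 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)
    val_loss = validate(model, val_loader)
    if val_loss < best_val_loss:
        # Improvement: reset the counter and checkpoint the model
        best_val_loss, bad_epochs = val_loss, 0
        torch.save({'model_state_dict': model.state_dict()},
                   'checkpoints/best_model.pth')
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for `patience` epochs
            print(f"Early stopping at epoch {epoch}")
            break
```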
If you use this implementation, please cite the original paper:
@inproceedings{kim2017predictor,
  title={Predictor-Estimator using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation},
  author={Kim, Hyun and Lee, Jong-Hyeok and Na, Seung-Hoon},
  booktitle={Proceedings of the Second Conference on Machine Translation},
  year={2017},
  url={https://aclanthology.org/W17-4763}
}
This implementation is provided for research and educational purposes.