- This repository contains the code to generate (i.e., translate) standard Summarization datasets from English to Thai
- Currently, the translation system is Meta's NLLB-200, but other systems can be considered too
- Source = Input Document, Target = Gold Summary
- The translated outputs can be used for experimenting with the cascaded approach or training any models
The translated datasets are open-sourced on HuggingFace, so you can simply download them, for example,
from datasets import load_dataset
dataset = load_dataset("potsawee/xsum_thai")
Completed translated datasets include:
Dataset | Source | Target | Size (train/val/test) | Link |
---|---|---|---|---|
xsum_eng2thai |
✗ | ✓ | 204045/11332/11334 | https://huggingface.co/datasets/potsawee/xsum_eng2thai |
xsum_thai |
✓ | ✓ | 204045/11332/11334 | https://huggingface.co/datasets/potsawee/xsum_thai |
cnn_dailymail_thai |
✓ | ✓ | 287113/13368/11490 | https://huggingface.co/datasets/potsawee/cnn_dailymail_thai |
- Check and add License for each dataset
- Add data statistics, e.g., length, n-gram overlap etc
- Train (fine-tune) baseline summarization models