TR-TPBS

A dataset for Thai text summarization.

Update 23 Nov. 2020

The official and larger version of this dataset, called ThaiSum, can be found in this repo. It also comes with several trained models available to download.

Download TR-TPBS Dataset

File	Remark	Size
TR-TPBS	contains `title`, `body`, `summary`, `labels`, `tags` and `url` columns.	2.05 GB

Additional Datasets

These two files are the previous versions of TR-TPBS, before being combined. Be noted that the articles in these files are preprocessed with slightly different filtering-out conditions of that TR-TPBS. The number in the end of datasets’ name indicates the approximate number of articles contained in each dataset. The newest articles contained in these two files are published online up to December 2019.

File	Remark	Size
Thairath-222k	contains `title`, `body`, `summary`, `labels`, `tags`, `url` and `date` columns.	1.72 GB
ThaiPBS-111k	contains similar columns as Thairath-222k’s except `date`.	0.51 GB

Need Pretrained Models for Research Purpose ?

If you would like to obtain pretrained summarization models for research purposes, please contact nakhun.chum(at sign)gmail.com. The following pretrained models are available upon request:

Model	Source code	Size
ARedSum-base	ARedSumSentRank	2.2 GB
ARedSum-CTX	↑	738 MB
BertSumExt	BertSum	2.2 GB
BertSumAbs	↑	3.7 GB
BertSumExtAbs	↑	3.7 GB

Introduction

TR-TPBS is a medium-size dataset, a multi-purpose NLP benchmark, especially for Thai language. This dataset is crawled from Thairath (TR) and ThaiPBS (TPBS) news websites. The main objective of this corpus is for Thai text summarization.

This dataset is the largest news dataset for Thai text summarization since the previous studies on this topic, as far as we know, used small size of dataset up to 500 documents. It was understandable because those studies were based on statistic methods not sequence-to-sequence ones. It didn’t require a large text for training. Therefore, our experiment is the very first study that experimented Thai text summarization with deep learning methods on the largest Thai text summarization dataset. We explored this dataset on both extractive and abstractive methods.

Apart from text summarization objective, TR-TPBS can be used for several other NLP tasks e.g. headline generation, news classification and keyphrase extraction (which may need additional pre-processing).

Dataset Properties

See exploration.ipynb

Experiment Settings and Results

We evaluate the performance of the TR-TPBS dataset using existing extractive and abstractive baselines. Please refer to PreSum, BertSum, and ARedSum for more technical information and their implementation codes.

Exeriment Settings

Both abstractive and extractive Bert-based (including ARedSum) summarization models are trained on a single GPU (NVIDIA TITAN RTX).

Extractive settings

Both BertSumExt and ARedSum models were trained for 100,000 steps with 6000 batch size. The rest of training settings are set identically to BertSum. It took approximately 80 hours to train each extractive model.

Abstractive Settings

All abstractive models were trained for 300,000 steps with 1120 batch size for Bert-based models and 1200 for Tranformers-based models. The rest of training settings are set identically to PreSum. It took approximately 150 hours to train each abstractive model.

We used ‘bert-base-multilingual-cased’ version of BERT in this experiment. We strongly suggest to train all Bert-based models on multiple GPUs for shorten the training time and the better results.

Results

ROUGE F1 of R1 R2 and RL are used to report these experimental results.

TR-TPBS (test set):

Models	R1	R2	RL
Extractive
Oracle	50.89	22.10	50.74
Lead-2	42.98	22.71	42.94
ARedSum	40.35	20.38	40.30
BertSumExt	44.58	20.26	44.51
Abstractive
BertSumAbs	51.09	26.92	51.04
BertSumExtAbs	53.19	28.19	53.13

Collected and Preposessed by

Nakhun Chumpolsathien, School of Computer Science, Beijing Institute of Technology, China
Tanachat Arayachutinan, School of Computer Science, Beijing Institute of Technology, China

License

TR-TPBS, Thairath-222k and ThaiPBS-111k datasets are licensed under MIT License.

Cite this work

@mastersthesis{chumpolsathien_2020, 
    title={Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization},
    author={Chumpolsathien, Nakhun}, 
    year={2020}, 
    school={Beijing Institute of Technology}