/VCC

VietnameseCompressionCorpus

A manually compressed Vietnamese sentence dataset.

The dataset was built from 30 clusters of the "Written Vietnamese Cluster Corpus for Document Summarization" by the Tran Mai Vu Group, VNU Hanoi University of Engineering and Technology.

Three annotators were each asked to write a compressed version of every source sentence, giving three compressed versions per sentence. The data is stored as XML, with each source sentence followed by its three human-compressed versions:

	<original> source sentence </original>
	<compressed_human_1> compressed version by human 1 </compressed_human_1>
	<compressed_human_2> compressed version by human 2 </compressed_human_2>
	<compressed_human_3> compressed version by human 3 </compressed_human_3>
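The layout above can be read with Python's standard `xml.etree.ElementTree`. The following is a minimal sketch, not an official loader: it assumes the four tags appear as siblings with no enclosing element and no XML declaration, so it wraps the text in a synthetic `<root>` before parsing. The function name `parse_records` and the record dictionary shape are illustrative choices, not part of the dataset.

```python
import xml.etree.ElementTree as ET


def parse_records(fragment):
    """Group each <original> with the <compressed_human_N> tags that follow it.

    Assumption: `fragment` has no single enclosing element and no XML
    declaration, so a synthetic <root> is added to make it parseable.
    """
    root = ET.fromstring("<root>" + fragment + "</root>")
    records = []
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "original":
            # A new source sentence starts a new record.
            records.append({"original": text, "compressed": []})
        elif child.tag.startswith("compressed_human_") and records:
            # Attach the compression to the most recent source sentence.
            records[-1]["compressed"].append(text)
    return records


# Placeholder text copied from the layout shown above, not real corpus data.
sample = """
<original> source sentence </original>
<compressed_human_1> compressed version by human 1 </compressed_human_1>
<compressed_human_2> compressed version by human 2 </compressed_human_2>
<compressed_human_3> compressed version by human 3 </compressed_human_3>
"""

records = parse_records(sample)
```

If the actual files do have an enclosing element or an XML declaration, `ET.parse` on the file path is the more direct route and the wrapping step should be dropped.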

The compression rate of this manually compressed dataset, defined as the percentage of words from the source sentence that are retained in the compression, is about 67.94%.
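The compression rate defined above can be sketched as a ratio of word counts. This is an illustration under stated assumptions, not the procedure used to obtain the reported figure: it counts whitespace-separated tokens (in Vietnamese these are syllables, and a word may span several of them, so a word segmenter would give different counts), and it averages per sentence pair, which may differ from a corpus-level ratio. The function names are hypothetical.

```python
def compression_rate(original, compressed):
    """Percentage of tokens kept, using whitespace tokenization (an assumption)."""
    source_tokens = original.split()
    if not source_tokens:
        raise ValueError("original sentence is empty")
    return 100.0 * len(compressed.split()) / len(source_tokens)


def mean_compression_rate(pairs):
    """Average the per-pair rate over (original, compressed) pairs."""
    rates = [compression_rate(orig, comp) for orig, comp in pairs]
    if not rates:
        raise ValueError("no sentence pairs given")
    return sum(rates) / len(rates)


# Toy strings for illustration only, not taken from the corpus.
toy_pairs = [
    ("a b c d", "a b c"),  # 3 of 4 tokens kept
    ("a b", "a"),          # 1 of 2 tokens kept
]
toy_rate = mean_compression_rate(toy_pairs)
```

A lower rate means more aggressive compression; a rate of 100 means the compression kept as many tokens as the source.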