
Preprocessing of the Chinese summarization dataset LCSTS

  • xmltodict==0.11.0
  • stanfordcorenlp=


LCSTS is a text summarization dataset collected from the Sina Weibo social media platform. It contains about 2.4M news texts and their corresponding titles, split across 3 files in XML format:

  • PART_I.txt, for training, around 2.39M instances
  • PART_II.txt, for validation, around 10k instances
  • PART_III.txt, for testing, 1066 instances

Average lengths of the tokenized articles and summaries:

Avg. article len    Avg. summary len
63.76               10.34
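
Averages like these could be recomputed from the tokenized files with a few lines of Python. The sketch below is not part of this repository; it assumes one instance per line and whitespace-separated tokens, and `average_token_length` is a hypothetical helper name.

```python
def average_token_length(lines):
    """Mean number of whitespace-separated tokens per non-empty line."""
    counts = [len(line.split()) for line in lines if line.strip()]
    # Guard against an empty input so we never divide by zero.
    return sum(counts) / len(counts) if counts else 0.0


# In-memory sample; any iterable of lines works, including an open file handle.
sample = ["今天 天气 很 好", "我们 去 公园"]
print(average_token_length(sample))
```

For a real file, an open handle such as `open("tokenized/PART_I.article", encoding="utf-8")` can be passed in place of `sample`.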


  • Parse the XML file and separate the articles and summaries into two files.
python parser.py PART_I.txt

This generates two files, PART_I.article and PART_I.summary.
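
As a rough illustration of this step, the sketch below splits records into article/summary pairs and writes them to two files. It is not the code in parser.py: it uses a standard-library regular expression as a stand-in for xmltodict, and it assumes each record wraps the summary in `<summary>` and the article in `<short_text>`, which this README does not confirm. The names `split_pairs` and `write_pairs` are hypothetical.

```python
import re

# Assumed record layout (not confirmed by this README):
#   <doc id=0> <summary> ... </summary> <short_text> ... </short_text> </doc>
RECORD_RE = re.compile(
    r"<summary>(?P<summary>.*?)</summary>\s*"
    r"<short_text>(?P<article>.*?)</short_text>",
    re.DOTALL,
)


def split_pairs(raw):
    """Return a list of (article, summary) pairs with whitespace normalized."""
    pairs = []
    for match in RECORD_RE.finditer(raw):
        article = " ".join(match.group("article").split())
        summary = " ".join(match.group("summary").split())
        pairs.append((article, summary))
    return pairs


def write_pairs(pairs, prefix):
    """Write one article per line to <prefix>.article and one summary per line to <prefix>.summary."""
    with open(prefix + ".article", "w", encoding="utf-8") as articles, \
         open(prefix + ".summary", "w", encoding="utf-8") as summaries:
        for article, summary in pairs:
            articles.write(article + "\n")
            summaries.write(summary + "\n")
```

Writing both files in the same loop keeps line N of the article file aligned with line N of the summary file.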

  • Tokenize and replace digits with '#'.
python tokenizer.py PART_I.article tokenized/PART_I.article

We use stanfordcorenlp for tokenization, so processing PART_I.article takes a long time, about 3-5 hours.
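
The digit-replacement part of this step can be sketched independently of the tokenizer. The example below is an illustration rather than the code in tokenizer.py: it assumes every digit character is replaced individually, that masking happens after segmentation, and it uses a character-level stand-in (`char_segment`) instead of stanfordcorenlp so that it runs without a CoreNLP installation. All function names here are hypothetical.

```python
import re

# In Python 3, \d on str also matches full-width digits such as "５".
DIGIT_RE = re.compile(r"\d")


def mask_digits(text):
    """Replace every digit character with '#'."""
    return DIGIT_RE.sub("#", text)


def tokenize_line(line, segment):
    """Segment a line with the supplied callable, then mask digits in each token."""
    return " ".join(mask_digits(token) for token in segment(line))


def char_segment(line):
    """Stand-in segmenter for illustration only: one token per non-space character."""
    return [ch for ch in line if not ch.isspace()]


print(tokenize_line("2014年", char_segment))  # prints "# # # # 年"
```

The segmenter is passed in as a parameter so that a real word segmenter can be substituted without changing the masking logic.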