Couplet dataset.
This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.
This dataset contains more than 700,000 couplets.
Run the spider:
scrapy runspider sina_spider.py
It will store the data in ./output/.
Download the data
An already fetched and cleaned dataset is available and can be used directly with the seq2seq model. You can download it here.
The downloaded data contains 5 files:
train/in.txt: The input of the couplets. Each line is one input. Words are separated by spaces.
train/out.txt: The output of the couplets. Each line is the output for the same line in train/in.txt. Words are separated by spaces.
test/in.txt: Same as train/in.txt but with less data.
test/out.txt: Same as train/out.txt but with less data.
vocabs: Vocabulary file. <s> and <\s> are added as the first entries; they are used when training the seq2seq model.
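
The file layout above can be read with a few lines of Python. This is a minimal sketch, not part of the project: the function names (read_tokens, load_pairs, load_vocab) are hypothetical, and it assumes the files are UTF-8 encoded and that vocabs holds one entry per line.

```python
from pathlib import Path


def read_tokens(path):
    """Read one file; each line becomes a list of space-separated tokens."""
    # Assumption: the dataset files are UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]


def load_pairs(data_dir, split="train"):
    """Pair line i of <split>/in.txt with line i of <split>/out.txt."""
    base = Path(data_dir) / split
    inputs = read_tokens(base / "in.txt")
    outputs = read_tokens(base / "out.txt")
    if len(inputs) != len(outputs):
        raise ValueError(
            f"{split}: in.txt has {len(inputs)} lines, "
            f"out.txt has {len(outputs)}"
        )
    return list(zip(inputs, outputs))


def load_vocab(data_dir):
    """Map each vocabulary entry to its line index (0-based)."""
    # Assumption: the vocabs file holds one entry per line.
    with open(Path(data_dir) / "vocabs", encoding="utf-8") as f:
        return {line.rstrip("\n"): index for index, line in enumerate(f)}
```

Calling load_pairs(data_dir, "test") on the directory where the download was unpacked returns the test pairs in the same shape, and load_vocab(data_dir) gives a token-to-index mapping in which the two marker entries come first.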