How to generate the anonymized version?
Oscar860601 opened this issue · 4 comments
Oscar860601 commented
@abisee Did you write code for generating the anonymized version of the cnn-dailymail summarization dataset?
AlJohri commented
The original data is from here: https://github.com/danqi/rc-cnn-dailymail
- CNN: http://cs.stanford.edu/~danqi/data/cnn.tar.gz (546M)
- Daily Mail: http://cs.stanford.edu/~danqi/data/dailymail.tar.gz (1.4G)
The code to download them is here: https://github.com/deepmind/rc-data
Oscar860601 commented
Oh I meant anonymized summarization data.
Only non-anonymized summarization data and anonymized QA data are available for cnn-dailymail.
I was just wondering whether there is open-source code to obtain the anonymized summarization data, since it's widely used.
Thanks a lot anyway.
AlJohri commented
The same dataset for QA was repurposed for summarization. If you look at generate_questions.py in the rc-data repository, it should get you most of the way there.
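For illustration only, here is a minimal sketch of the marker-substitution step behind anonymization. It assumes you already have a list of entity strings for each story (for example from an NER or coreference tool, which is not shown). The names `build_entity_map` and `anonymize` are hypothetical and are not taken from generate_questions.py; only the `@entityN` marker format follows the anonymized QA data.

```python
import re


def build_entity_map(entities):
    """Assign each distinct entity string a marker such as @entity0."""
    mapping = {}
    for name in entities:
        if name not in mapping:
            mapping[name] = "@entity{}".format(len(mapping))
    return mapping


def anonymize(text, mapping):
    """Replace every whole-word mention of a known entity with its marker."""
    if not mapping:
        return text
    # Try longer names first so "New York City" wins over "New York".
    names = sorted(mapping, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical example inputs; real entity lists would come from an
# upstream annotation step.
entities = ["Barack Obama", "Chicago"]
mapping = build_entity_map(entities)

article = "Barack Obama spoke in Chicago on Monday."
summary = "Barack Obama visits Chicago."

# Using one shared mapping keeps markers consistent between the
# article and its summary.
anon_article = anonymize(article, mapping)
anon_summary = anonymize(summary, mapping)
```

Sharing one mapping per story is the important design choice here: the same entity gets the same marker in both the article and the summary, which is what a summarization model needs.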
Oscar860601 commented
@AlJohri Thanks!
I will try to modify this code.