Evaluating BART on CNN/DM: How to process the dataset
astariul opened this issue · 13 comments
From the README of BART for reproducing CNN/DM results:

> Follow the instructions here to download and process the data into files such that `test.source` and `test.target` have one line for each non-tokenized sample.

After following the instructions, I don't have files like `test.source` and `test.target`... Instead, I have `test.bin` and chunked versions of this file (`chunked/test_000.bin` ~ `chunked/test_011.bin`).

How can I process `test.bin` into `test.source` and `test.target`?
Thanks for the interest. You need to remove https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L235 and comment out all the tf code in the `write_to_bin` function. You need to feed the raw data (no tokenization) to the GPT-2 BPE.
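For context, the tf-specific part of `write_to_bin` in the original script serializes each sample as a length-prefixed `tf.Example` proto into the `.bin` file. A rough sketch of the change (the commented lines are the ones to disable; `article` and `abstract` are the script's own variables):

```python
# tf code to comment out: it packs each sample into a length-prefixed tf.Example.
# tf_example = example_pb2.Example()
# tf_example.features.feature['article'].bytes_list.value.extend([article])
# tf_example.features.feature['abstract'].bytes_list.value.extend([abstract])
# tf_example_str = tf_example.SerializeToString()
# str_len = len(tf_example_str)
# writer.write(struct.pack('q', str_len))
# writer.write(struct.pack('%ds' % str_len, tf_example_str))

# Replacement: write one raw, non-tokenized sample per line instead.
source_file.write(article + '\n')
target_file.write(abstract + '\n')
```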
Note
I also had to modify this line in order to remove `<s>` and `</s>` from the target file.
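The referenced line is where the script wraps each abstract sentence in `<s> ... </s>` markers. A minimal, self-contained sketch of the equivalent cleanup (the function name is mine):

```python
def strip_sentence_tags(abstract):
    """Drop the <s>/</s> sentence markers from an abstract string."""
    return ' '.join(abstract.replace('<s>', ' ').replace('</s>', ' ').split())

# strip_sentence_tags("<s> first sentence . </s> <s> second . </s>")
# -> "first sentence . second ."
```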
Note 2
To get better results, I also had to keep the text cased. To do this, I removed the line that lowercases everything.
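In the original `make_datafiles.py` the lowercasing happens when the story lines are read. A small, hedged sketch of a reader that keeps the casing (the helper name is mine):

```python
def read_story_lines(story_file, lowercase=False):
    """Read a .story file, keeping the original casing by default."""
    with open(story_file, 'r') as f:
        lines = [line.strip() for line in f]
    if lowercase:
        # The original script always did this; skipping it keeps the text cased.
        lines = [line.lower() for line in lines]
    return lines
```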
I followed these instructions but I'm getting `.bin` files instead of `.source` and `.target` files. Am I missing something? I'm also trying to reproduce these results.

I modified the `write_to_bin` function to the following. Is this the correct data format?
```python
def write_to_bin(url_file, out_file, makevocab=False):
  """Reads the tokenized .story files corresponding to the urls listed in the url_file and writes them to a out_file."""
  print "Making bin file for URLs listed in %s..." % url_file
  url_list = read_text_file(url_file)
  url_hashes = get_url_hashes(url_list)
  story_fnames = [s + ".story" for s in url_hashes]
  num_stories = len(story_fnames)

  if makevocab:
    vocab_counter = collections.Counter()

  with open('%s.target' % out_file, 'wb') as target_file:
    with open('%s.source' % out_file, 'wb') as source_file:
      for idx, s in enumerate(story_fnames):
        if idx % 1000 == 0:
          print "Writing story %i of %i; %.2f percent done" % (idx, num_stories, float(idx) * 100.0 / float(num_stories))

        # Look in the tokenized story dirs to find the .story file corresponding to this url
        if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)):
          story_file = os.path.join(cnn_tokenized_stories_dir, s)
        elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)):
          story_file = os.path.join(dm_tokenized_stories_dir, s)
        else:
          print "Error: Couldn't find tokenized story file %s in either tokenized story directories %s and %s. Was there an error during tokenization?" % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
          # Check again if tokenized stories directories contain correct number of files
          print "Checking that the tokenized stories directories %s and %s contain correct number of files..." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
          check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories)
          check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories)
          raise Exception("Tokenized stories directories %s and %s contain correct number of files but story file %s found in neither." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s))

        # Get the strings to write to the .source/.target files
        article, abstract = get_art_abs(story_file)
        target_file.write(abstract + '\n')
        source_file.write(article + '\n')
```
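For completeness, a hedged usage sketch mirroring how the original script's `main()` calls this function (the `all_*_urls` and `finished_files_dir` names come from `make_datafiles.py`; note the stem is passed without an extension, so the outputs become `test.source` / `test.target`):

```python
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test"))
write_to_bin(all_val_urls, os.path.join(finished_files_dir, "val"))
write_to_bin(all_train_urls, os.path.join(finished_files_dir, "train"))
```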
There are many details; here is my code.
I fixed the overlong lines in `train.bpe.source`, caused by ASCII `0x0D` (carriage return) characters inside articles, by splitting and rejoining the text.
I summarize the notes here (a sketch follows the list):

- remove the space before "."
- keep the text cased (remove the lowercasing line)
- `"\r"` characters in the original articles cause errors in BPE preprocessing
- remove `"(CNN)"`
- apply BPE encoding

code: https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
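Not the gist itself, but a minimal sketch of the text cleanup steps listed above (the function name is mine; the `"(CNN)"` and `\r` handling follow the notes):

```python
def clean_line(text):
    """Apply the cleanup notes above to one article or abstract string."""
    # The "(CNN)" prefix appears at the start of CNN articles; drop it.
    text = text.replace("(CNN)", "")
    # Embedded \r (ASCII 0x0D) would split one sample across several output
    # lines; split on all whitespace and rejoin with single spaces.
    text = " ".join(text.split())
    # Remove the space that PTB tokenization inserts before a period.
    text = text.replace(" .", ".")
    return text.strip()

# Example: clean_line("(CNN) The story\r continues .") -> "The story continues."
```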
@zhaoguangxiang Thank you!
Here's a version for Python 3 if anyone is interested:
@zhaoguangxiang
Hi, thank you for providing the preprocessing code.
I am looking for the right preprocessing scripts. Can this script reproduce the results mentioned in the paper?
I ran the code on Windows and ran into many encoding problems. After fixing those, I found the dataset format is abnormal, e.g. a strange blank at the head of every line, and a large margin between sentences (for example in line 5).
I forgot my reproduction result. I will reply to you after trying again.
Thank you very much! It will help a lot.
If anyone still has problems with:

- downloading and preprocessing CNN/DM
- evaluating fine-tuned BART on CNN/DM (see the sketch below)

you might want to check my reproduction repository: https://github.com/BaohaoLiao/NLP-reproduction
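For the evaluation half, a minimal sketch using fairseq's hub interface (assuming fairseq and torch are installed; the generation settings follow fairseq's BART README for CNN/DM and are not necessarily this repository's):

```python
import torch

# Load the fine-tuned CNN/DM checkpoint through torch.hub.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.eval()

with open('test.source') as f:
    sources = [line.strip() for line in f]

# Summarize a few articles; bart.sample runs BPE encoding and beam search.
hypotheses = bart.sample(sources[:8], beam=4, lenpen=2.0, max_len_b=140,
                         min_len=55, no_repeat_ngram_size=3)
for hyp in hypotheses:
    print(hyp)
```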
I forgot my reproduction experience.