facebookresearch/fairseq

Evaluating BART on CNN/DM: How to process the dataset

astariul opened this issue · 13 comments

From the README of BART for reproducing CNN/DM results:

Follow instructions here to download and process into data-files such that test.source and test.target have one line for each non-tokenized sample.

After following the instructions, I don't have files like test.source and test.target...

Instead, I have test.bin and chunked versions of this file
(chunked/test_000.bin ~ chunked/test_011.bin).


How can I process test.bin into test.source and test.target?

@ngoyal2707 @yinhanliu

Thanks for the interest. You need to remove https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L235
and comment out all the TensorFlow code in the write_to_bin function.
You need to keep the raw data untokenized, since the GPT-2 BPE operates on non-tokenized text.

Note
I also had to modify this line:

https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L145

In order to remove <s> and </s> from the target file.
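
In that commit, the line builds the abstract by wrapping every highlight sentence in the script's SENTENCE_START/SENTENCE_END tags; joining the highlights plainly drops the tags from the target file. A minimal sketch of the change:

# before:
# abstract = ' '.join(["%s %s %s" % (SENTENCE_START, sent, SENTENCE_END) for sent in highlights])
# after: plain join, no <s>/</s> tags in the target
abstract = ' '.join(highlights)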

Note 2

To get better results, I also had to keep the text cased. To do this, I removed this line:

https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L122
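
That line is the lowercasing step near the top of get_art_abs; deleting it keeps the original casing. A sketch (read_text_file and fix_missing_period are helpers already defined in the script):

lines = read_text_file(story_file)
# The script lowercased everything here; removing this line keeps the text cased:
# lines = [line.lower() for line in lines]
lines = [fix_missing_period(line) for line in lines]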

I followed these instructions but I'm getting .bin files instead of .source and .target files. Am I missing something? I'm also trying to reproduce these results.

I modified the write_to_bin function to the following. Is this the correct data format?

import collections  # both already imported at the top of make_datafiles.py
import os

def write_to_bin(url_file, out_file, makevocab=False):
  """Reads the .story files corresponding to the urls listed in url_file and
  writes one non-tokenized sample per line to out_file.source / out_file.target."""
  print("Making data files for URLs listed in %s..." % url_file)
  url_list = read_text_file(url_file)
  url_hashes = get_url_hashes(url_list)
  story_fnames = [s + ".story" for s in url_hashes]
  num_stories = len(story_fnames)

  if makevocab:
    vocab_counter = collections.Counter()  # vocab counting from the original script; unused in this modified version

  # Text mode ('w') rather than binary, since we write plain strings
  with open('%s.target' % out_file, 'w') as target_file, \
       open('%s.source' % out_file, 'w') as source_file:
    for idx, s in enumerate(story_fnames):
      if idx % 1000 == 0:
        print("Writing story %i of %i; %.2f percent done" % (
            idx, num_stories, float(idx) * 100.0 / float(num_stories)))

      # Look in the story dirs to find the .story file corresponding to this url
      if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)):
        story_file = os.path.join(cnn_tokenized_stories_dir, s)
      elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)):
        story_file = os.path.join(dm_tokenized_stories_dir, s)
      else:
        print("Error: Couldn't find story file %s in either story directory %s or %s. Was there an error during preprocessing?" % (
            s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir))
        # Check again whether the story directories contain the correct number of files
        print("Checking that the story directories %s and %s contain the correct number of files..." % (
            cnn_tokenized_stories_dir, dm_tokenized_stories_dir))
        check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories)
        check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories)
        raise Exception("Story directories %s and %s contain the correct number of files, but story file %s was found in neither." % (
            cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s))

      # Get the article and abstract strings and write one sample per line
      article, abstract = get_art_abs(story_file)
      target_file.write(abstract + '\n')
      source_file.write(article + '\n')
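
With this version out_file acts as a prefix rather than a .bin path, so (assuming the call sites in the script's __main__ are updated accordingly to drop the .bin extension) each split yields the .source/.target pair BART expects:

write_to_bin(all_train_urls, os.path.join(finished_files_dir, "train"), makevocab=True)
write_to_bin(all_val_urls, os.path.join(finished_files_dir, "val"))
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test"))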

There are many details; here is my code.

I fixed train.bpe.source ending up with too many lines, caused by stray ASCII 0x0D ('\r') characters inside articles, by splitting and re-joining the text.

To summarize the notes (a consolidated sketch follows the link below):

  1. Remove the space before "."
  2. Keep the text cased: remove the lowercasing line
  3. "\r" characters in the original articles break the BPE preprocessing
  4. Remove "(CNN)"
  5. Apply GPT-2 BPE encoding

Code: https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
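
If you don't want to dig through the gist, here is a minimal sketch of the per-line cleanup that notes 1-4 describe (the function name clean_line is mine, and the gist's actual implementation may differ):

def clean_line(line):
  # Note 3: carriage returns ('\r', ASCII 0x0D) inside articles add spurious
  # line breaks and inflate the line count of the BPE output; splitting on
  # whitespace and re-joining flattens them into single spaces.
  line = ' '.join(line.split())
  # Note 4: drop the "(CNN)" prefix that opens many CNN articles.
  line = line.replace('(CNN)', '')
  # Note 1: remove any space left before periods.
  line = line.replace(' .', '.')
  # Note 2: no lowercasing -- the text stays cased.
  return line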

Here's a version for Python 3 if anyone is interested:

https://github.com/artmatsak/cnn-dailymail

@zhaoguangxiang
Hi, thank you for providing the preprocessing code.
I am looking for the right preprocessing scripts. Can this script reproduce the results mentioned in the paper?
I ran the code on Windows and ran into many encoding problems. After fixing those, I found the dataset format is abnormal,
e.g. a strange blank at the start of every line, and large gaps between sentences (for example in line 5).


I've forgotten my reproduction results. I will reply to you after trying again.


Thank you very much! It will help a lot.

If anyone still has problems with:

  1. downloading and preprocessing CNN/DM
  2. evaluating fine-tuned BART on CNN/DM

you might want to check my reproduction repository: https://github.com/BaohaoLiao/NLP-reproduction. A rough sketch of step 2 follows.
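
For step 2, the evaluation loop from the fairseq BART README looks roughly like this; checkpoints/, checkpoint_best.pt, and cnn_dm-bin below are placeholders for your own fine-tuned checkpoint and binarized data:

import torch
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='cnn_dm-bin'
)
bart.cuda()
bart.eval()
bart.half()

batch = []
with open('cnn_dm/test.source') as source, open('test.hypo', 'w') as fout:
    for sline in source:
        batch.append(sline.strip())
        if len(batch) == 32:
            # Beam settings reported for CNN/DM summarization
            with torch.no_grad():
                hypos = bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                                    min_len=55, no_repeat_ngram_size=3)
            for hypo in hypos:
                fout.write(hypo + '\n')
            batch = []
    if batch:
        with torch.no_grad():
            hypos = bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                                min_len=55, no_repeat_ngram_size=3)
        for hypo in hypos:
            fout.write(hypo + '\n')

The generated test.hypo can then be scored against test.target with your ROUGE tool of choice.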


Sorry, I've forgotten the details of my reproduction run.