error writing .bin files after tokenizing
prokopevaleksey opened this issue · 4 comments
Finished tokenizing!
Making bin file for url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
make_datafiles.py:68: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if line[-1] in END_TOKENS: return line
Writing story 1000 of 11490; 8.70 percent done
Traceback (most recent call last):
File "make_datafiles.py", line 182, in <module>
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 127, in write_to_bin
raise Exception("can't find story file")
Exception: can't find story file
That piece of code is looking for a story file, e.g. something like 0001d1afc246a7964130f43ae940af6bc6c57f01.story, in both cnn_stories_tokenized/ and dm_stories_tokenized/. It raises that exception if it can't find the file in either directory.
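For reference, the lookup is roughly like this (a paraphrase of write_to_bin, not the exact source; directory names as in the script):

```python
import os

cnn_tokenized_stories_dir = "cnn_stories_tokenized"
dm_tokenized_stories_dir = "dm_stories_tokenized"

def find_story_file(story_fname):
    # story_fname is e.g. "0001d1afc246a7964130f43ae940af6bc6c57f01.story";
    # check both tokenized directories and fail if the file is in neither.
    for d in (cnn_tokenized_stories_dir, dm_tokenized_stories_dir):
        path = os.path.join(d, story_fname)
        if os.path.isfile(path):
            return path
    raise Exception("can't find story file")
```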
I'm not sure what's gone wrong here. Do both cnn_stories_tokenized/ and dm_stories_tokenized/ directories exist, and do they contain lots of .story files?
Yes, they contain 92579 and 178541 files respectively. It creates the test.bin file before failing.
During tokenization it also produces weird output like this:
cnn stories dir: cnn/stories/
dm stories dir: dailymail/stories/
Preparing to tokenize cnn/stories/...
Making list of files to tokenize...
Tokenizing cnn/stories/...
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: (U+202A, decimal: 8234)
...
@prokopevaleksey, when I run the code I get 92579 files in cnn_stories_tokenized/ and 219506 files in dm_stories_tokenized/. If you check the number of files in the cnn/stories and dailymail/stories directories you downloaded, you'll find they match these numbers.
I'm not sure why your tokenization phase resulted in some missing dailymail stories, but that must be the problem here.
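If you want to double-check, a quick sketch like this (the relative paths are assumptions based on your output above; adjust them to wherever you unpacked the data) compares the raw and tokenized counts per corpus:

```python
import os

# Each raw stories directory should contain exactly as many .story files
# as its tokenized counterpart; a mismatch means tokenization dropped files.
pairs = [("cnn/stories", "cnn_stories_tokenized"),
         ("dailymail/stories", "dm_stories_tokenized")]

for raw_dir, tok_dir in pairs:
    n_raw = len([f for f in os.listdir(raw_dir) if f.endswith(".story")])
    n_tok = len([f for f in os.listdir(tok_dir) if f.endswith(".story")])
    status = "" if n_raw == n_tok else "  <-- MISMATCH"
    print("%s: %d raw, %d tokenized%s" % (raw_dir, n_raw, n_tok, status))
```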
The Untokenizable: ... messages you get are Stanford CoreNLP complaining about some unicode characters, I think. For me, they don't prevent the script producing the correct number of tokenized files in cnn_stories_tokenized/ and dm_stories_tokenized/.
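If you want to see which characters trigger those messages, something like this (a hypothetical helper, not part of make_datafiles.py) prints every non-ASCII codepoint in a story file in the same U+/decimal notation CoreNLP uses:

```python
import unicodedata

def show_non_ascii(story_path):
    # Print each distinct non-ASCII character with its codepoint and name,
    # mirroring CoreNLP's "Untokenizable: ... (U+20A9, decimal: 8361)" lines.
    with open(story_path, "rb") as f:
        text = f.read().decode("utf-8", errors="replace")
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "<unnamed>")
            print("U+%04X, decimal: %d  %s" % (ord(ch), ord(ch), name))
```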
It was a glitch on my side. Thanks a lot. You're awesome!