error writing .bin files after tokenizing
prokopevaleksey opened this issue · 4 comments
Finished tokenizing!
Making bin file for url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
make_datafiles.py:68: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if line[-1] in END_TOKENS: return line
Writing story 1000 of 11490; 8.70 percent done
Traceback (most recent call last):
File "make_datafiles.py", line 182, in <module>
write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.bin"))
File "make_datafiles.py", line 127, in write_to_bin
raise Exception("can't find story file")
Exception: can't find story file
That piece of code is looking for a story file, e.g. something like 0001d1afc246a7964130f43ae940af6bc6c57f01.story, in both cnn_stories_tokenized/ and dm_stories_tokenized/. It raises that exception if it can't find the file in either directory.
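For reference, the lookup is roughly like this (a paraphrase of write_to_bin, not the exact source; directory names as in the script):

```python
import os

cnn_tokenized_stories_dir = "cnn_stories_tokenized"
dm_tokenized_stories_dir = "dm_stories_tokenized"

def find_story_file(story_fname):
    # story_fname is e.g. "0001d1afc246a7964130f43ae940af6bc6c57f01.story";
    # check both tokenized directories and fail if the file is in neither.
    for d in (cnn_tokenized_stories_dir, dm_tokenized_stories_dir):
        path = os.path.join(d, story_fname)
        if os.path.isfile(path):
            return path
    raise Exception("can't find story file")
```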
I'm not sure what's gone wrong here. Do both cnn_stories_tokenized/ and dm_stories_tokenized/ directories exist, and do they contain lots of .story files?
Yes, they contain 92579 and 178541 files respectively. It creates the test.bin file before failing.
During tokenization it also produces weird output like this:
cnn stories dir: cnn/stories/
dm stories dir: dailymail/stories/
Preparing to tokenize cnn/stories/...
Making list of files to tokenize...
Tokenizing cnn/stories/...
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: (U+202A, decimal: 8234)
...
@prokopevaleksey, when I run the code I get 92579 files in cnn_stories_tokenized/ and 219506 files in dm_stories_tokenized/. If you check the number of files in the cnn/stories and dailymail/stories directories you downloaded, you'll find they match these numbers.
I'm not sure why your tokenization phase resulted in some missing dailymail stories, but that must be the problem here.
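If you want to double-check, a quick sketch like this (the relative paths are assumptions based on your output above; adjust them to wherever you unpacked the data) compares the raw and tokenized counts per corpus:

```python
import os

# Each raw stories directory should contain exactly as many .story files
# as its tokenized counterpart; a mismatch means tokenization dropped files.
pairs = [("cnn/stories", "cnn_stories_tokenized"),
         ("dailymail/stories", "dm_stories_tokenized")]

for raw_dir, tok_dir in pairs:
    n_raw = len([f for f in os.listdir(raw_dir) if f.endswith(".story")])
    n_tok = len([f for f in os.listdir(tok_dir) if f.endswith(".story")])
    status = "" if n_raw == n_tok else "  <-- MISMATCH"
    print("%s: %d raw, %d tokenized%s" % (raw_dir, n_raw, n_tok, status))
```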
The Untokenizable: ... messages you get are Stanford CoreNLP complaining about some unicode characters, I think. For me, they don't prevent the script producing the correct number of tokenized files in cnn_stories_tokenized/ and dm_stories_tokenized/.
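If you want to see which characters trigger those messages, something like this (a hypothetical helper, not part of make_datafiles.py) prints every non-ASCII codepoint in a story file in the same U+/decimal notation CoreNLP uses:

```python
import unicodedata

def show_non_ascii(story_path):
    # Print each distinct non-ASCII character with its codepoint and name,
    # mirroring CoreNLP's "Untokenizable: ... (U+20A9, decimal: 8361)" lines.
    with open(story_path, "rb") as f:
        text = f.read().decode("utf-8", errors="replace")
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "<unnamed>")
            print("U+%04X, decimal: %d  %s" % (ord(ch), ord(ch), name))
```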
It was a glitch on my side. Thanks a lot. You're awesome!