Naming convention in tokenized dir
ibarrien opened this issue · 2 comments
Question:
What should the output filenames produced by tokenize_stories() look like?
It seems like write_to_bin() expects hashed names in this directory, which I'm not producing directly from tokenize_stories() (i.e. from PTBTokenizer).
Context:
On Mac OS, using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar.
Example:
If one of the "stories" input file names is "A" then after executing tokenize_stories(), a file called "A" appears in the corresponding tokenized_stories_dir, as opposed to hashhex("A").
It seems PTBTokenizer is working (at least partially) since the tokenized "A" does have, for example, spaces between punctuation marks and -LRB- for left parenthesis.
Outlook:
Specifically, in write_to_bin(), there is
story_fnames = [s+".story" for s in url_hashes]
However, if hashed names are not produced by tokenize_stories(), then a "fix" is
story_fnames = [s + ".story" for s in url_list]
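To make the difference between the two lines concrete, here is a minimal sketch that accepts either naming scheme. It assumes hashhex() is a hex SHA-1 digest of the name (check the definition in your own copy of the script), and story_fnames_for() is a hypothetical helper, not part of the original code.

```python
import hashlib
import os


def hashhex(s):
    """Hex SHA-1 digest of a string (assumed definition; verify against your script)."""
    return hashlib.sha1(s.encode("utf-8")).hexdigest()


def story_fnames_for(names, tokenized_dir, suffix=".story"):
    """Hypothetical helper: prefer the hashed filename if it exists in
    tokenized_dir, otherwise fall back to the raw (unhashed) name."""
    present = set(os.listdir(tokenized_dir))
    fnames = []
    for name in names:
        hashed = hashhex(name) + suffix
        raw = name + suffix
        if hashed in present:
            fnames.append(hashed)
        elif raw in present:
            fnames.append(raw)
        else:
            raise IOError("no tokenized file found for %r" % name)
    return fnames
```

With only unhashed files in the tokenized directory, this returns the same list as the second line above; with hashed files, the same list as the first.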
Are you using the Stanford 2016 Parser or the 2017 Parser? If you want to know what the files look like after tokenization, you can check this repository: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
You can check the links and download the required data for training. Let me know if there is any issue.
Hello. I'm using the 2016 parser, specifically from the "stanford-corenlp-full-2016-10-31" package.
PTBTokenizer does not output hashed filenames, nor does it seem intended to.
Rather, it writes each output file to exactly the path listed in the file passed to the -ioFileList option. E.g., in the following lines from tokenize_stories() there is no hashing; hence, the fix I mentioned above is sufficient in this case.
stories = os.listdir(stories_dir)  # the filenames in stories_dir are not assumed to be hashed
# make IO list file
print "Making list of files to tokenize..."
with open("mapping.txt", "w") as f:
    for s in stories:
        f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s)))
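If hashed output names are wanted without touching write_to_bin(), one alternative is to rename at the mapping step, since the tokenizer writes to whatever output path the list file gives it. The following is only a sketch that reuses the line format from the loop above; write_mapping() and its rename hook are hypothetical names, not part of the original script.

```python
import os


def write_mapping(stories_dir, tokenized_dir, mapping_path, rename=None):
    """Write one 'input \t output' line per story, as in the loop above.

    rename is a hypothetical hook: a function from input filename to output
    filename. With rename=None the output name equals the input name.
    """
    stories = sorted(os.listdir(stories_dir))
    with open(mapping_path, "w") as f:
        for s in stories:
            out_name = rename(s) if rename else s
            f.write("%s \t %s\n" % (os.path.join(stories_dir, s),
                                    os.path.join(tokenized_dir, out_name)))
    return stories
```

Passing a rename function that hashes the name would make the tokenized directory match what write_to_bin() expects; leaving it as None reproduces the current behavior.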