abisee/cnn-dailymail

error while running make_datafiles.py

97yogitha opened this issue · 11 comments

@abisee This is the error I get when I run the command python make_datafiles.py cnn/stories dailymail/stories:

Preparing to tokenize cnn/stories to cnn_stories_tokenized...
Making list of files to tokenize...
Tokenizing 92579 files in cnn/stories and saving in cnn_stories_tokenized...
Exception in thread "main" java.io.IOException: Stream closed
	at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:116)
	at java.io.BufferedWriter.write(BufferedWriter.java:221)
	at java.io.Writer.write(Writer.java:157)
	at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:505)
	at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:450)
	at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:813)
Stanford CoreNLP Tokenizer has finished.
Traceback (most recent call last):
  File "make_datafiles.py", line 235, in <module>
    tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir)
  File "make_datafiles.py", line 86, in tokenize_stories
    raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
Exception: The tokenized stories directory cnn_stories_tokenized contains 1 files, but it should contain the same number as cnn/stories (which has 92579 files). Was there an error during tokenization?

Please let me know: are you using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar or the 2017 release? This error usually occurs when you are not using stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar. Please check.
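One quick sanity check, sketched as a hypothetical helper (not part of the repo): make_datafiles.py invokes the tokenizer through the CLASSPATH environment variable, so you can parse the jar name on your CLASSPATH to confirm it points at the 3.7.0 jar before running the script.

```python
import os
import re

def corenlp_jar_version(classpath):
    """Extract the CoreNLP version from a jar path like
    stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar."""
    match = re.search(r"stanford-corenlp-(\d+\.\d+\.\d+)\.jar", classpath)
    return match.group(1) if match else None

# Check whatever is currently set (may be unset in a fresh shell).
version = corenlp_jar_version(os.environ.get("CLASSPATH", ""))
if version != "3.7.0":
    print("Warning: expected CoreNLP 3.7.0 on CLASSPATH, found %s" % version)
```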

I have already created the processed files; you can use them without any issue. Here is the link: https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail
Use Python 2.7.

@JafferWilson Yes, I am using stanford-corenlp-full-2017-09-0/stanford-corenlp-3.8.0.jar. I will use the processed files.

@97yogitha No, do not use the 2017 one; use the 2016 one, which is mentioned in the README file of the repository.

@JafferWilson Thanks for the help. I used 3.7.0 from https://stanfordnlp.github.io/CoreNLP/history.html and it worked.

Thanks very much. I ran into this problem today with the newest version, 3.8.0; after I switched to 3.7.0, it worked.

Could someone please close this issue?

@JafferWilson Could you help with running the neural network on our own data? How do we generate .bin files for our articles?

I have a clear idea about tokenization, but what about the URL mapping? How is that done?

Hi @Sharathnasa
You can clone the repository below:
https://github.com/dondon2475848/make_datafiles_for_pgn
Run

python make_datafiles.py  ./stories  ./output

It processes your test data into the binary format.
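For reference, the .bin files these scripts produce are length-prefixed records: an 8-byte length (assumed little-endian here) followed by a serialized tf.Example. A minimal sketch of just the framing, using plain struct and placeholder bytes instead of TensorFlow:

```python
import struct

def write_records(path, serialized_examples):
    # Each record: 8-byte length prefix, then the serialized bytes
    # (in the real pipeline these are serialized tf.Example protos).
    with open(path, "wb") as f:
        for ex in serialized_examples:
            f.write(struct.pack("<q", len(ex)))
            f.write(ex)

def read_records(path):
    # Read records back by repeatedly consuming a length header
    # and then that many payload bytes.
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<q", header)
            records.append(f.read(length))
    return records
```

This only shows the file framing; building the actual tf.Example (with the article and abstract fields) is handled by the make_datafiles scripts themselves.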

Check the subprocess.call(command) step: set the classpath with os.environ["CLASSPATH"]='stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar', then run it again.
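Putting that together, here is a sketch of the relevant step. The jar path is an assumption (adjust it to wherever you unpacked CoreNLP), and the flags shown are the ones make_datafiles.py passes to PTBTokenizer with a "mapping.txt" file listing input and output paths:

```python
import os
import shutil
import subprocess

# Tell the JVM where the 3.7.0 jar lives before spawning the tokenizer.
# Path is an assumption -- point it at your own CoreNLP download.
os.environ["CLASSPATH"] = "stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar"

# The tokenizer reads an input->output file mapping and tokenizes
# each story file listed in it.
command = ["java", "edu.stanford.nlp.process.PTBTokenizer",
           "-ioFileList", "-preserveLines", "mapping.txt"]

# Only invoke the JVM if java is actually on PATH.
if shutil.which("java"):
    subprocess.call(command)
```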