nikitakit/self-attentive-parser

Cannot generate WSJ data

FilippoC opened this issue · 1 comment

Hello,

I am trying to generate the WSJ data.

I installed Python 3.7 in order to be able to install pytokenizations 3.7.

However, when calling ./build_corpus.sh I get the following error:

% ./build_corpus.sh
Traceback (most recent call last):
  File "recover_whitespace.py", line 295, in <module>
    write_to_file(args.treebank3_root, args.treebank3_root, train_splits, 'train_02-21.LDC99T42', 'train_02-21.LDC99T42.text')
  File "recover_whitespace.py", line 263, in write_to_file
    words_and_whitespace = get_words_and_whitespace(treebank3_root, splits, [tree_file])
  File "recover_whitespace.py", line 162, in get_words_and_whitespace
    raw_sents = get_raw_text_for_trees(treebank_root, splits, tree_files)
  File "recover_whitespace.py", line 84, in get_raw_text_for_trees
    line = next(line_iter)
StopIteration
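
For context, `StopIteration` is what Python's built-in `next()` raises when an iterator has no items left and no default is given, so the traceback indicates that `line_iter` ran out of items earlier than the script expected. Below is a minimal sketch of that behaviour and of one way to turn it into a more descriptive error. The helper `next_line_or_fail` and the sample data are hypothetical, for illustration only, and are not taken from `recover_whitespace.py`.

```python
def next_line_or_fail(line_iter, context):
    """Return the next item from line_iter, or raise a descriptive error.

    Hypothetical helper for illustration; not part of recover_whitespace.py.
    """
    sentinel = object()
    # Passing a default to next() suppresses StopIteration on exhaustion.
    line = next(line_iter, sentinel)
    if line is sentinel:
        raise ValueError(f"ran out of lines while reading {context}")
    return line


if __name__ == "__main__":
    lines = iter(["first line", "second line"])  # made-up sample data
    print(next_line_or_fail(lines, "sample"))
    print(next_line_or_fail(lines, "sample"))
    try:
        # The iterator is exhausted here, so the helper raises ValueError.
        next_line_or_fail(lines, "sample")
    except ValueError as err:
        print(err)
```

A message that names the file or split being read would make it easier to see which input ended early.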

Note that I have the correct WSJ versions (both datasets), and many sentences are preprocessed before this error is triggered:

% wc -l train_02-21.LDC2015T13
   39831 train_02-21.LDC2015T13
% wc -l train_02-21.LDC99T42
   39832 train_02-21.LDC99T42

Any idea why I get this error?

Thanks.

Hi,
May I ask if you have solved this problem? I'm having the same problem now.
Thanks.