Cannot generate WSJ data
FilippoC opened this issue · 1 comment
FilippoC commented
Hello,
I am trying to generate the WSJ data.
I installed Python 3.7 in order to be able to install pytokenizations 3.7.
However, when calling ./build_corpus.sh I get the following error:
% ./build_corpus.sh
Traceback (most recent call last):
  File "recover_whitespace.py", line 295, in <module>
    write_to_file(args.treebank3_root, args.treebank3_root, train_splits, 'train_02-21.LDC99T42', 'train_02-21.LDC99T42.text')
  File "recover_whitespace.py", line 263, in write_to_file
    words_and_whitespace = get_words_and_whitespace(treebank3_root, splits, [tree_file])
  File "recover_whitespace.py", line 162, in get_words_and_whitespace
    raw_sents = get_raw_text_for_trees(treebank_root, splits, tree_files)
  File "recover_whitespace.py", line 84, in get_raw_text_for_trees
    line = next(line_iter)
StopIteration
Note that I have the correct WSJ versions (both datasets), and many sentences are preprocessed before this error is triggered:
% wc -l train_02-21.LDC2015T13
39831 train_02-21.LDC2015T13
% wc -l train_02-21.LDC99T42
39832 train_02-21.LDC99T42
Any idea why I get this error?
Thanks.
YikunHan42 commented
Hi,
May I ask if you have solved this problem? I'm having the same problem now.
Thanks.
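
A note for anyone debugging this: the traceback ends in a bare next(line_iter), and Python raises StopIteration from next() when the iterator has no items left, so one of the inputs apparently ran out of lines before the script expected it to. The sketch below is not taken from recover_whitespace.py; next_or_report and its parameters are hypothetical names, and it only shows one way such a call could be wrapped to report which input was exhausted and after how many lines.

```python
def next_or_report(line_iter, source_name, consumed):
    """Return the next item from line_iter.

    Hypothetical helper (not part of recover_whitespace.py): instead of
    letting a bare StopIteration escape, raise an error that names the
    exhausted input and how many lines had been read from it.
    """
    sentinel = object()  # unique marker that no real line can equal
    line = next(line_iter, sentinel)  # two-argument next() returns the default when exhausted
    if line is sentinel:
        raise RuntimeError(
            f"{source_name} ran out of lines after {consumed} lines were read"
        )
    return line


if __name__ == "__main__":
    # Tiny in-memory demonstration; the data here is made up.
    lines = iter(["first line\n"])
    print(repr(next_or_report(lines, "raw text", 0)))
    try:
        next_or_report(lines, "raw text", 1)
    except RuntimeError as err:
        print(err)
```

Temporarily swapping a helper like this in around line 84 would show how far the script gets before the input is exhausted, which may help relate the failure to the one-line difference in the wc -l counts above.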