EleanorJiang/BlonDe

Data Error in BWBReader.py

thunder123321 opened this issue · 2 comments

When I run the script "BWBReader.py", it fails with the following error:

(Parrot) root@nlp:~/ParroT-master/BlonDe/BlonDe-main$ python BWB/BWBReader.py
BWB/test_with_annotations
Generating the cached file...
Traceback (most recent call last):
File "BWB/BWBReader.py", line 544, in <module>
for sentences in bwb_reader.dataset_iterator_from_cache(cache_file, dir_path):
File "BWB/BWBReader.py", line 169, in dataset_iterator_from_cache
self.to_cache(dir_path, cache_file)
File "BWB/BWBReader.py", line 530, in to_cache
for sentences in self.dataset_iterator(dir_path):
File "BWB/BWBReader.py", line 212, in dataset_iterator
yield from self.sentence_iterator(chs_path, ref_path)
File "BWB/BWBReader.py", line 268, in sentence_iterator
for chs_document, ref_document in self.dataset_document_iterator(chs_path, ref_path):
File "BWB/BWBReader.py", line 260, in dataset_document_iterator
ref_document.append(self._line_to_BWBsentence(line, "en", document_id, sentence_id))
File "BWB/BWBReader.py", line 414, in _line_to_BWBsentence
k = self._deal_with_ann_span(line, k, mention_stack, quote_stack,
File "BWB/BWBReader.py", line 384, in _deal_with_ann_span
k = self._deal_with_ann_span(line, k, mention_stack, quote_stack,
File "BWB/BWBReader.py", line 384, in _deal_with_ann_span
k = self._deal_with_ann_span(line, k, mention_stack, quote_stack,
File "BWB/BWBReader.py", line 384, in _deal_with_ann_span
k = self._deal_with_ann_span(line, k, mention_stack, quote_stack,
[Previous line repeated 19 more times]
File "BWB/BWBReader.py", line 351, in _deal_with_ann_span
raise RuntimeError(f'the annotated span <{ann_span}> is not followed by a ''. \n'
RuntimeError: the annotated span <O,2> is not followed by a .
document_id: Book0-4, sentence_id: 20

We downloaded "test_with_annotations" from this project and did not modify it. Do you have any suggestions for resolving this error?

Make sure you download the new test_with_annotations tarball in #4. You then have to edit BWB/BWBReader.py to set dir_path to point to BWB/BWB_dataset/test_with_annotations. I made this a command-line argument. After that, I was able to iterate through the dataset (though it's not clear how to score it, since the format is different from what is documented in the top-level blonde README).
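
Turning dir_path into a command-line argument could look roughly like the sketch below. The flag name `--dir-path`, the helper `parse_args`, and the default value are assumptions for illustration; they are not taken from the repository.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the dataset location from the command line (hypothetical flag)."""
    parser = argparse.ArgumentParser(
        description="Iterate over the BWB test set with annotations."
    )
    parser.add_argument(
        "--dir-path",
        type=Path,
        # Assumed default: the location mentioned in the comment above.
        default=Path("BWB/BWB_dataset/test_with_annotations"),
        help="directory containing the extracted test_with_annotations data",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # Fail early with a readable message instead of a deep traceback.
    if not args.dir_path.is_dir():
        raise SystemExit(f"dataset directory not found: {args.dir_path}")
    print(f"reading dataset from {args.dir_path}")
```

The parsed `args.dir_path` would then replace the hard-coded dir_path that the script passes to `dataset_iterator_from_cache(cache_file, dir_path)`, the call shown in the traceback.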

Thanks for your response. I have one more question about this test set. I noticed that in the sample file, each document is split into sentences for evaluation. Our model, however, takes a whole document as input and produces a whole document as output, and that output cannot be automatically segmented into sentences. How can I evaluate it at the document level, without splitting it into sentences?