paragraph_to_sentence.py incorrectly merges some sentences
In some edge cases, https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/scripts/paragraph_to_sentence.py#L54 incorrectly merges two sentences.
For example,
https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/data/anlp-sciner-test.txt#L516-L517
is merged into
The candidate with the highest score is chosen as the correct entity , i.e. Linking to Unseen Knowledge Bases
But shouldn't the correct output be two separate sentences (see https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/data/anlp-sciner-test-empty.conll#L42980)?
The candidate with the highest score is chosen as the correct entity , i.e.
Linking to Unseen Knowledge Bases
My output file for the sciner test data is 390 lines shorter than https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/data/anlp-sciner-test-empty.conll, possibly because of similar edge cases?
Good catch. After reviewing, I found two edge cases like this one where sentences were incorrectly combined across a paragraph boundary. We just merged #5 to fix those.
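For anyone reading along, here is a minimal sketch of the idea of never merging across a paragraph boundary. It is not the change merged in #5. It assumes each input line is one paragraph, and `split_paragraph` is a hypothetical, deliberately naive regex splitter standing in for whatever segmenter the script actually uses.

```python
import re
from typing import List


def split_paragraph(paragraph: str) -> List[str]:
    # Hypothetical stand-in splitter: break after '.', '!' or '?'
    # when followed by whitespace. A real segmenter would be smarter.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def paragraphs_to_sentences(text: str) -> List[str]:
    # Segment each paragraph (assumed: one per line) on its own, so a
    # sentence can never continue into the next paragraph.
    sentences: List[str] = []
    for paragraph in text.splitlines():
        sentences.extend(split_paragraph(paragraph))
    return sentences


example = (
    "The candidate with the highest score is chosen as the correct entity , i.e.\n"
    "Linking to Unseen Knowledge Bases"
)
for sentence in paragraphs_to_sentences(example):
    print(sentence)
```

Because the splitter only ever sees one paragraph at a time, the trailing "i.e." cannot pull the following heading into the same sentence.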
However, your output being shorter than the empty.conll is probably a separate issue: the number of tokens should still be the same, so running sentence_to_paragraph.py should produce matching lengths. I would double-check your model using the new sentence segmentation script and see whether it is still an issue!
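One quick way to check this is to compare token and sentence counts separately. The sketch below assumes a CoNLL-style layout with one token per non-blank line and blank lines between sentences; `conll_stats` is a hypothetical helper, not part of the repository.

```python
from typing import Iterable, Tuple


def conll_stats(lines: Iterable[str]) -> Tuple[int, int]:
    # Returns (token_count, sentence_count), assuming one token per
    # non-blank line and blank lines as sentence separators.
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:
        # The last sentence may not be followed by a blank line.
        sentences += 1
    return tokens, sentences


# Merging two sentences removes a separator but keeps every token.
split_version = ["a O", "b O", "", "c O"]
merged_version = ["a O", "b O", "c O"]
print(conll_stats(split_version))
print(conll_stats(merged_version))
```

Under that assumption, a merged sentence changes only the sentence count. If the token counts of the two files differ, the mismatch comes from somewhere other than sentence segmentation.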
Thanks @tjysdsg! I'll close this because the problem in the title was solved, but please feel free to follow up with @kenzheng99 and me privately if you're still experiencing problems.