neubig/nlp-from-scratch-assignment-2022

paragraph_to_sentence.py incorrectly merges some sentences


In some edge cases, https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/scripts/paragraph_to_sentence.py#L54 incorrectly merges two sentences.

For example,
https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/data/anlp-sciner-test.txt#L516-L517
is merged into

The candidate with the highest score is chosen as the correct entity , i.e. Linking to Unseen Knowledge Bases

But the correct output should be two separate sentences? (see https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/data/anlp-sciner-test-empty.conll#L42980)

The candidate with the highest score is chosen as the correct entity , i.e.
Linking to Unseen Knowledge Bases

My output file for the sciner test data is 390 lines shorter than https://github.com/neubig/nlp-from-scratch-assignment-2022/blob/main/data/anlp-sciner-test-empty.conll, possibly because of similar edge cases?
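To illustrate the kind of merge described above, here is a minimal sketch. It assumes each input line is its own paragraph and uses a hypothetical regex splitter (`split_sentences`) as a stand-in for whatever splitter the script actually uses. When the paragraphs are joined before splitting, the trailing "i.e." looks like a mid-sentence abbreviation and the two lines collapse into one; splitting each paragraph separately cannot cross the boundary.

```python
import re

# Hypothetical stand-in for a real sentence splitter: break after '.', '!' or '?'
# followed by whitespace, unless the text so far ends with a known abbreviation.
ABBREVIATIONS = ("i.e.", "e.g.", "et al.")


def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        if candidate.endswith(ABBREVIATIONS):
            continue  # treat as an abbreviation, not a sentence end
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def segment_joined(paragraphs):
    # Joining first lets a sentence run across a paragraph boundary.
    return split_sentences(" ".join(paragraphs))


def segment_per_paragraph(paragraphs):
    # Splitting each paragraph on its own keeps every boundary intact.
    sentences = []
    for paragraph in paragraphs:
        sentences.extend(split_sentences(paragraph))
    return sentences


paragraphs = [
    "The candidate with the highest score is chosen as the correct entity , i.e.",
    "Linking to Unseen Knowledge Bases",
]
print(len(segment_joined(paragraphs)))         # merged into a single sentence
print(len(segment_per_paragraph(paragraphs)))  # kept as two sentences
```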

Good catch. After reviewing, I did find two edge cases like this one, where a sentence was incorrectly combined across a paragraph boundary. We just merged #5 to fix those.

However, your output being shorter than the empty.conll is probably a separate issue: the number of tokens should still be the same, so after running sentence_to_paragraph.py the lengths should match. I would double-check your model's output using the new sentence segmentation script and see if it is still an issue!
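One quick way to tell the two issues apart is to count tokens and sentences separately. The sketch below assumes a CoNLL-style layout with one token per non-blank line and a blank line between sentences; `conll_stats` is a hypothetical helper, not part of the repository. If the token counts match but the sentence counts differ, the gap comes from segmentation; if the token counts differ, something else is dropping tokens.

```python
def conll_stats(lines):
    """Return (token_count, sentence_count) for CoNLL-style lines.

    Assumes one token per non-blank line and blank lines between sentences.
    """
    tokens = sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:  # file did not end with a blank line
        sentences += 1
    return tokens, sentences


# Toy data: same three tokens, but the second file merges the two sentences.
reference = ["a O", "b O", "", "c O", ""]
merged = ["a O", "b O", "c O", ""]
print(conll_stats(reference))
print(conll_stats(merged))
```

To run it on real files, pass an open file handle, e.g. `conll_stats(open(path))`, where `path` points at your output file or at the empty.conll.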

Thanks @tjysdsg! I'll close this because the problem in the title was solved, but please feel free to follow up with @kenzheng99 and me privately if you're still experiencing problems.