segment-any-text/wtpsplit

Scoring metric, does definition make sense?


I looked more into the scoring metric and noticed something. You score based on the indices of the predicted sentences. However, if you split two sentences and predict two indices that are both correct (the true indices), say [23, 83], the score is based only on the index 23. Why is that? Because we score the splits: two sentences correspond to one split, so while 23 marks the split, 83 only marks the end of the text. This makes sense in a way ... or maybe not, I am not sure. Because if you think about it, even if the algorithm does not recognize the last symbol as the end of a sentence, it will still produce the index 83, since the last index is just the total length of the text (the cumulative sum over `[len(s) for s in predicted_sentences]`).

Now assume four sentences with the true indices [23, 83, 140, 158], and suppose that for some reason wtpsplit can't recognize the middle split at 83. It would return [23, 140, 158] and a lower F1 score. However, if I input the sentences separately in pairs, as [23, 83] and [140, 158], the F1 score would be 1, because 83 and 158 are never considered for scoring. This makes the score dependent on the number of sentences per input. For example, if I score a dataset by aggregating two lines (each representing a sentence) in a loop, the results would be much better than if I did it with 5 or even 10 lines. There is also a risk of losing data, unless I carry the last sentence of each iteration over into the next. Sorry for the text blob, but maybe you guys know a best practice for such a problem :)
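To make this concrete, here is a toy version of the metric as I understand it (not wtpsplit's actual evaluation code; the function and the exclusion of the final index are just my reading of the behavior above), reproducing the numbers from the example:

```python
# Toy boundary-F1: indices are character offsets of sentence ends; the last
# offset is the end of the text and is excluded from scoring.
def boundary_f1(true_indices, pred_indices):
    true = set(true_indices[:-1])  # drop the end-of-text index
    pred = set(pred_indices[:-1])
    if not true and not pred:
        return 1.0
    tp = len(true & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Whole corpus, split at 83 missed:
print(boundary_f1([23, 83, 140, 158], [23, 140, 158]))  # 0.8

# Same sentences scored in pairs: 83 and 158 are never scored, so both
# pairs come out perfect even though the split at 83 was never tested.
print(boundary_f1([23, 83], [23, 83]))      # 1.0
print(boundary_f1([140, 158], [140, 158]))  # 1.0
```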

Hi, sorry for the late reply.

Also, fyi, it's hard for me to parse text blobs like this; it would be helpful to structure it a bit more.

To answer your questions:

  • Yes, the scoring metric is dependent on the number of sentences, and the ending of the last sentence is not scored.
  • This is a conscious choice. Other libraries like PySBD evaluate by checking whether the boundary is in the correct place within pairs of sentences (as you suggested). This makes the task much easier, and it's not really what you care about in a practical setting, so I chose the corpus-based metric instead.
  • As you mentioned, there are some caveats to scoring the entire corpus at once, such as the average of the scores on two corpora not being the same as the score on their concatenation. So the choice of metric depends on your eval data. Here, the evaluation data is mostly static, so this is completely fine. If you evaluate on frequently changing data, you might want to go with a pair-based metric instead (rough sketch below).
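For illustration, a pair-based evaluation could look roughly like this. This is neither PySBD's nor wtpsplit's actual code; `segment` is a placeholder for any callable that maps a text to a list of sentences, and the whitespace assumption is mine:

```python
from typing import Callable, List

def pair_accuracy(gold_sentences: List[str],
                  segment: Callable[[str], List[str]]) -> float:
    # Score each adjacent pair of gold sentences independently. Only the
    # inner boundary is tested; the end of the pair is trivially a boundary.
    # Assumes gold sentences keep their trailing whitespace, so that
    # concatenating a pair reproduces the original text.
    pairs = list(zip(gold_sentences, gold_sentences[1:]))
    if not pairs:
        return 1.0
    correct = 0
    for first, second in pairs:
        predicted = segment(first + second)
        # Reconstruct the predicted boundary offsets (ends of all but the
        # last predicted sentence) and compare to the single gold boundary.
        offsets, pos = [], 0
        for s in predicted[:-1]:
            pos += len(s)
            offsets.append(pos)
        if offsets == [len(first)]:
            correct += 1
    return correct / len(pairs)
```

With a metric like this, the score no longer depends on how many sentences you feed at once, since each pair is judged in isolation; the trade-off is that the task becomes much easier, which is why the corpus-based metric is used here.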