Samagra-Development/ai-tools

Develop metric to check how well free text has been chunked into paragraphs

Closed this issue · 8 comments


From this blog post, we can see that the evaluation metrics Pk and WindowDiff are the more popular choices for text segmentation (chunking). To compute these metrics, I came across a Python library named SegEval and plan to use it to evaluate Pk and WindowDiff on our chunked texts. These metrics rely on ground truth, so we need to decide what ground truth to use. We could also use a dataset that already has labeled ground truth.
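For reference, both metrics compare a hypothesized segmentation against a reference by sliding a window of size k over the text. This is a minimal pure-Python sketch of the two metrics (SegEval and NLTK provide ready-made implementations; the boundary-string encoding used here, e.g. "0100" meaning a boundary after the second unit, is an assumption for illustration):

```python
def pk(ref, hyp, k=None):
    """Pk: fraction of sliding windows in which ref and hyp disagree on
    whether the window contains a segment boundary.
    ref/hyp are boundary strings, e.g. "0100" = boundary after unit 2."""
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, int(round(len(ref) / (2.0 * max(1, ref.count("1"))))))
    windows = len(ref) - k + 1
    err = sum(
        ("1" in ref[i:i + k]) != ("1" in hyp[i:i + k])
        for i in range(windows)
    )
    return err / windows

def windowdiff(ref, hyp, k):
    """WindowDiff: fraction of windows in which ref and hyp disagree on
    the *number* of boundaries, which penalizes near-misses less harshly
    than Pk."""
    windows = len(ref) - k + 1
    wd = sum(
        ref[i:i + k].count("1") != hyp[i:i + k].count("1")
        for i in range(windows)
    )
    return wd / windows
```

Lower is better for both metrics; a perfect segmentation scores 0.0.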

I have tried Pk and WindowDiff on the Choi dataset using a cosine-similarity model; the results are:
Pk: 0.3883495145631068
WindowDiff: 0.3763888888888889
Here's the link to the Google Colab notebook I created to produce these results: https://colab.research.google.com/drive/1L-fQf1JC0NBwUV6N6Oh04XFPaOqIln0w?usp=sharing#scrollTo=5JIs5wSB1Lq4
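The exact cosine-similarity model is in the notebook linked above; as a generic sketch of the idea, a boundary can be placed wherever the cosine similarity between adjacent sentence embeddings drops below a threshold (the `embeddings` input and the threshold value here are illustrative placeholders, not the notebook's actual parameters):

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def segment(embeddings, threshold=0.5):
    """Emit a boundary string: '1' after sentence i if the similarity
    between sentence i and i+1 falls below the threshold, else '0'."""
    return "".join(
        "1" if cosine(embeddings[i], embeddings[i + 1]) < threshold else "0"
        for i in range(len(embeddings) - 1)
    )
```

The resulting boundary string can then be scored against the Choi ground truth with Pk or WindowDiff.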

This is the current benchmark, which we need to beat using the different models mentioned in @GautamR-Samagra's comment above.

P.S. I have only used 100 paragraphs from the Choi dataset (~100K characters) because the model was taking a long time to run on the whole dataset of ~9,200 paragraphs (~10M characters). So the above result is an approximation, not the actual result on the whole dataset.

Dataset: Choi dataset (1,500 paragraphs, ~1M characters)
Pk value for our method: 0.4059592263460533
Pk value for the Jugalbandi API method: 0.4774699424986932
Here's the link to the Google Colab to reproduce the results.
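To score two chunking methods against the same ground truth, each method's output has to be converted into a common representation first. A hypothetical helper (the chunks-of-sentences input format is an assumption) that turns a list of chunks into the boundary-string form the metrics consume:

```python
def boundary_string(chunks):
    """Each chunk is a list of sentences; emit '1' at positions where a
    chunk ends (i.e. a boundary between consecutive sentences), '0'
    elsewhere. The final position is dropped: there is no boundary
    after the last sentence."""
    bits = []
    for chunk in chunks:
        bits.extend("0" * (len(chunk) - 1))
        bits.append("1")
    bits.pop()  # no boundary after the final sentence
    return "".join(bits)
```

With both methods' outputs and the Choi reference encoded this way, the same Pk call yields directly comparable scores.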

Closing this as implemented by @H4R5H1T-007 in a separate repository. @H4R5H1T-007, can you please comment the repository link here for future reference?

Here's the link to the repository for future reference.
https://github.com/H4R5H1T-007/Document-Uploader-tests