Develop metric to check how well free text has been chunked into paragraphs
According to this blog post, Pk and WindowDiff are the more popular evaluation metrics for text segmentation (chunking). Both are error rates between 0 and 1, so lower is better. To compute them, I came across a Python library named segeval and plan to use it to evaluate Pk and WindowDiff on our chunked texts. Both metrics rely on ground truth, so we need to decide which ground truth to use. We could also use a dataset that already has labeled ground truth.
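For reference, here is a minimal pure-Python sketch of how the two metrics are usually defined. It is not segeval's API: the function names and the input convention (a list with one 0/1 entry per gap between consecutive text units, where 1 marks a boundary) are assumptions made for illustration, and library implementations may differ in details such as the default window size.

```python
def default_window(reference):
    """Half the mean reference segment length, the usual choice for k."""
    n_units = len(reference) + 1      # one more unit than gaps
    n_segments = sum(reference) + 1   # one more segment than boundaries
    return max(1, round(n_units / (2 * n_segments)))


def pk(reference, hypothesis, k=None):
    """Share of windows where reference and hypothesis disagree on whether
    the two ends of the window fall in the same segment."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of gaps")
    if k is None:
        k = default_window(reference)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_split = any(reference[i:i + k])
        hyp_split = any(hypothesis[i:i + k])
        errors += ref_split != hyp_split
    return errors / n_windows


def window_diff(reference, hypothesis, k=None):
    """Share of windows where the number of boundaries differs."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of gaps")
    if k is None:
        k = default_window(reference)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        errors += sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    return errors / n_windows


# Toy example: 9 units in 3 segments; the first predicted boundary is one gap early.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
hypothesis = [0, 1, 0, 0, 0, 1, 0, 0]
print(pk(reference, hypothesis, k=2), window_diff(reference, hypothesis, k=2))
```

WindowDiff also penalizes windows where the hypothesis has the wrong number of boundaries, which Pk can miss when both segmentations have at least one boundary in the window.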
Some comparatively recent work on this with pre-trained models is here:
https://arxiv.org/abs/2012.03619
GitHub: https://github.com/dennlinger/TopicalChange
HF models:
https://huggingface.co/dennlinger/bert-wiki-paragraphs
https://huggingface.co/dennlinger/roberta-cls-consec
I tried Pk and WindowDiff on the Choi dataset using a cosine similarity model. The results are:
Pk: 0.3883495145631068
WindowDiff: 0.3763888888888889
Here's the link to the Google Colaboratory notebook I created to produce these results: https://colab.research.google.com/drive/1L-fQf1JC0NBwUV6N6Oh04XFPaOqIln0w?usp=sharing#scrollTo=5JIs5wSB1Lq4
This is the current baseline that we need to beat using the models mentioned in @GautamR-Samagra's comment above.
P.S. I used only 100 paragraphs from the Choi dataset (~100K characters) because the model was taking a long time to run on the whole dataset, i.e. ~9200 paragraphs (~10M characters). So the above result is an approximation, not the actual result on the whole dataset.
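For readers unfamiliar with the idea, a cosine similarity chunker can be sketched as follows. This is only an illustration, not the model used in the notebook: the bag-of-words vectors stand in for real sentence embeddings, and `predict_boundaries` and the 0.2 threshold are hypothetical choices.

```python
import math
from collections import Counter


def bow_vector(sentence):
    """Toy stand-in for a sentence embedding: lowercase token counts."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_boundaries(sentences, threshold=0.2):
    """Return one 0/1 flag per gap between consecutive sentences;
    1 means the neighbours are dissimilar enough to start a new chunk."""
    vectors = [bow_vector(s) for s in sentences]
    return [
        int(cosine(vectors[i], vectors[i + 1]) < threshold)
        for i in range(len(sentences) - 1)
    ]


sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply today",
    "stock markets fell again today",
]
print(predict_boundaries(sentences))  # [0, 1, 0]
```

With a real embedding model, only `bow_vector` would change; the threshold would then need tuning on held-out data.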
Dataset: Choi dataset (1500 paragraphs, ~1M characters)
Pk for our method: 0.4059592263460533
Pk for the Jugalbandi API method: 0.4774699424986932
Here's the link to the Google Colab notebook to reproduce the results.
Here's the link to the Choi dataset used for this testing.
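To get ground-truth boundaries out of a Choi-style file, something like the sketch below may help. It assumes the file marks segment breaks with lines of `==========` and has one sentence per line, as in the original Choi release; the copy linked above may be formatted differently. The helper names are hypothetical.

```python
def parse_segments(text, delimiter="=========="):
    """Split a Choi-style document into segments, each a list of sentences."""
    segments, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line == delimiter:
            if current:              # close the segment collected so far
                segments.append(current)
                current = []
        elif line:
            current.append(line)
    if current:                      # file did not end with a delimiter
        segments.append(current)
    return segments


def segments_to_boundaries(segments):
    """One 0/1 flag per gap between consecutive sentences; 1 = segment break."""
    boundaries = []
    for segment in segments:
        boundaries.extend([0] * (len(segment) - 1))
        boundaries.append(1)
    return boundaries[:-1]           # there is no gap after the last sentence


sample = "==========\na\nb\n==========\nc\n==========\nd\ne\nf\n==========\n"
print(segments_to_boundaries(parse_segments(sample)))  # [0, 1, 1, 0, 0]
```

The resulting list can be compared gap by gap with a predicted segmentation of the same sentences.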
Closing this as implemented by @H4R5H1T-007 in a separate repository. @H4R5H1T-007 can you please post the repository link here for future reference?
Here's the repository link for future reference.
https://github.com/H4R5H1T-007/Document-Uploader-tests
https://github.com/Samagra-Development/ai-tools/tree/restructure/src/chunking/MPNet/local
Chunking has been added to the ai-tools repo at the link above.