Samagra-Development/ai-tools

Develop metric to check how well free text has been chunked into paragraphs

Closed this issue · 8 comments


From this blog post, we can see that the evaluation metrics Pk and WindowDiff are the more popular choices for text segmentation (chunking). To compute these metrics, I came across a Python library named SegEval and plan to use it to evaluate Pk and WindowDiff on our chunked texts. These metrics rely on ground truth, so we need to decide what ground truth to use. We could also use a dataset that already has labeled ground truth.
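For reference, both metrics compare a hypothesized segmentation against a reference by sliding a window of size k over the text. This is a minimal pure-Python sketch of the two metrics (SegEval and NLTK provide ready-made implementations; the boundary-string encoding used here, e.g. "0100" meaning a boundary after the second unit, is an assumption for illustration):

```python
def pk(ref, hyp, k=None):
    """Pk: fraction of sliding windows in which ref and hyp disagree on
    whether the window contains a segment boundary.
    ref/hyp are boundary strings, e.g. "0100" = boundary after unit 2."""
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, int(round(len(ref) / (2.0 * max(1, ref.count("1"))))))
    windows = len(ref) - k + 1
    err = sum(
        ("1" in ref[i:i + k]) != ("1" in hyp[i:i + k])
        for i in range(windows)
    )
    return err / windows

def windowdiff(ref, hyp, k):
    """WindowDiff: fraction of windows in which ref and hyp disagree on
    the *number* of boundaries, which penalizes near-misses less harshly
    than Pk."""
    windows = len(ref) - k + 1
    wd = sum(
        ref[i:i + k].count("1") != hyp[i:i + k].count("1")
        for i in range(windows)
    )
    return wd / windows
```

Lower is better for both metrics; a perfect segmentation scores 0.0.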

I have tried Pk and WindowDiff on the Choi dataset using a cosine-similarity model; the results are:
Pk: 0.3883495145631068
WindowDiff: 0.3763888888888889
Here's the link to the Google Colab notebook I created to produce these results: https://colab.research.google.com/drive/1L-fQf1JC0NBwUV6N6Oh04XFPaOqIln0w?usp=sharing#scrollTo=5JIs5wSB1Lq4
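The exact cosine-similarity model is in the notebook linked above; as a generic sketch of the idea, a boundary can be placed wherever the cosine similarity between adjacent sentence embeddings drops below a threshold (the `embeddings` input and the threshold value here are illustrative placeholders, not the notebook's actual parameters):

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def segment(embeddings, threshold=0.5):
    """Emit a boundary string: '1' after sentence i if the similarity
    between sentence i and i+1 falls below the threshold, else '0'."""
    return "".join(
        "1" if cosine(embeddings[i], embeddings[i + 1]) < threshold else "0"
        for i in range(len(embeddings) - 1)
    )
```

The resulting boundary string can then be scored against the Choi ground truth with Pk or WindowDiff.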

This is the current benchmark, which we need to beat using the different models mentioned in @GautamR-Samagra's comment above.

P.S. I have only used 100 paragraphs from the Choi dataset (~100K characters) because the model was taking a long time to run on the whole dataset of ~9,200 paragraphs (~10M characters). So the above result is an approximation, not the actual result on the whole dataset.

Dataset: Choi dataset (1,500 paragraphs, ~1M characters)
Pk value for our method: 0.4059592263460533
Pk value for the Jugalbandi API method: 0.4774699424986932
Here's the link to the Google Colab to reproduce the results.
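To score two chunking methods against the same ground truth, each method's output has to be converted into a common representation first. A hypothetical helper (the chunks-of-sentences input format is an assumption) that turns a list of chunks into the boundary-string form the metrics consume:

```python
def boundary_string(chunks):
    """Each chunk is a list of sentences; emit '1' at positions where a
    chunk ends (i.e. a boundary between consecutive sentences), '0'
    elsewhere. The final position is dropped: there is no boundary
    after the last sentence."""
    bits = []
    for chunk in chunks:
        bits.extend("0" * (len(chunk) - 1))
        bits.append("1")
    bits.pop()  # no boundary after the final sentence
    return "".join(bits)
```

With both methods' outputs and the Choi reference encoded this way, the same Pk call yields directly comparable scores.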

Closing this as implemented by @H4R5H1T-007 in a separate repository. @H4R5H1T-007, can you please comment the repository link here for future reference?

Here's the link to the repository for future reference.
https://github.com/H4R5H1T-007/Document-Uploader-tests