- A space to easily summarize long documents and download the summary.
- One can also play with the model parameters to get the best results (a minimal interface sketch is shown below).
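A minimal sketch of such an interface, assuming a Hugging Face summarization pipeline. The model choice, parameter ranges, and default values here are illustrative, not the exact app configuration:

```python
import gradio as gr
from transformers import pipeline

# Illustrative model choice; swap in the fine-tuned checkpoint as needed.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text, max_length, min_length):
    # Generate an abstractive summary with user-tunable length bounds.
    result = summarizer(text, max_length=int(max_length), min_length=int(min_length))
    return result[0]["summary_text"]

demo = gr.Interface(
    fn=summarize,
    inputs=[
        gr.Textbox(lines=15, label="Document"),
        gr.Slider(64, 512, value=142, step=1, label="max_length"),
        gr.Slider(10, 128, value=56, step=1, label="min_length"),
    ],
    outputs=gr.Textbox(label="Summary"),
)

if __name__ == "__main__":
    demo.launch()
```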
- I have performed extractive summarization followed by abstractive summarization on the SciSummNet dataset to summarize long documents.
- EDA is performed on the dataset and demonstrated in the notebook.
- The dataset is a collection of scientific papers from SciSummNet.
- The dataset contains 1000 papers in total.
- The dataset stores its text in XML format, so I used Python's built-in xml library to extract the text from the XML files.
- Removed punctuation, stopwords, special characters, numbers, and unnecessary spaces (a combined sketch of these preprocessing steps follows this list).
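A combined sketch of the extraction and cleaning steps above. The XML tag layout is an assumption about the SciSummNet files, and the exact regexes used in the notebook may differ:

```python
import re
import xml.etree.ElementTree as ET
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def extract_text(xml_path):
    # Gather the text of every element in the paper's XML tree.
    root = ET.parse(xml_path).getroot()
    return " ".join(el.text.strip() for el in root.iter() if el.text and el.text.strip())

def clean_text(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # punctuation, numbers, special chars
    text = re.sub(r"\s+", " ", text).strip()  # collapse unnecessary spaces
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)
```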
- Since extractive summarization is an unsupervised technique, I used the TextRank algorithm, which is based on Google's PageRank algorithm.
- It is a graph-based algorithm in which an importance score is estimated for each sentence.
- The algorithm takes the interdependence of sentences on one another into account.
- After that, the top sentences are selected based on their importance scores to form the candidate summary.
- To represent the relationships between sentences, I used embeddings from the sentence-transformers model all-mpnet-base-v2.
- Note that the order of sentences is preserved after picking the top sentences (see the sketch after this list).
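A minimal sketch of this extractive step: sentence embeddings from all-mpnet-base-v2 form a cosine-similarity graph, a PageRank-style power iteration scores each sentence, and the top sentences are returned in their original document order. The `top_n` value and damping factor are illustrative choices, not the exact settings used:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def pagerank_scores(sim_matrix, damping=0.85, iters=50):
    # Power iteration over a row-normalized similarity matrix.
    n = sim_matrix.shape[0]
    row_sums = sim_matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    transition = sim_matrix / row_sums
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

def textrank_summary(sentences, top_n=10):
    embeddings = model.encode(sentences)
    # Pairwise cosine similarities define the edge weights of the graph.
    sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
    sim_matrix = np.clip(sim_matrix, 0.0, None)  # keep edge weights non-negative
    np.fill_diagonal(sim_matrix, 0.0)            # no self-loops in the sentence graph
    scores = pagerank_scores(sim_matrix)
    # Rank by importance, then sort the chosen indices to preserve document order.
    top_idx = sorted(np.argsort(scores)[::-1][:top_n])
    return [sentences[i] for i in top_idx]
```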
- For abstractive summarization I have fine-tuned the [bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) model on the SciSummNet dataset (a condensed training sketch follows this list).
- Since the model accepts a maximum of 1024 tokens, extractive summarization is applied to the original text first, and the length distributions of the full text, reference summary, and extractive summary are examined accordingly.
- Fine-tuned model link: bart-large-cnn-finetuned-scientific_summarize
- The results are good, but I think they can be improved with more data.
- The results are demonstrated as a PDF report in the results folder: results/summarization-report.pdf
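A condensed sketch of the fine-tuning setup. The `"text"`/`"summary"` column names are assumptions about the prepared dataset, `tokenized_train` and `tokenized_eval` are placeholders for the tokenized SciSummNet splits, and the hyperparameters are illustrative rather than the exact values used:

```python
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Inputs are capped at the model's 1024-token limit, which is why the
    # extractive summary (not the full paper) is fed to the model.
    inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# tokenized_train / tokenized_eval: placeholders for the tokenized splits,
# e.g. produced with dataset.map(preprocess, batched=True).
args = Seq2SeqTrainingArguments(
    output_dir="bart-large-cnn-finetuned-scientific_summarize",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    predict_with_generate=True,
    report_to="wandb",  # wandb is listed in the dependencies below
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```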
- transformers
- sentence-transformers
- torch
- gradio
- accelerate
- wandb
- datasets
- xml
- nltk
- numpy
- pandas
- matplotlib
- seaborn