/summarize-scientific-papers

Summarizing scientific papers using TextRank and a fine-tuned bart-large-cnn model.

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Long Document Summarization Task


  • A space for easily summarizing long documents and downloading the summary.
  • You can also adjust the model parameters to get the best results.

Introduction


  • I applied extractive summarization followed by abstractive summarization to the ScisummNet dataset in order to summarize long documents.

  • Exploratory data analysis (EDA) of the dataset is demonstrated in the notebook.

Dataset


  • The dataset is a collection of scientific papers from the ScisummNet dataset.
  • It contains 1000 papers in total.
  • The papers are stored in XML format, so I used Python's xml library to extract the text from the XML files.
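
The XML extraction step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the tag names (PAPER, ABSTRACT, SECTION, S) and the helper name extract_sentences are assumptions made for the example, so check them against the actual files before reusing it.

```python
import xml.etree.ElementTree as ET


def extract_sentences(xml_string, sentence_tag="S"):
    """Return the text of every <sentence_tag> element, in document order.

    The default tag name "S" is an assumption about the file layout.
    """
    root = ET.fromstring(xml_string)
    sentences = []
    for elem in root.iter(sentence_tag):
        # itertext() also collects text nested in child elements
        text = "".join(elem.itertext())
        # collapse runs of whitespace and skip empty elements
        text = " ".join(text.split())
        if text:
            sentences.append(text)
    return sentences


# Hypothetical sample document, used only to show the call
sample = """<PAPER>
  <S sid="0">A Toy Paper Title</S>
  <ABSTRACT><S sid="1">We study summarization.</S></ABSTRACT>
  <SECTION title="Introduction"><S sid="2">Long   papers are hard to read.</S></SECTION>
</PAPER>"""

sentences = extract_sentences(sample)
print(sentences)
```

For files on disk, ET.parse(path).getroot() would replace ET.fromstring.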

Preprocessing


  • Removed punctuation, stopwords, special characters, numbers, and unnecessary whitespace.
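
A minimal sketch of this kind of cleaning is shown below. The stopword list is a tiny hypothetical stand-in (a real run would more likely use a full list such as the one shipped with NLTK), and lowercasing is an added assumption that the step above does not mention.

```python
import re

# Tiny illustrative stopword list (assumption, not the list used in the notebook)
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
             "is", "are", "to", "for", "with"}


def preprocess(text):
    """Strip numbers, punctuation, special characters, stopwords and extra spaces."""
    text = text.lower()                     # assumption: case is normalized
    text = re.sub(r"\d+", " ", text)        # numbers
    text = re.sub(r"[^a-z\s]", " ", text)   # punctuation and special characters
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)                 # single spaces between tokens


cleaned = preprocess("The model, trained in 2019, scores 95% on   ROUGE-1!")
print(cleaned)
```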

Extractive Summarization


Since extractive summarization can be done as an unsupervised technique, I used the TextRank algorithm, which is based on Google's PageRank algorithm.

  • It is a graph-based algorithm in which an importance score is estimated for each sentence.
  • The algorithm takes into account the interdependence of sentences on one another.
  • The top sentences are then selected by importance score to form the candidate summary.
  • To represent the relationships between sentences, I used embeddings from the sentence-transformers model all-mpnet-base-v2.
  • Note that the original order of the sentences is preserved after the top sentences are picked.
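
The steps above can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the notebook's implementation: the function names, the damping factor of 0.85, and the fixed iteration count are choices made for the example, and toy 2-dimensional vectors stand in for real all-mpnet-base-v2 embeddings so the block runs without downloading a model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank_scores(embeddings, damping=0.85, iterations=50):
    """Score sentences with PageRank-style power iteration on a similarity graph."""
    n = len(embeddings)
    # Edge weight = cosine similarity clipped at zero; no self-loops
    weights = [[max(cosine(embeddings[i], embeddings[j]), 0.0) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight
            incoming = sum(weights[j][i] * scores[j] / out_sums[j]
                           for j in range(n) if out_sums[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def select_top(sentences, scores, k):
    """Pick the k highest-scoring sentences, returned in their original order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "We propose a summarization method.",
    "The method ranks sentences in a graph.",
    "Funding was provided by a grant.",
    "Ranked sentences form the summary.",
]
# Toy vectors (assumption). Real embeddings could come from
# SentenceTransformer("all-mpnet-base-v2").encode(sentences).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]

scores = textrank_scores(embeddings)
summary = select_top(sentences, scores, k=2)
print(summary)
```

Sorting the selected indices before returning is what keeps the summary in document order rather than in score order.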

Abstractive Summarization


Results


  • The results are good, but I think they could be improved by using more data.
  • The results are presented as a PDF file in the results folder: results/summarization-report.pdf

Libraries