Information extraction Interview Task
Scores and summaries are in the csv folder.
XML versions of the research paper PDFs are in the xml folder.
PDFs of the research papers are in the paper folder.
- Algorithm: use TF-IDF to find the most important topic (or topics), then use the spaCy Matcher to find patterns and collect all the sentences that revolve around it.
- Using GROBID, I converted the PDFs to XML, which is saved in the xml folder.
- Using the approach from task 1, I found the contributing sentences.
- Since even the contributing sentences run to about 8000 words, I used BigBird Pegasus, which accepts inputs of up to 4096 tokens (unlike the roughly 512 tokens of traditional transformer models).
- Since SummEval has dependency issues on macOS and Kaggle/Colab, I used the original ROUGE score library (https://pypi.org/project/rouge-score/) to compute the metrics of the abstractive summary against the original abstract.
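
The topic and sentence selection step above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the notebook's code: it hand-rolls a smoothed TF-IDF instead of using a library, treats each sentence as one document, and swaps the spaCy Matcher for a simple keyword check. The function names, the small stopword list, and the sample sentences are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the notebook).
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "for", "on", "in", "and", "to", "is", "are", "we"}
)


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def top_tfidf_terms(sentences, k=1, stopwords=STOPWORDS):
    """Rank terms by their TF-IDF summed over all sentences.

    Each sentence is treated as one document. The idf is smoothed as
    log((1 + n) / (1 + df)) + 1 so that it is never zero.
    """
    docs = [[t for t in tokenize(s) if t not in stopwords] for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(k)]


def select_sentences(sentences, terms):
    """Keep the sentences that mention at least one of the given terms."""
    wanted = set(terms)
    return [s for s in sentences if wanted & set(tokenize(s))]


if __name__ == "__main__":
    sample = [
        "We propose a transformer model for summarization.",
        "The transformer model uses sparse attention.",
        "Results are reported on arXiv papers.",
        "Sparse attention lets the transformer read long documents.",
    ]
    topics = top_tfidf_terms(sample, k=1)
    print(topics)
    for sentence in select_sentences(sample, topics):
        print("-", sentence)
```

A keyword check like this misses inflected forms and multi-word patterns, which is where a token-pattern matcher such as the spaCy Matcher would be the better fit.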
All contributing sentences, ROUGE scores, and abstractive summaries are saved in a file in the csv folder.
The final ROUGE score of the BigBird model compared to the abstract is:
{'rouge1': Score(precision=0.4372093023255814, recall=0.42152466367713004, fmeasure=0.4292237442922374),
'rouge2': Score(precision=0.14018691588785046, recall=0.13513513513513514, fmeasure=0.13761467889908258),
'rougeL': Score(precision=0.20465116279069767, recall=0.19730941704035873, fmeasure=0.2009132420091324)}
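
To make the precision, recall, and fmeasure fields above easier to read, here is a minimal sketch of how an n-gram ROUGE score is put together. It is a simplified stand-in, not the rouge-score package itself: it skips stemming, uses a plain regex tokenizer, and does not cover ROUGE-L. The names `rouge_n`, `ngram_counts`, and `f_measure` are hypothetical.

```python
import re
from collections import Counter, namedtuple

# Same field names as the scores reported above.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def ngram_counts(text, n):
    """Count the n-grams of lowercase alphanumeric tokens in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between the two texts."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return Score(precision, recall, f_measure(precision, recall))


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

Precision is the share of the generated summary's n-grams that also appear in the abstract, recall is the share of the abstract's n-grams that the summary recovers, and fmeasure is their harmonic mean.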
Kaggle notebook link: https://www.kaggle.com/code/parthplc/extract-data-from-research-paper-final?scriptVersionId=90756568
Score CSV link: https://docs.google.com/spreadsheets/d/1l46_Zg5qS1aCgiTycvRk5EBTkq5tS4RRjnauUv4YOmc/edit?usp=sharing