
This repo contains the code and notes for the Parspec assignment for the Data Scientist role.


Parspec DataScientist Assignment


Check out the app here -> https://parspec-bert-finetuned-lighting.streamlit.app/


How to Run

There are 2 ways to run the code and make predictions:

1 - [ RECOMMENDED ] Use the deployed app. To make this easier, I have already deployed the fine-tuned model, so you can open https://parspec-bert-finetuned-lighting.streamlit.app/ and make predictions there.

2 - Run the notebook I have created, which loads the fine-tuned model and makes predictions. The steps are listed below.

PS: To make it more accessible, I have pushed the fine-tuned model to the Hugging Face Hub -> https://huggingface.co/luci007/LightingData-Bert-Finetuned/tree/main

  • Clone the repo
  • Install the dependencies listed in requirements.txt
  • Move into the code folder -> cd parspec_assignment/code
  • Run the tutorial.ipynb notebook to make predictions

PS: You need the required modules installed at the correct versions (see requirements.txt) for the notebook to run.
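
For option 2, loading the fine-tuned model outside the notebook might look roughly like the sketch below. It assumes that the Hub repo linked above includes the tokenizer and config files the transformers `pipeline` helper needs, and that `transformers` and `torch` are installed. `load_classifier` and `predict` are hypothetical helper names, not code taken from tutorial.ipynb, and the label names returned depend on how the model was saved.

```python
MODEL_ID = "luci007/LightingData-Bert-Finetuned"  # model repo from the Hub link above


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline for the fine-tuned model.

    transformers is imported lazily so the rest of this sketch can be
    imported without it. The first call downloads the weights from the Hub.
    """
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def predict(classifier, text):
    """Return (label, score) for one piece of cleaned PDF text.

    truncation=True asks the tokenizer to cut the input to the model's
    maximum length, since spec sheets are usually longer than BERT accepts.
    """
    result = classifier(text, truncation=True)[0]
    return result["label"], float(result["score"])


# Example usage (needs network access, so it is left commented out):
# clf = load_classifier()
# print(predict(clf, "led downlight 120v 3000k dimmable recessed fixture"))
```

The classifier is passed into `predict` rather than created inside it, so the model is loaded once and reused for every document.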


Files and Folders

There are 5 folders inside the repo:

1 - src/ : All the main code for the Streamlit app (deployment and frontend)

  • main.py : Main Streamlit code
  • model.py : Loads the model and makes predictions
  • utils.py : Downloads a PDF from a URL, extracts its text, and cleans the extracted text

2 - code/ : The main experiment code folder

  • BERT_BASED_UNCASED_FINETUNED_CLEAN_DATA.ipynb : The main experiment notebook. It covers loading the data, partitioning, the Dataset class and DataLoader, fine-tuning the BERT base uncased model on the custom data, and benchmarking it
  • tutorial.ipynb : A tutorial notebook you can run manually to check the entire flow and make predictions
  • rough_code : Directory containing my rough code
  • tuts_file : Experiment files stored for tutorial purposes. You can use them if you like
  • push_to_hf_hub.ipynb : Notebook that pushes my fine-tuned model to the Hugging Face Hub

3 - clean_data/ : Contains 2 .csv files

  • Training and test CSVs : Created by extracting the text from the PDFs and cleaning it

4 - holdout_score/ : Contains the benchmark results [results.csv] on the provided holdout data. Accuracy: 91.25 %

5 - images/ : Images used in readme.md
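
The accuracy reported for holdout_score/results.csv is the share of holdout documents whose predicted label matches the true label. A minimal sketch of recomputing it is shown below; the column names `label` and `prediction` are assumptions, since the actual layout of results.csv is not described here, so adjust them to match the file.

```python
import csv


def accuracy(rows, true_col="label", pred_col="prediction"):
    """Return the percentage of rows whose prediction equals the true label.

    `rows` is any iterable of dict-like records. The column names are
    hypothetical defaults, not confirmed from results.csv.
    """
    total = 0
    correct = 0
    for row in rows:
        total += 1
        if row[true_col].strip() == row[pred_col].strip():
            correct += 1
    if total == 0:
        raise ValueError("no rows to score")
    return 100.0 * correct / total


def accuracy_from_csv(path, true_col="label", pred_col="prediction"):
    """Read a results CSV with a header row and score it."""
    with open(path, newline="", encoding="utf-8") as f:
        return accuracy(csv.DictReader(f), true_col, pred_col)


# Example usage (path relative to the repo root):
# print(accuracy_from_csv("holdout_score/results.csv"))
```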


Progress Timeline

  • Download the PDF files from the URLs and save them in a directory [ ~ 1 hr, time taken for the download, not for writing the code ] [ start date - 30/11/23 ]
  • Extract text from the PDF files and save it in a .txt file [ ~ 30 mins, time taken for the extraction, not for writing the code ] [ start date - 30/11/23 ]
  • Convert the extracted text into CSV files [ for both train and test ]. These CSVs are used for fine-tuning the model
  • Clean and pre-process the texts
  • Fine-tune the BERT base and BERT large models. Had issues with BERT large; satisfied with BERT base
  • Try out the DeBERTa-v3 model from Microsoft for text classification [ couldn't do it, since the model is large and I had no compute power ]
  • Benchmark the scores on the test data [ Update: scored 91.25 % accuracy ]
  • Add a tutorial notebook for the team to run the code and check predictions
  • Upload the model to the Hugging Face Hub [ link -> https://huggingface.co/luci007/LightingData-Bert-Finetuned ]
  • Host on Streamlit Cloud [ final link -> https://parspec-bert-finetuned-lighting.streamlit.app/ ]
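
The first four steps of the timeline (download, extract, convert to CSV, clean) can be sketched as below. This is an illustration under stated assumptions, not the code from the notebooks: the function names, the `text`/`label` CSV columns, and the cleaning rules are hypothetical, and `pypdf` is an assumed choice of PDF library. For brevity the sketch cleans each text while writing the CSV instead of as a separate pass.

```python
import csv
import re
import urllib.request
from pathlib import Path


def download_pdf(url, dest_path, timeout=30):
    """Download the PDF at `url` and save it to `dest_path`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        Path(dest_path).write_bytes(response.read())
    return Path(dest_path)


def extract_text(pdf_path):
    """Extract the text of every page. Requires the third-party pypdf package."""
    from pypdf import PdfReader  # imported lazily; assumed library choice

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def clean_text(raw):
    """Lowercase, replace non-alphanumeric characters, collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def write_dataset(rows, csv_path):
    """Write (text, label) pairs to a CSV with a header row, cleaning each text."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text, label in rows:
            writer.writerow([clean_text(text), label])
```

Keeping `clean_text` as a standalone function means the same cleaning can be applied at prediction time, so the model sees text in the same form it was fine-tuned on.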

Some Snapshots from the WEB-APP

[ Screenshot: input - URL, output - lighting PDF ]

[ Screenshot: input - URL, output - not a lighting PDF ]

[ Screenshot: input - PDF, output - lighting PDF ]

[ Screenshot: input - PDF, output - not a lighting PDF ]


Improvements

I will discuss these in depth in the interview and in my documentation.

  • Try out a multimodal model [ text + images ] to provide more context [ couldn't do it due to time constraints ]