Parspec Data Scientist Assignment

This repo contains the code and documentation for the Parspec assignment for the Data Scientist role.

How to Run

There are two ways to run the code and make predictions:

1 - [RECOMMENDED] Use the deployed app. To make this easier, I have already deployed the fine-tuned model. Open https://parspec-bert-finetuned-lighting.streamlit.app/ and make predictions there.

2 - Use the notebook I have created: load the fine-tuned model and make predictions.

PS: To make the model more accessible, I have pushed the fine-tuned model to the Hugging Face Hub: https://huggingface.co/luci007/LightingData-Bert-Finetuned/tree/main

PS: For this option, you need the required modules installed at compatible versions.
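For option 2, a minimal sketch of loading the model from the Hub and making one prediction could look like the following. It assumes that `torch` and `transformers` are installed, that the Hub repo can be loaded with the `Auto*` classes (tokenizer files and a sequence-classification head are present), and that label names are stored in the model config. The `predict` helper is hypothetical and is not taken from the notebook.

```python
MODEL_ID = "luci007/LightingData-Bert-Finetuned"  # Hub repo linked above


def predict(text: str, model_id: str = MODEL_ID) -> dict:
    """Classify one piece of text with the fine-tuned model (illustrative helper)."""
    # Heavy dependencies are imported lazily so that importing this file stays cheap.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the repo ships both tokenizer and model weights.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # BERT accepts at most 512 tokens, so longer PDF text is truncated.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    probs = torch.softmax(logits, dim=-1)[0]
    idx = int(probs.argmax())
    # id2label falls back to generic names (LABEL_0, ...) if none were saved.
    return {"label": model.config.id2label[idx], "score": float(probs[idx])}
```

Calling `predict("some cleaned product text")` downloads the weights on first use and returns the predicted label with its probability.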

The repo contains five folders:

1 - src/ : Contains all the main code for the Streamlit app (deployment + frontend).

main.py : Main Streamlit code
model.py : Loads the model and makes predictions
utils.py : Downloads a PDF from a URL, extracts its text, and cleans the extracted text
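As an illustration of the cleaning step in utils.py, here is a minimal sketch using only the standard library. The function name `clean_text` and the specific rules (dropping non-printable and non-ASCII characters, collapsing whitespace, lowercasing) are assumptions made for this example; the actual code in utils.py may differ.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize text extracted from a PDF (hypothetical helper)."""
    # Replace everything outside printable ASCII (newlines, form feeds,
    # non-breaking spaces, stray symbols) with a plain space.
    text = re.sub(r"[^\x20-\x7E]", " ", raw)
    # Collapse runs of spaces left over from the PDF layout.
    text = re.sub(r" +", " ", text)
    # Lowercase to match an uncased BERT model, and trim the ends.
    return text.strip().lower()
```

For example, `clean_text("LED  Downlight\x0c 120V\n")` gives `"led downlight 120v"`.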

2 - code/ : The main experiment code folder.

BERT_BASED_UNCASED_FINETUNED_CLEAN_DATA.ipynb : The main experiment notebook. It has the code for loading and partitioning the data, the Dataset class and DataLoader, fine-tuning the BERT base uncased model on the custom data, and benchmarking it
tutorial.ipynb : A tutorial notebook you can run manually to check the entire flow and make predictions
rough_code : Directory with my rough code
tuts_file : Experiment files kept for tutorial purposes. You can use them if you like
push_to_hf_hub.ipynb : Notebook that pushes my fine-tuned model to the Hugging Face Hub

3 - clean_data/ : Contains two .csv files.

training and test CSVs : Created by extracting the text from the PDFs and cleaning it

4 - holdout_score/ : Contains the benchmark score (results.csv) on the provided holdout data. Accuracy: 91.25%
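An accuracy figure like this can be recomputed from a predictions file with a few lines of standard-library Python. This is only a sketch: the column names `label` and `prediction` are hypothetical, so check the header of results.csv and adjust them.

```python
import csv
from typing import Iterable, Mapping


def accuracy(rows: Iterable[Mapping[str, str]],
             label_col: str = "label",
             pred_col: str = "prediction") -> float:
    """Fraction of rows whose prediction equals the true label."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to score")
    correct = sum(1 for row in rows if row[label_col] == row[pred_col])
    return correct / len(rows)


def accuracy_from_csv(path: str,
                      label_col: str = "label",
                      pred_col: str = "prediction") -> float:
    """Read a CSV with a header row and score it (column names are assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        return accuracy(csv.DictReader(f), label_col, pred_col)
```

Usage would be along the lines of `accuracy_from_csv("holdout_score/results.csv")`, which returns a fraction between 0 and 1.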

5 - images/ : Contains the images used in readme.md

Not done (time constraint): trying a multimodal model (text + images) to provide additional context.