Note: Due to system limitations, I set the number of epochs to 3 and the model training data size to 20000.
numpy
pandas
transformers
sklearn
accelerate
datasets
seqeval
- Extract 231110033_assignment2.zip on your local system.
- The archive contains two separate ipynb files, one for IndicBERT and one for IndicNER, each named after its model.
- Run each cell in Kaggle (preferred) or Google Colaboratory.
Identify the named entities for a set of 25 sentences and mark them in BIO format.
Step 1: As discussed in class, I assigned a named-entity tag to every word using the classes PER (person), LOC (location), ORG (organisation), MISC (miscellaneous), and O (others, i.e., non-NEs).
Step 2: As mentioned in the question, the file "1 Ans) Manually Annotated Sentences" is inside the zip folder, and I have also submitted it on the given website, https://bangla.iitk.ac.in/cs689/
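The BIO scheme used for the manual annotation can be illustrated with a minimal Python sketch. The helper name `spans_to_bio` and the example sentence are hypothetical (the sentence is not one of the 25 assignment sentences); the sketch only shows how B- and I- prefixes are assigned to the first and following words of an entity, with O for everything else.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to one BIO tag per token.

    spans: list of (start, end, label) over token indices, `end` exclusive.
    """
    tags = ["O"] * len(tokens)           # default: not part of any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining words of the same entity
    return tags


# Hypothetical example sentence, not taken from the assignment data.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai", "."]
spans = [(0, 2, "PER"), (5, 6, "LOC")]
bio_tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Here "Sachin Tendulkar" is tagged B-PER I-PER and "Mumbai" is tagged B-LOC; all other words receive O.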
In the second question we had to fine-tune the IndicBERT and IndicNER models on the Naamapadam corpus. I chose my mother tongue, Hindi, with the 70% train, 10% validation, and 20% test splits already provided in the corpus.
I used a Kaggle GPU for this. The IndicNER model is available at https://huggingface.co/ai4bharat/IndicNER, whereas IndicBERT is available at https://huggingface.co/ai4bharat/indic-bert.
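Fine-tuning a subword model for NER usually requires mapping word-level tags onto subword tokens. The sketch below shows one common convention, which is an assumption here and not necessarily what the notebooks do: the first subword of each word keeps the word's label, and special tokens and continuation subwords get -100, the default index ignored by PyTorch's cross-entropy loss. The function name `align_labels_to_subwords` and the sample `word_ids` list are hypothetical; in practice such a list would come from a fast tokenizer's `word_ids()` method.

```python
def align_labels_to_subwords(word_ids, word_label_ids, ignore_index=-100):
    """Map word-level label ids onto subword positions.

    word_ids: one entry per subword token, giving the index of the word it
              came from, or None for special tokens.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)          # special token: ignored in the loss
        elif wid != previous:
            aligned.append(word_label_ids[wid])   # first subword keeps the word label
        else:
            aligned.append(ignore_index)          # continuation subword: ignored
        previous = wid
    return aligned


# Hypothetical input: 3 words split into 2, 1, and 3 subwords,
# wrapped by two special tokens.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_label_ids = [1, 2, 0]   # hypothetical ids for the three word-level tags
aligned = align_labels_to_subwords(word_ids, word_label_ids)
print(aligned)
```

An alternative design labels every subword with its word's tag; the choice affects how the loss weights long words.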
This is done in two Python notebook files (IndicBERT and IndicNER).
The comparison is done inside the notebooks themselves, and the macro-F1 score is reported there.
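To make the macro-F1 metric concrete, here is a simplified pure-Python sketch of an entity-level macro-F1 over BIO sequences. It is an illustration only, not the seqeval implementation and not the code from the notebooks: an entity counts as correct only if its type and boundaries match exactly, stray I- tags without a preceding B- are ignored, and the per-type F1 scores are averaged with equal weight. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def extract_entities(tags):
    """Return the set of (label, start, end) spans in one BIO sequence."""
    entities = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            entities.add((label, start, i))        # close the open span
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]              # open a new span
    return entities


def macro_f1(true_seqs, pred_seqs):
    """Unweighted mean of per-entity-type F1 (exact span match)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        for label, _, _ in gold & pred:
            tp[label] += 1
        for label, _, _ in pred - gold:
            fp[label] += 1
        for label, _, _ in gold - pred:
            fn[label] += 1
    scores = []
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_den, r_den = tp[label] + fp[label], tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: PER and ORG are predicted correctly, LOC is missed.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"], ["B-ORG", "O"]]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

On the toy data the per-type F1 values are 1.0 (PER), 0.0 (LOC), and 1.0 (ORG), so the macro average is 2/3. Because every type counts equally, a rare type that the model misses pulls the score down as much as a frequent one.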
For the third question, I passed all 25 sentences listed on the website, the same ones we manually annotated with NER tags, to ChatGPT. The result is inside the zip folder under the name "3 Ans) ChatGPT".
I did this part in both notebook files themselves; it can be found at the end of the code.
For this question, I submitted the report.pdf file, which is inside the zip folder.