/pleasePatronizeUs

Data Analysis and ML Processing for the Don't Patronize Me Dataset from SemEval 2022 (Task 4)


					######################## SEMEVAL 2022 INFORMATION ###########################
					# If you are participating in SemEval 2022-Task 4, this is the training     #
					# set provided for the task (there are no trial and no dev sets provided).  #
					# The full test set can be made available upon request after SemEval 2022   #
					# ends.                                                                     #
					#                                                                           #
					# Further details about the task can be found at the following sites:       #
					# Task website: https://sites.google.com/view/pcl-detection-semeval2022/    #
					# Codalab website: https://competitions.codalab.org/competitions/34036      #
					# Google Group: https://groups.google.com/g/pcl-detection-task4-semeval2022 #
					# Organizers email: semeval2022.task4.pcldetection@gmail.com                #
					#############################################################################

Hi, and welcome to Group J's submission for Data Mining and Machine Learning part 2 of NCI's postgraduate program.

Find the accompanying presentation here: https://studentncirl-my.sharepoint.com/:v:/g/personal/x20216220_student_ncirl_ie/EWSYpHFInn9AjFkmkP-RBsgBsU-rgTSGJ8UfOL-r8SPkFg?email=x20216238%40student.ncirl.ie

To complete the tasks mentioned in SemEval, please clone this repository. Each notebook is self-contained and should run without local function calls.

The notebook ExploratoryAnalysis.ipynb contains an exploratory analysis of the words in the task data and a breakdown of the binary PCL values against the text.
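As a rough illustration of this kind of breakdown, here is a minimal sketch that counts words separately for PCL and non-PCL paragraphs. It is not taken from the notebook: the function names, the made-up sample rows, and the rule that an annotation score of 2 or more counts as PCL are all assumptions made for the example.

```python
from collections import Counter
import re


def binarize(label: int) -> int:
    """Map a 0-4 annotation score to a binary PCL flag.

    Assumption for this sketch: scores of 2 or more count as PCL.
    """
    return 1 if label >= 2 else 0


def word_counts_by_class(rows):
    """Count lowercase word tokens separately for non-PCL (0) and PCL (1) rows."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in rows:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[binarize(label)].update(tokens)
    return counts


# Tiny made-up sample standing in for the real training file.
sample = [
    ("We must help these poor families in need", 3),
    ("The council approved the housing budget", 0),
    ("Poor children deserve our help", 4),
]

counts = word_counts_by_class(sample)
print("Most common PCL words:", counts[1].most_common(3))
print("Most common non-PCL words:", counts[0].most_common(3))
```

Comparing the two counters side by side is one simple way to see which words lean towards the PCL class.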

The notebook CNN_RNN_SVM_Bert_Group_J.ipynb contains the CNN, RNN and SVM training algorithms. 

The notebook PatronizeUsBertModelAlphaTraining.ipynb lays out the training of the initial parameters of a BERT model. Unfortunately, this model may have to be trained on more powerful servers, and may fail on local desktops due to the scope of BERT tokenization.

The notebook PatronizeUsBertModelBetaTrainingTask1.ipynb contains the workings to fine-tune a pre-tokenized BERT model, which should achieve high accuracy and F1 scores on the PCL training data.
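For readers unfamiliar with the two metrics, the following is a minimal sketch of how accuracy and F1 for the positive (PCL) class can be computed from binary labels. It is illustrative only and not the notebook's evaluation code; the function name and the toy label lists are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 for the positive class) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Toy labels: 2 positives out of 10, chosen only to illustrate the metrics.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always predicts "no PCL".
all_negative = [0] * 10
acc_a, f1_a = accuracy_and_f1(y_true, all_negative)

# A classifier that finds one of the two positives and raises one false alarm.
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
acc_b, f1_b = accuracy_and_f1(y_true, mixed)

print(f"always negative: accuracy={acc_a:.2f}, F1={f1_a:.2f}")
print(f"mixed:           accuracy={acc_b:.2f}, F1={f1_b:.2f}")
```

Both toy classifiers reach the same accuracy, but only the second has a non-zero F1, which is why it helps to report F1 alongside accuracy when the positive class is rare.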

The last notebook, BertModelTask2FineTuning.ipynb, looks at fine-tuning the BERT model for the second SemEval subtask (Task 2).

All members of the team contributed equally to the work in this project.

GitHub link for the code can be found at https://github.com/kevinTheQuigley/pleasePatronizeUs 

So long, and thanks for all the fish!