Ford Sentence Classification
In this challenge, the goal is to classify a sentence into one of the following categories:
- Responsibility
- Requirement
- Skill
- SoftSkill
- Education
- Experience
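The six categories above need integer ids for the classifier head; a minimal sketch (the exact id order used in training is an assumption):

```python
# Hypothetical label-to-id mapping for the six categories; the order
# actually used during training may differ.
LABELS = ["Responsibility", "Requirement", "Skill",
          "SoftSkill", "Education", "Experience"]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```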
I used a BERT model to classify the sentences. The model is trained on 80% of the data and tested on the remaining 20%, and is saved as 'ford-sentence-classifiaction' in the directory. I used the BERT base model (cased), which is pretrained on English text with a masked language modeling (MLM) objective.
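The 80/20 train/test split can be sketched in plain Python (the original may use a library helper such as sklearn's `train_test_split`; the seed and shuffling here are assumptions):

```python
import random

def split_80_20(examples, seed=42):
    """Shuffle a list of (sentence, label) pairs and split it 80/20.
    Plain-Python sketch; seed and shuffle strategy are assumptions."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Toy data standing in for the real sentences and labels.
data = [(f"sentence {i}", i % 6) for i in range(100)]
train_set, test_set = split_80_20(data)
```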
In preprocessing, I used the following steps to clean the data:
- Remove null values
- Remove all HTML tags, email addresses, and URLs
- Remove emoticons and emojis
- Remove all special characters
- Remove single characters from the start of the text
- Substitute multiple spaces with a single space
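The cleaning steps above can be sketched with regular expressions; the exact patterns are assumptions and the originals may differ:

```python
import re

def clean_sentence(text: str) -> str:
    """Apply the cleaning steps listed above. Patterns are illustrative
    assumptions, not the project's exact regexes."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)  # emojis
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)         # special characters
    text = re.sub(r"^\s*[a-zA-Z]\s+", "", text)         # single char at start
    text = re.sub(r"\s+", " ", text)                    # collapse spaces
    return text.strip()
```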
I have used the following steps to tokenize the data:
- Tokenize the sentences
- Pad and truncate all the sentences to a maximum length of 256
- Create input ids and attention masks
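The pad/truncate step and the attention masks can be sketched in plain Python (the token ids are assumed to come from a BERT tokenizer such as transformers' `BertTokenizer`; `pad_id=0` matches BERT's `[PAD]`):

```python
def pad_and_mask(token_ids, max_len=256, pad_id=0):
    """Truncate or pad a list of token ids to max_len and build the
    matching attention mask (1 = real token, 0 = padding)."""
    ids = token_ids[:max_len]              # truncate to max_len
    mask = [1] * len(ids)                  # real tokens attend
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding
```

In practice the tokenizer's own `padding` and `truncation` options do this in one call; the sketch just makes the mask construction explicit.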
I have used the following steps to train the model:
- Set the maximum length of the sentence to 256
- Set the batch size to 16
- Set the number of epochs to 5
- Set the learning rate to 1e-5, decay=
- Set the epsilon to 1e-6
- Use the AdamW optimizer
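The hyperparameters above, collected in one place; how they are wired into the optimizer and training loop is an assumption (e.g. passed to a TensorFlow/transformers AdamW optimizer):

```python
# Training hyperparameters from the list above. The decay value is not
# specified in this README, so it is left out rather than guessed.
CONFIG = {
    "max_length": 256,      # pad/truncate length for tokenized sentences
    "batch_size": 16,
    "epochs": 5,
    "learning_rate": 1e-5,
    "epsilon": 1e-6,        # AdamW numerical-stability term
    "optimizer": "AdamW",
}
```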
Installation
- Python
- Pandas
- Numpy
- Tensorflow
- Transformers
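With Python available, the dependencies above can be installed with pip (package names assumed to be the standard PyPI ones):

```shell
pip install pandas numpy tensorflow transformers
```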