NER-sklearn-crfsuite-validation

Training algorithm for the sklearn-crfsuite classifier

The model used for entity recognition was a Conditional Random Field (CRF), provided by sklearn-crfsuite, a scikit-learn-compatible wrapper around CRFsuite. A CRF classifies predefined entity types using statistical machine learning that takes into account the order in which the words appear in the posts. To train the model, we used the file generated in Doccano (see part 2), converting it from JSONL to IOB (the tagging scheme used in the CoNLL-2003 dataset). IOB is a token-level markup format for chunking tasks such as named entity recognition. It uses the following tags:

B - (Beginning): the first word of an entity chunk;

I - (Inside): any subsequent word inside the chunk;

O - (Outside): a word outside any entity chunk.
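
The JSONL-to-IOB conversion described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual converter: it assumes each Doccano record has a "text" field and a "label" field holding [start, end, entity_type] character spans, it tokenizes on whitespace, and it assumes annotations are aligned to token boundaries. The function name jsonl_line_to_iob and the sample sentence are hypothetical.

```python
import json
import re


def jsonl_line_to_iob(line):
    """Convert one Doccano-style JSONL record into (token, tag) pairs."""
    record = json.loads(line)
    text = record["text"]
    # Assumed span layout: [start_offset, end_offset, entity_type].
    spans = record.get("label", [])
    pairs = []
    # Whitespace tokenization that keeps each token's character offsets.
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: the token is outside any entity chunk
        for start, end, entity in spans:
            if match.start() >= start and match.end() <= end:
                # The token that opens the span gets B-, later ones get I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{entity}"
                break
        pairs.append((match.group(), tag))
    return pairs


# Hypothetical record, invented for illustration only.
sample = json.dumps({
    "text": "took ibuprofen and got a bad headache",
    "label": [[5, 14, "Drug"], [25, 37, "ADR"]],
})

for token, tag in jsonl_line_to_iob(sample):
    print(token, tag)
```

Here "ibuprofen" is tagged B-Drug, "bad" is tagged B-ADR, "headache" is tagged I-ADR, and every other word is tagged O.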

After the conversion, we created a new collection called bio_tokens and stored it in MongoDB. In this collection, each tweet is split into words and each word is paired with its label according to the IOB standard, resulting in the following labels:

I-Drug, B-Drug, I-ADR, B-ADR and O.
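
As a rough sketch of how per-word documents like those in bio_tokens could be regrouped into the per-tweet sequences a CRF is trained on, consider the example below. The field names (tweet_id, position, token, label), the feature set in word2features, and the sample data are all assumptions made for illustration; the project's real schema and features may differ.

```python
from itertools import groupby

# Hypothetical per-word documents, deliberately out of order.
bio_tokens = [
    {"tweet_id": 1, "position": 0, "token": "took", "label": "O"},
    {"tweet_id": 1, "position": 1, "token": "ibuprofen", "label": "B-Drug"},
    {"tweet_id": 2, "position": 1, "token": "headache", "label": "I-ADR"},
    {"tweet_id": 2, "position": 0, "token": "bad", "label": "B-ADR"},
]


def word2features(tokens, i):
    """Build a feature dict for token i, including its neighbours."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Neighbouring words let the model use word order as context.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features


def build_sequences(docs):
    """Group word documents by tweet and return (features, labels) lists."""
    X, y = [], []
    ordered = sorted(docs, key=lambda d: (d["tweet_id"], d["position"]))
    for _, group in groupby(ordered, key=lambda d: d["tweet_id"]):
        rows = list(group)
        tokens = [row["token"] for row in rows]
        X.append([word2features(tokens, i) for i in range(len(tokens))])
        y.append([row["label"] for row in rows])
    return X, y


X, y = build_sequences(bio_tokens)
print(y)

# With sklearn-crfsuite installed, X and y have the shape expected by
# sklearn_crfsuite.CRF(...).fit(X, y); hyperparameters are left out here
# because the values used in this project are not stated.
```

Each tweet becomes one sequence of feature dictionaries with a matching sequence of labels, so the model sees the words in their original order.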

The metrics used to measure the performance of the model built in this project were:

Precision

Recall

F1-score
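
To make the three metrics concrete, the sketch below computes them per label at the token level from gold and predicted tag sequences. It is an illustrative calculation only: the function name per_label_scores and the toy sequences are hypothetical, and the project's own evaluation may have been computed differently (for example, per entity rather than per token).

```python
from collections import Counter


def per_label_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed over tokens."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for gold, predicted in zip(true_seq, pred_seq):
            if gold == predicted:
                tp[gold] += 1
            else:
                fp[predicted] += 1  # predicted label was wrong
                fn[gold] += 1       # gold label was missed
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        n_predicted = tp[label] + fp[label]
        n_actual = tp[label] + fn[label]
        precision = tp[label] / n_predicted if n_predicted else 0.0
        recall = tp[label] / n_actual if n_actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy sequences invented for illustration; not results from this project.
y_true = [["O", "B-Drug", "O", "B-ADR", "I-ADR"]]
y_pred = [["O", "B-Drug", "O", "B-ADR", "O"]]

scores = per_label_scores(y_true, y_pred)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the missed I-ADR token lowers the precision of O (one of its three predictions is wrong) and gives I-ADR a recall of zero, which shows how a single tagging error affects two labels at once.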

Results:

This is part 3 of 5 of the course completion work. Developed by Beatriz Paixão and Katheleen Gregorato. See our publication on CONICT - IFSP at: https://bit.ly/3IsqULo
