NER-sklearn-crfsuite-validation

Training algorithm for the sklearn-crfsuite classifier

The model used for named entity recognition was a Conditional Random Field (CRF), provided by sklearn-crfsuite, a scikit-learn-compatible wrapper around CRFsuite. A CRF is a statistical sequence model: it classifies previously defined entities while taking into account the order in which the words appear in the posts. To train it, we used the file generated in Doccano (see part 2), converting it from JSONL to IOB, the tagging scheme used in the CoNLL-2003 dataset. IOB is a token-level markup format for chunking tasks such as named entity recognition. It uses the following marks:

    B (Beginning): for the first word of a block;
    I (Inside): for the remaining words inside a block;
    O (Outside): for words outside any block.
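
As an illustration of the JSONL to IOB conversion, here is a minimal sketch. It assumes each Doccano line carries a `text` string and a `label` list of `[start, end, label]` character offsets, and it tokenizes on whitespace; the export layout, the example sentence, and the helper name `spans_to_iob` are assumptions, not the project's actual code.

```python
import json
import re


def spans_to_iob(text, spans):
    """Tag whitespace-separated tokens with IOB labels.

    `spans` holds (start, end, label) character offsets, end exclusive.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when the two character ranges overlap.
            if match.start() < end and match.end() > start:
                # The token that covers the span start opens the block.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical Doccano JSONL line (keys and content are assumptions).
line = (
    '{"text": "aspirin gave me a bad headache", '
    '"label": [[0, 7, "Drug"], [18, 30, "ADR"]]}'
)
record = json.loads(line)
iob = spans_to_iob(record["text"], record["label"])

for token, tag in iob:
    print(token, tag)
```

Whitespace tokenization keeps the sketch short; punctuation attached to a word would stay inside the token, so a real pipeline would likely need a tokenizer suited to tweets.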

After the conversion, we created a new MongoDB collection called bio_tokens, in which each tweet is split into words and each word is stored with its label according to the IOB standard. The resulting labels are:

    I-Drug, B-Drug, I-ADR, B-ADR and O.
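
To show how word/label pairs like these can feed a CRF that takes word order into account, the sketch below turns one tweet into per-token feature dictionaries that include the neighbouring words. The feature set, the example rows, and the function names are illustrative assumptions; the features actually used in this project are not listed here.

```python
def word2features(tokens, i):
    """Describe token i and its immediate neighbours as a feature dict."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:].lower(),
    }
    if i > 0:
        features["-1:word.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of the tweet
    if i < len(tokens) - 1:
        features["+1:word.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of the tweet
    return features


def tweet2features(tokens):
    """Build the feature sequence for one tweet."""
    return [word2features(tokens, i) for i in range(len(tokens))]


# Hypothetical rows for one tweet, shaped as (word, IOB label) pairs.
rows = [
    ("aspirin", "B-Drug"),
    ("gave", "O"),
    ("me", "O"),
    ("a", "O"),
    ("bad", "B-ADR"),
    ("headache", "I-ADR"),
]
tokens = [word for word, _ in rows]
labels = [label for _, label in rows]

X = [tweet2features(tokens)]  # one feature sequence per tweet
y = [labels]                  # one label sequence per tweet
```

Lists shaped like `X` and `y` (one sequence of feature dicts and one sequence of labels per tweet) are what `sklearn_crfsuite.CRF(...).fit(X, y)` accepts.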

The metrics used to measure the performance of the model built in this project were:

    - Precision measures how reliable the positive predictions are: of everything the model classified as positive, how much was actually positive. In this project, precision assesses, for example, whether the words classified as a drug were in fact drugs.
    - Recall takes the true positives in the data as the reference: of all the samples that really belong to an entity (ADR or Drug), how many the algorithm assigned to the correct entity.
    - F1-score is the harmonic mean of precision and recall.
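
The three metrics can be sketched for a single label as follows. This is a token-level calculation on made-up label sequences; whether the project scored tokens or whole entities is not stated, so treat the setup as an assumption.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up gold and predicted labels for one tweet.
y_true = ["B-Drug", "O", "O", "B-ADR", "I-ADR", "O"]
y_pred = ["B-Drug", "O", "B-ADR", "B-ADR", "O", "O"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, "B-ADR")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the model predicts B-ADR twice and is right once, so precision is 0.5; the single true B-ADR token is found, so recall is 1.0. For full label sequences, the `sklearn_crfsuite.metrics` module provides flattened versions of these scores.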

Results:

This is part 3 of 5 of the course completion work. Developed by Beatriz Paixão and Katheleen Gregorato. See our publication on CONICT - IFSP at: https://bit.ly/3IsqULo