NER Error Analyzer

Quick Start

from nlu.error import *
from nlu.parser import *


cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
                {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]

parser = ConllParser('testb.pred.gold', cols_format)

parser.obtain_statistics(entity_stat=True, source='predict')

parser.obtain_statistics(entity_stat=True, source='gold')

parser.set_entity_mentions()

NERErrorAnnotator.annotate(parser)

parser.print_corrects()

parser.print_all_errors()

parser.error_overall_stats()

See the Input File Format section below for the expected input format.

Usage

Import the modules:

from nlu.error import *
from nlu.parser import *

First create a `ConllParser` instance from the input file path, specifying which column holds which tags in the `cols_format` argument:

ConllParser(filepath, cols_format)

cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
                {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]

parser = ConllParser('testb.pred.gold', cols_format)

Obtain the basic statistics with the `obtain_statistics()` method:

parser.obtain_statistics(entity_stat=True, source='predict')

parser.obtain_statistics(entity_stat=True, source='gold')

To annotate the NER errors in the documents held by the `ConllParser`:

NERErrorAnnotator.annotate(parser)
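
To give a sense of what comparing predictions against gold at the mention level involves, here is a minimal standalone sketch. It does not use the nlu package and is not how `NERErrorAnnotator` is implemented; the helper `extract_mentions` is hypothetical, and the tag sequences are the three rows from the Input File Format example.

```python
def extract_mentions(tags):
    """Return (start, end, type) spans, end exclusive, from IOB-style tags.

    Hypothetical helper for illustration only. A mention starts at a B- tag
    or wherever the entity type changes, so both I-only and B-/I- tagging work.
    """
    mentions = []
    start, current = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing mention
        prefix, _, etype = tag.partition('-')
        begins = tag != 'O' and (prefix == 'B' or etype != current)
        ends = current is not None and (tag == 'O' or begins)
        if ends:
            mentions.append((start, i, current))
            start, current = None, None
        if begins:
            start, current = i, etype
    return mentions


# Tags for the tokens: Natural, Language, Laboratory
pred_tags = ['I-ORG', 'I-ORG', 'I-ORG']
gold_tags = ['O', 'O', 'I-ORG']

predicted = set(extract_mentions(pred_tags))
gold = set(extract_mentions(gold_tags))

corrects = sorted(predicted & gold)   # spans that match exactly
errors = sorted(predicted ^ gold)     # spans found on one side only

print('corrects:', corrects)
print('errors:', errors)
```

In this example the predicted mention covers all three tokens while the gold mention covers only the last one, so no span matches exactly and both end up in `errors`.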

To print all correct predictions or all errors, use

parser.print_corrects() or parser.print_all_errors()

To get the overall error statistics, use the error_overall_stats() method.

Input File Format

The input file for ConllParser follows the column format used by CoNLL-2003.

For example,

Natural I-ORG O
Language I-ORG O
Laboratory I-ORG I-ORG
...

Here the first column is the token text, and the second and third columns are the predicted tag and the ground-truth tag respectively. The column order can be specified with the cols_format argument when instantiating ConllParser:

cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]  # col_num starts from 0
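
As an illustration of how the 0-based `col_num` values map onto one line of the file, here is a small standalone sketch. The helper `split_line` is hypothetical and not part of the nlu package, and it assumes that the token text sits in column 0 and that fields are separated by whitespace.

```python
cols_format = [{'type': 'predict', 'col_num': 1, 'tagger': 'ner'},
               {'type': 'gold', 'col_num': 2, 'tagger': 'ner'}]


def split_line(line, cols_format):
    """Pick the token and the tag of each column type out of one line.

    Hypothetical helper: assumes column 0 is the token text and that
    fields are whitespace-separated.
    """
    fields = line.split()
    row = {'text': fields[0]}
    for col in cols_format:
        # col_num is 0-based, so col_num 1 is the second field on the line
        row[col['type']] = fields[col['col_num']]
    return row


row = split_line('Natural I-ORG O', cols_format)
print(row)  # {'text': 'Natural', 'predict': 'I-ORG', 'gold': 'O'}
```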

I recommend using the shell command awk '{print $x}' filepath to extract the x-th column, for example awk '{print $4}' filepath to extract the 4th column. Note that awk numbers fields from 1, whereas col_num starts from 0.

Use paste file1.txt file2.txt to join two files side by side, line by line.
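
If you would rather do the merge in Python, the following sketch does roughly what the awk and paste combination does. The function `paste_columns` and the sample lines are made up for illustration. It assumes both inputs have the same number of lines, in the same order, with blank lines at sentence boundaries.

```python
def paste_columns(left_lines, right_lines, column, sep=' '):
    """Append field `column` (0-based) of each right line to the matching left line.

    Hypothetical helper for illustration. Blank lines, which mark sentence
    boundaries, are kept blank.
    """
    merged = []
    for left, right in zip(left_lines, right_lines):
        left, right = left.strip(), right.strip()
        if not left or not right:
            merged.append('')
            continue
        merged.append(left + sep + right.split()[column])
    return merged


# Token + predicted tag on the left, token + gold tag on the right
pred_lines = ['Natural I-ORG', 'Language I-ORG', 'Laboratory I-ORG', '']
gold_lines = ['Natural O', 'Language O', 'Laboratory I-ORG', '']

for line in paste_columns(pred_lines, gold_lines, column=1):
    print(line)
```

The result has the token, predicted tag, and gold tag in columns 0, 1, and 2, which matches the cols_format shown above.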