The event mention detection and coreference evaluators, and associated utilities (converters, validators).

Event Mention Evaluation (EvmEval)

This repository provides file conversion and scoring for event mention detection. It consists of the following three pieces of code:

  1. A simple converter from Brat annotation tool format to CMU detection format
  2. A scorer that can score system performance based on CMU detection format
  3. A visualizer that uses the Embedded Brat Viewer (not actively maintained)

To use the software, first prepare the CMU format annotation file from the Brat annotation output using "brat2tbf.py". The scorer then takes two documents in this format: one as the gold standard and one as the system output. In token mode, the scorer also needs the token files produced by the tokenizer. The usage of each tool is described below.

Use the example shell script "example_run.sh" to perform all of the above steps on the sample documents. If it succeeds, you will find the scoring results in the example_data directory.

Most utility code can be found in the util directory.

Naming Convention

The following scripts find corresponding files by docid and file extension, so the file extension must be provided exactly. The scripts have default values for these extensions, but may require additional arguments if the extensions are changed.

Here is how to find the extension:

Brat annotation files normally have the following name:

<docid>.ann

In this case, the file extension is ".ann", which the converter assumes as the default. If not, change it with the "-ae" argument.

In past evaluations, tokenization tables were provided. They normally have the following name:

<docid>.tab

In this case, the file extension is ".tab", which both the converter and the scorer assume as the default. If not, change it with the "-te" argument.
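The matching of files by docid and extension described above can be sketched as follows. This is only an illustration, not the repository's code: the helper names `doc_id_from_filename` and `companion_path` and the directory name "tokens" are hypothetical. It shows why the extension has to be given exactly: a docid may itself contain dots, so the extension cannot be guessed from the file name.

```python
from pathlib import Path


def doc_id_from_filename(filename: str, extension: str) -> str:
    """Strip a known extension from a file name to recover the docid."""
    if not filename.endswith(extension):
        raise ValueError(f"{filename!r} does not end with {extension!r}")
    return filename[: len(filename) - len(extension)]


def companion_path(directory: str, doc_id: str, extension: str) -> Path:
    """Build the path of the file for doc_id with the given extension."""
    return Path(directory) / (doc_id + extension)


# The docid "example.1" contains a dot, so only an exact extension match
# recovers it correctly.
doc_id = doc_id_from_filename("example.1.ann", ".ann")
token_table = companion_path("tokens", doc_id, ".tab")
```

With a multi-part extension such as ".txt.tab", the same call still yields the bare docid, which is what the "-ae" and "-te" arguments are for.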

scorer.py

The current scorer can score event mention detection and coreference based on the TBF (.tbf) format. It treats mentions as character based. The old token-based evaluation is still available and is triggered when the token table files are provided.

Features

  1. Produces F1-like scores by mapping system mentions to gold standard mentions; read the scoring documentation for more details.
  2. Can produce a comparison output showing the differences between system and gold standard:
     a. A text-based comparison output (-d option)
     b. A web-based comparison output using Brat's embedded visualization (-v option)
  3. If specified, generates temporary CoNLL format files and uses the CoNLL reference scorer to produce coreference scores.
  4. Can conduct temporal evaluation as well, if specified with the "-a" argument.
  5. Supports discontinuous span mentions.
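As a reminder of what an F1-style score looks like, here is a minimal sketch that computes precision, recall and F1 from simple counts. It is an assumption-laden illustration, not the scorer's actual procedure: the real scorer maps system mentions to gold mentions and may assign credit differently (see the scoring documentation). The function name `precision_recall_f1` is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(
    num_matched: int, num_system: int, num_gold: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from matched, system and gold counts."""
    # Guard against empty system output or empty gold standard.
    precision = num_matched / num_system if num_system else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 3 of 4 system mentions match, out of 6 gold mentions.
precision, recall, f1 = precision_recall_f1(3, 4, 6)
```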

Discontinuous Span Support

If your annotated data contains mentions that cover discontinuous spans, they can be represented in TBF files and scored correctly. For example:

He made his way home. 

The two tokens "made way" may be annotated as a mention, instead of the full "made his way" span. In this case, one can specify the span as 4,8;13,16, where 4,8 is the span for "made" and 13,16 is the span for "way". The two spans are joined by a semicolon, so the full line will look something like the following:

system_a example1        E1      4,8;13,16       made way        Movement_Transport-Person       Actual
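A line like the one above could be read with a small parser such as the sketch below. It assumes the fields are separated by tabs and appear in the order shown in the example; the field names (`system_id`, `mention_type`, `realis`, and so on) and the function names are chosen here for illustration and are not taken from the scorer's code.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    system_id: str
    doc_id: str
    mention_id: str
    spans: List[Tuple[int, int]]
    text: str
    mention_type: str
    realis: str


def parse_spans(span_field: str) -> List[Tuple[int, int]]:
    """Parse a span field such as '4,8;13,16' into [(4, 8), (13, 16)]."""
    spans = []
    for piece in span_field.split(";"):
        begin, end = piece.split(",")
        spans.append((int(begin), int(end)))
    return spans


def parse_mention_line(line: str) -> Mention:
    """Parse one tab-separated mention line (assumed field order)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    system_id, doc_id, mention_id, span_field, text, mention_type, realis = fields[:7]
    return Mention(system_id, doc_id, mention_id, parse_spans(span_field),
                   text, mention_type, realis)


line = "system_a\texample1\tE1\t4,8;13,16\tmade way\tMovement_Transport-Person\tActual"
mention = parse_mention_line(line)
```

A continuous mention is simply the special case with a single begin,end pair and no semicolon.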

Usage

usage: scorer_v1.8.py [-h] -g GOLD -s SYSTEM [-d COMPARISON_OUTPUT]
                      [-o OUTPUT] [-c COREF] [-a SEQUENCING] [-t TOKEN_PATH]
                      [-m COREF_MAPPING] [-of OFFSET_FIELD]
                      [-te TOKEN_TABLE_EXTENSION] [-ct COREFERENCE_THRESHOLD]
                      [-b] [--eval_mode {char,token}] [-wl TYPE_WHITE_LIST]
                      [-dn DOC_ID_TO_EVAL]

Event mention scorer, provides support to Event Nugget scoring, Event Coreference and Event Sequencing scoring.

core arguments:
  -g GOLD, --gold GOLD  Golden Standard
  -s SYSTEM, --system SYSTEM System output

optional arguments:
  -d COMPARISON_OUTPUT, --comparison_output COMPARISON_OUTPUT
                        Compare and help show the difference between system
                        and gold
  -o OUTPUT, --output OUTPUT
                        Optional evaluation result redirects, put eval result
                        to file
  -c COREF, --coref COREF
                        Eval Coreference result output, need to put the
                        referenceconll coref scorer in the same folder with
                        this scorer
  -a SEQUENCING, --sequencing SEQUENCING
                        Eval Event sequencing result output (After and
                        Subevent)
  -t TOKEN_PATH, --token_path TOKEN_PATH
                        Path to the directory containing the token mappings
                        file, only used in token mode.
  -m COREF_MAPPING, --coref_mapping COREF_MAPPING
                        Which mapping will be used to perform coreference
                        mapping.
  -of OFFSET_FIELD, --offset_field OFFSET_FIELD
                        A pair of integer indicates which column we should
                        read the offset in the token mapping file, index
                        startsat 0, default value will be [2, 3]
  -te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
                        any extension appended after docid of token table
                        files. Default is [.tab], only used in token mode.
  -ct COREFERENCE_THRESHOLD, --coreference_threshold COREFERENCE_THRESHOLD
                        Threshold for coreference mention mapping
  -b, --debug           turn debug mode on
  --eval_mode {char,token}
                        Use Span or Token mode. The Span mode will take a span
                        as range [start:end], while the Token mode consider
                        each token is provided as a single id.
  -wl TYPE_WHITE_LIST, --type_white_list TYPE_WHITE_LIST
                        Provide a file, where each line list a mention type
                        subtype pair to be evaluated. Types that are out of
                        this white list will be ignored.
  -dn DOC_ID_TO_EVAL, --doc_id_to_eval DOC_ID_TO_EVAL
                        Provide one single doc id to evaluate.
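When the scorer is driven from another script, the command line can be assembled programmatically. The sketch below only uses flags listed in the usage text above (-g, -s, -d, -c); the helper name `build_scorer_command`, the default script name "scorer.py" and the file names are assumptions made for illustration.

```python
from typing import List, Optional


def build_scorer_command(
    gold: str,
    system: str,
    comparison_output: Optional[str] = None,
    coref_output: Optional[str] = None,
    script: str = "scorer.py",  # assumed script name; adjust to your checkout
) -> List[str]:
    """Assemble an argument list for the scorer from the documented flags."""
    command = ["python", script, "-g", gold, "-s", system]
    if comparison_output is not None:
        command += ["-d", comparison_output]
    if coref_output is not None:
        command += ["-c", coref_output]
    return command


# Hypothetical file names.
command = build_scorer_command("gold.tbf", "system.tbf",
                               comparison_output="comparison.txt")
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the Python standard library.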

validator.py

The validator checks whether the supplied "tbf" file follows the assumed structure. It exits with status 255 if any errors are found. Validation logs are written to the same directory as the validator, with "errlog" as the extension.

Usage

usage: validator.py [-h] -s SYSTEM [-tm] [-t TOKEN_PATH] [-of OFFSET_FIELD]
                    [-te TOKEN_TABLE_EXTENSION] [-wc WORD_COUNT_FILE]
                    [-ty TYPE_FILE] [-b]

The validator check whether the supplied 'tbf' file follows assumed structure.
The validator will exit at status 255 if any errors are found, validation
logs will be written at the same directory of the validator with 'errlog' as
extension.

core arguments:
  -s SYSTEM, --system SYSTEM System output

optional arguments:
  -h, --help            show this help message and exit
  -tm, --token_mode     Token mode, default is false.
  -t TOKEN_PATH, --token_path TOKEN_PATH
                        Path to the directory containing the token mappings
                        file, only in token mode.
  -of OFFSET_FIELD, --offset_field OFFSET_FIELD
                        A pair of integer indicates which column we should
                        read the offset in the token mapping file, index
                        starts at 0, default value will be [2, 3]. Only used
                        in token mode.
  -te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
                        any extension appended after docid of token table
                        files. Default is [.tab]
  -wc WORD_COUNT_FILE, --word_count_file WORD_COUNT_FILE
                        A word count file that can be used to help validation,
                        such as the character_counts.tsv in LDC2016E64.
  -ty TYPE_FILE, --type_file TYPE_FILE
                        If provided, the validator will check whether the type
                        subtype pair is valid.
  -b, --debug           turn debug mode on

brat2tbf.py

This tool converts the Brat annotation format to the TBF format. We currently try to make as few assumptions as possible. However, in order to resolve transitive coreference redirects automatically, the coreference relation must be named "Coreference". The tool is also developed for event coreference only.

Features

  1. ID convention

The default setup follows the Brat v1.3 ID convention:

  • T: text-bound annotation
  • R: relation
  • E: event
  • A: attribute
  • M: modification (alias for attribute, for backward compatibility)
  • N: normalization [new in v1.3 of Brat]
  • #: note

Further development might allow a customized ID convention.

  2. This code only scans for event mentions and their attributes. Event arguments and entities are currently not handled. Annotations other than event mentions (with their attributes and text spans) are ignored, which means it only reads "E" annotations and their related attributes.

  3. Discontinuous text-bound annotations will be supported
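The ID convention listed above can be captured in a small lookup table. The following is a sketch for illustration only, not the converter's implementation; the names `BRAT_ID_PREFIXES`, `annotation_kind` and `is_event` are hypothetical.

```python
# Prefix-to-kind table, copied from the ID convention listed above.
BRAT_ID_PREFIXES = {
    "T": "text-bound annotation",
    "R": "relation",
    "E": "event",
    "A": "attribute",
    "M": "modification",
    "N": "normalization",
    "#": "note",
}


def annotation_kind(annotation_id: str) -> str:
    """Return the kind of a Brat annotation from the first character of its ID."""
    if not annotation_id:
        raise ValueError("empty annotation ID")
    prefix = annotation_id[0]
    if prefix not in BRAT_ID_PREFIXES:
        raise ValueError(f"unknown ID prefix in {annotation_id!r}")
    return BRAT_ID_PREFIXES[prefix]


def is_event(annotation_id: str) -> bool:
    """True for 'E' annotations, the only kind the converter reads directly."""
    return annotation_id.startswith("E")


kinds = [annotation_kind(i) for i in ("T1", "E12", "#3")]
```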

Usage

brat2tokenFormat.py [-h] (-d DIR | -f FILE) -t TOKENPATH [-o OUT]
                       [-oe EXT] [-i EID] [-w] [-te TOKEN_TABLE_EXTENSION]
                       [-ae ANNOTATION_EXTENSION] [-b]

This converter converts Brat annotation files to one single token based event mention description file (CMU format). It accepts a single file name or a directory name that contains the Brat annotation output. The converter also requires token offset files that shares the same name with the annotation file, with extension .txt.tab. The converter will search for the token file in the directory specified by '-t' argument

Required Arguments:
  -d DIR, --dir DIR     directory of the annotation files
  -f FILE, --file FILE  name of one annotation file
  -t TOKENPATH, --tokenPath TOKENPATH
                    directory to search for the corresponding token files

Optional arguments:
  -h, --help            show this help message and exit
  -o OUT, --out OUT     output path, 'converted' in the current path by
                        default
  -oe EXT, --ext EXT    output extension, 'tbf' by default
  -i EID, --eid EID     an engine id that will appears at each line of the
                        output file. 'brat_conversion' will be used by default
  -w, --overwrite       force overwrite existing output file
  -te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
                        any extension appended after docid of token table
                        files. Default is .txt.tab
  -ae ANNOTATION_EXTENSION, --annotation_extension ANNOTATION_EXTENSION
                        any extension appended after docid of annotation
                        files. Default is .tkn.ann
  -b, --debug           turn debug mode on

LDC-XML-to-Brat converter

This software converts LDC's XML format for the TAC KBP 2015 Event Nugget task to the Brat format. More specifically, it converts LDC's event nuggets and coreferences to events and coreference links that can be viewed via the Brat web interface. Brat annotation configurations for the output are available in the directory src/main/resources/.

Requirements of the software

The software requires Java 1.8. See pom.xml for other dependencies.

How to run the software

You can see its usage with the following command:

$ java -jar target/converter-1.0.3-jar-with-dependencies.jar -h
Option                            Description              
------                            -----------              
-a <annotation dir>               annotation directory       
--ae <annotation file extension>  annotation file extension  
-d                                whether to detag text      
-h                                help                       
-i <input mode>                   input mode ("event-nugget")
-o <output dir>                   output directory           
-t <text dir>                     text directory             
--te <text file extension>        text file extension        

Token File Maker

Requirements of the file maker

The software requires Java 1.8. A precompiled jar is located in the bin directory. To compile the project from source, you will also need Maven 2.7+.

Prerequisites

Our tokenizer implementation is based on the tokenizer in the Stanford CoreNLP tool. The software is implemented in Java, and its requirements are as follows:

  1. Java 1.8
  2. The same number of text files and brat annotation files (*.ann) with the same file base name

Usage

java -jar bin/token-file-maker-1.0.3-jar-with-dependencies.jar -a <annotation> -e <extension> [-h] -o <output> [-s <separator>] -t <text>
    -a <annotation>   annotation directory
    -e <extension>    text file extension
    -h                print this message
    -o <output>       output directory
    -s <separator>    separator chars for tokenization
    -t <text>         text directory

Tokenization table files format

These are tab-delimited files that map the tokens to their positions in the tokenized (tkn) files. A mapping table contains 4 columns for each row, and the rows contain an ordered listing of the document's tokens. The columns are:

  • token_id: A string of "t" followed by a token-number beginning at 0
  • token_str: The literal string of a given-token
  • tkn_begin: Index of the token's first character in the tkn file
  • tkn_end: Index of the token's last character in the tkn file

Please note that all 4 fields are required and will be used:

  • The converter will use token_id, tkn_begin, tkn_end to convert characters to token id
  • The scorer will use the token_str to detect invisible words

The tokenization table files are created using our automatic tool, which wraps the Stanford tokenizer and provides boundary checks.
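The table format above can be read with a few lines of Python. The sketch below is not the converter's code. It makes several assumptions: offsets sit in columns 2 and 3 (the documented default for "-of"), tkn_end is inclusive (following "last character" in the column description), and a header row, if present, is detected by a non-numeric begin column. The names `Token`, `read_token_table` and `tokens_in_span`, and the sample rows, are made up for illustration.

```python
import csv
import io
from typing import Iterable, List, NamedTuple, Tuple


class Token(NamedTuple):
    token_id: str
    token_str: str
    begin: int
    end: int  # assumed inclusive: index of the token's last character


def read_token_table(lines: Iterable[str],
                     offset_fields: Tuple[int, int] = (2, 3)) -> List[Token]:
    """Read a tab-delimited token table, skipping blank and header rows."""
    begin_col, end_col = offset_fields
    tokens = []
    # QUOTE_NONE keeps quote characters inside tokens as literal text.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= max(begin_col, end_col):
            continue  # blank or short row
        if not row[begin_col].isdigit():
            continue  # assumed header row
        tokens.append(Token(row[0], row[1], int(row[begin_col]), int(row[end_col])))
    return tokens


def tokens_in_span(tokens: List[Token], begin: int, end: int) -> List[str]:
    """IDs of tokens overlapping the inclusive character range [begin, end]."""
    return [t.token_id for t in tokens if t.begin <= end and t.end >= begin]


# Made-up table for the text "The bomb exploded".
sample = (
    "token_id\ttoken_str\ttkn_begin\ttkn_end\n"
    "t0\tThe\t0\t2\n"
    "t1\tbomb\t4\t7\n"
    "t2\texploded\t9\t16\n"
)
tokens = read_token_table(io.StringIO(sample))
covered = tokens_in_span(tokens, 4, 16)
```

This mirrors how token_id, tkn_begin and tkn_end are enough to turn a character span into token IDs, while token_str is carried along for checks on the token text.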

visualize.py

The visualization is provided as a mechanism to compare different outputs. It is optional and can be ignored if one is only interested in the scores. This code may be updated frequently. Please refer to the command line "-h" option for detailed instructions.

The visualization code represents mention differences in JSON, which is then passed to Embedded Brat.

Recent changes make visualizing clusters possible by creating an additional JSON object. When enabled, there is a cluster selector on the web page; selecting a cluster hides all other event mentions.

A Note about visualization

The visualization mapping does not fully reflect the scoring process; it is just a means to help compare the data. Note that there are up to 2^k different ways of aligning the mentions, where k is the number of attributes. The input to the visualization system is the most basic mapping (span only). It need not capture the true mapping of mention type or realis status: because several mapping options are identical in the span-only mapping, the visualization system simply chooses whichever comes first.

Text Based Visualization

The text-based visualization can be generated with "scorer.py" by supplying the "-d" argument. The format is straightforward: a text document is produced for comparison. The annotations of both systems are displayed on one line, separated by "|".

Web Based Visualization

The web-based visualization takes the text visualization file, then:

  1. Converts it to the Brat Embedded JSON format and stores it in the visualization folder (visualization/json)
  2. Starts a server at the visualization folder, served at localhost:8000
  3. The user can now browse the locally hosted site for comparison
  4. The user can stop the server when done and restart it at any time using "start.sh"; it is not necessary to regenerate the JSON data if one only wishes to use the old data

Usage

usage: visualize.py [-h] -d COMPARISON_OUTPUT -t TOKENPATH [-x TEXT]
                [-v VISUALIZATION_HTML_PATH] [-of OFFSET_FIELD]
                [-te TOKEN_TABLE_EXTENSION] [-se SOURCE_FILE_EXTENSION]

Mention visualizer, will create a side-by-side embedded visualization from the mapping

Required Arguments:
  -d COMPARISON_OUTPUT, --comparison_output COMPARISON_OUTPUT
                        The comparison output file between system and gold,
                        used to recover the mapping
  -t TOKENPATH, --tokenPath TOKENPATH
                        Path to the directory containing the token mappings
                        file
Optional Arguments:
  -h, --help            show this help message and exit                    
  -x TEXT, --text TEXT  Path to the directory containing the original text
  -v VISUALIZATION_HTML_PATH, --visualization_html_path VISUALIZATION_HTML_PATH
                        The Path to find visualization web pages, default path
                        is [visualization]
  -of OFFSET_FIELD, --offset_field OFFSET_FIELD
                        A pair of integer indicates which column we should
                        read the offset in the token mapping file, index
                        startsat 0, default value will be [2, 3]
  -te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
                        any extension appended after docid of token table
                        files. Default is [.txt.tab]
  -se SOURCE_FILE_EXTENSION, --source_file_extension SOURCE_FILE_EXTENSION
                        any extension appended after docid of source
                        files.Default is [.tkn.txt]