Changes the encoding of CoNLL-03 NER datasets from BIO to BIOLU

BIO-to-BIOLU

The CoNLL 2003 NER dataset is annotated using the BIO labelling scheme. Each token is labelled according to its position relative to a named entity (NE), using the following three markers:

  • B- for the first token of an NE,
  • I- for the remaining tokens inside an NE,
  • O for tokens outside any NE.

A labelling scheme reported to outperform BIO is the BIOLU scheme [Ratinov and Roth, 2009], which adds two markers:

  • L- for the last token of a multi-token NE,
  • U- for unit-length (single-token) NEs.

This Python script converts a BIO-encoded file to BIOLU.
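
The mapping between the two schemes can be sketched for a single sentence as follows. This is an illustration of the rule implied by the marker definitions above, not the code in biolu_encode.py. The function name bio_to_biolu is hypothetical, and the sketch assumes well-formed BIO input in which every NE starts with a B- tag.

```python
def bio_to_biolu(tags):
    """Convert one sentence's BIO tags to BIOLU tags (illustrative sketch)."""
    biolu = []
    for i, tag in enumerate(tags):
        if tag == "O":
            # Tokens outside any NE are unchanged.
            biolu.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        # Treat the end of the sentence as if an O tag followed.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The NE continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # First token: stays B- if more tokens follow, else unit-length.
            biolu.append(("B-" if continues else "U-") + label)
        else:
            # Assumed to be an I- tag: stays I- mid-entity, becomes L- at the end.
            biolu.append(("I-" if continues else "L-") + label)
    return biolu
```

Only the tag that follows each token is needed to decide whether that token closes its NE, so the conversion takes a single pass over the sentence.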

Usage

Run the following on the command line, specifying the path of the original BIO-encoded file and the name of the converted file.

python biolu_encode.py bio_path biolu_path

Tested with Python 3.6.
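
For readers who want to apply a tag conversion to file contents themselves, the sketch below shows one way to do it. It is not taken from biolu_encode.py. It assumes the usual CoNLL-03 layout: one token per line, whitespace-separated columns with the NER tag last, and blank lines between sentences. The names convert_lines and convert_tags are hypothetical; convert_tags stands for any function that maps a sentence's list of BIO tags to a list of BIOLU tags.

```python
def convert_lines(lines, convert_tags):
    """Rewrite the last column of CoNLL-style lines, one sentence at a time."""
    out, sentence = [], []

    def flush():
        # Convert the tags of the buffered sentence and emit its lines.
        if sentence:
            new_tags = convert_tags([cols[-1] for cols in sentence])
            for cols, tag in zip(sentence, new_tags):
                out.append(" ".join(cols[:-1] + [tag]))
            sentence.clear()

    for line in lines:
        cols = line.split()
        if cols:
            sentence.append(cols)
        else:
            # A blank line ends the current sentence.
            flush()
            out.append("")
    flush()  # The last sentence may not be followed by a blank line.
    return out
```

Buffering a whole sentence before converting matters because the new tag of a token depends on the tag of the token after it.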

Examples

eng-biolu.toy is the result of converting eng.toy.