MLEE: A repository from Anthony9624

This package contains the data of the MLEE (Multi-Level Event Extraction)
corpus, version 1.0.2 (revision 1).

This README provides a brief overview of the package contents. See the
LICENSE file included in the package for the data license, the
manuscript referenced at the bottom of this file for an introduction
of the corpus, and the project homepage

    http://www.nactem.ac.uk/MLEE/

for data visualizations, supplementary data and more information.


CONTENTS


This package contains the following:

* README:       this file
* LICENSE:      licenses of the texts and annotations
* standoff:     corpus data in standoff format (all annotations)
* conll:        corpus data in CoNLL format (entity annotations only)

Both the standoff/ and conll/ directories contain the following
subdirectories:

* development:  development split of the data, excluding the test set
* test:         test split of the data, including all data
* full:         full corpus data

Each of the development/ and test/ directories further contains the
following:

* train:        training data for development/final test
* test:         test data for development/final test

The format and suggested use of the files contained in these
directories are explained below.


FORMAT

The corpus data is provided in two formats: BioNLP Shared Task-style
standoff format, and CoNLL shared task-style BIO-format.


Standoff format

The data in the standoff/ directory are provided in the standoff
format used by the brat annotation tool (http://brat.nlplab.org/). For
details of the format, see the documentation page
http://brat.nlplab.org/standoff.html

For the full corpus data in standoff/full/, all standoff annotations
for a single text file are provided in a single file (.ann). For the
data in standoff/development/ and standoff/test/, the annotations are
split into entity annotations (.a1) and event annotations (.a2). This
is intended to facilitate event extraction experiments where entity
annotations are provided as part of the input.
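
As an illustration of the standoff format, the following is a minimal
sketch of a reader for text-bound ("T") and event ("E") annotation
lines, following the brat standoff documentation referenced above. The
function name parse_standoff and the sample lines are hypothetical and
are not taken from the corpus; other annotation kinds are skipped.

```python
def parse_standoff(lines):
    """Parse brat standoff lines into entity and event dictionaries."""
    entities, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Text-bound annotation: "T1<TAB>Type start end<TAB>text"
            type_, offsets = fields[1].split(" ", 1)
            # Discontinuous spans are written "s1 e1;s2 e2"
            spans = [tuple(int(x) for x in part.split())
                     for part in offsets.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = {"type": type_, "spans": spans, "text": text}
        elif ann_id.startswith("E"):
            # Event: "E1<TAB>Type:Trigger Role1:Arg1 Role2:Arg2 ..."
            parts = fields[1].split()
            type_, trigger = parts[0].split(":", 1)
            args = [tuple(p.split(":", 1)) for p in parts[1:]]
            events[ann_id] = {"type": type_, "trigger": trigger, "args": args}
        # Any other annotation kind is ignored in this sketch.
    return entities, events


# Illustrative lines only (made up for this example).
sample = [
    "T1\tGene_or_gene_product 0 4\tVEGF",
    "T2\tPositive_regulation 5 12\tinduces",
    "T3\tCell 13 30\tendothelial cells",
    "E1\tPositive_regulation:T2 Cause:T1 Theme:T3",
]
entities, events = parse_standoff(sample)
```

The same function can be applied to the lines of an .ann file, or to
the concatenated lines of an .a1 and .a2 pair for the same text.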


CoNLL format

The data in the conll/ directory is provided in the column-formatted
BIO representation used in many reference resources for mention
detection such as that of the CoNLL shared tasks (see
e.g. http://www.cnts.ua.ac.be/conll2002/ner/). 

Each line contains four TAB-separated columns: token text, start
offset, end offset, and tag. Each tag consists of one of the letters B,
I or O (for "begin", "in", and "out"), and the type of the entity for
the B and I tags. (The offsets into the source text are provided for
reference and can be ignored for most applications.)

The entity mention detection task is to learn to predict the tags
(last column) given the token texts (first column).
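
The four-column layout described above could be read along the
following lines. This is a sketch under two assumptions that are not
stated in this README: that sentences are separated by blank lines,
and that tags join the letter and the entity type with a hyphen
(e.g. "B-Cell"). The function names and the sample lines are
hypothetical.

```python
def read_conll(lines):
    """Group (token, start, end, tag) tuples into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumed: a blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, start, end, tag = line.split("\t")
        current.append((token, int(start), int(end), tag))
    if current:
        sentences.append(current)
    return sentences


def bio_to_mentions(sentence):
    """Convert BIO tags into (type, start offset, end offset) mentions."""
    mentions, current = [], None
    for _token, start, end, tag in sentence:
        # Assumed tag shape: "B-Type", "I-Type" or "O".
        label, _, etype = tag.partition("-")
        starts_new = label == "B" or (
            label == "I" and (current is None or current[0] != etype)
        )
        if starts_new:
            if current is not None:
                mentions.append(tuple(current))
            current = [etype, start, end]
        elif label == "I":
            current[2] = end  # extend the open mention
        else:
            if current is not None:
                mentions.append(tuple(current))
            current = None
    if current is not None:
        mentions.append(tuple(current))
    return mentions


# Illustrative lines only (made up for this example).
sample = [
    "VEGF\t0\t4\tB-Gene_or_gene_product",
    "induces\t5\t12\tO",
    "endothelial\t13\t24\tB-Cell",
    "cells\t25\t30\tI-Cell",
    "",
]
sentences = read_conll(sample)
mentions = bio_to_mentions(sentences[0])
```

An I tag that does not continue a mention of the same type is treated
here as the start of a new mention; stricter handling may be preferable
for some applications.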


EVALUATION

The corpus is intended to serve as an evaluation standard. The
proposed approach to method development and evaluation is to use the
test/ data only for final evaluation after completing method
development and parameter selection.

PLEASE NOTE: the data in the development/ and test/ directories are
not separate: the development/ data is a split of the test/train/
data.
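
The directory layout described under CONTENTS can be mapped to the
suggested protocol with a small helper such as the sketch below. The
function name split_dirs and the root directory name "MLEE-1.0.2" are
hypothetical; only the subdirectory names come from this README.

```python
from pathlib import Path


def split_dirs(root, fmt="standoff", phase="development"):
    """Return the (train, test) directories for a format and phase."""
    if fmt not in ("standoff", "conll"):
        raise ValueError("fmt must be 'standoff' or 'conll'")
    if phase not in ("development", "test"):
        raise ValueError("phase must be 'development' or 'test'")
    base = Path(root) / fmt / phase
    return base / "train", base / "test"


# During method development and parameter selection:
dev_train, dev_test = split_dirs("MLEE-1.0.2", "conll", "development")

# For the final evaluation only:
final_train, final_test = split_dirs("MLEE-1.0.2", "conll", "test")
```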


CONTACT

For any queries relating to the corpus, please contact Sampo Pyysalo
<sampo.pyysalo@gmail.com>


CHANGELOG

* 1.0.2 (11.09.2012): first public release 


REFERENCES

The corpus is presented in the following manuscript.

* Sampo Pyysalo, Tomoko Ohta, Makoto Miwa, Han-Cheol Cho, Jun'ichi
  Tsujii and Sophia Ananiadou (2012). Event extraction across multiple
  levels of biological organization. Bioinformatics 28(18):i575-i581.

The project page is located at http://www.nactem.ac.uk/MLEE/