20110422 MALLET-EVAL PROJECT

GENERAL

This is a project for evaluating MALLET (MAchine Learning for 
LanguagE Toolkit). MALLET's binaries and source code are not included; 
you can obtain them from this site:

    http://mallet.cs.umass.edu/

This distribution contains only sample annotation data and scripts for 
converting, importing, and evaluating. The articles in the two corpora are
not included here for copyright reasons, which is why you need their CDs 
to build the complete data sets.

We provide two sample corpora: the Penn Treebank Sample (a 5% fragment of
the Penn Treebank) and the HIT CIR LTP Corpora Sample (a 10% fragment of
the full Corpora):

    http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
    http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm


BUILDING THE TRAIN AND TEST DATA FILES

In order to obtain the data files and run the evaluation you need to
perform four steps:

   1. Get a local copy of the mallet-eval repository with this command:
       
       hg clone https://mallet-eval.googlecode.com/hg/ mallet-eval 

   2. Set the $MALLET_HOME environment variable: export MALLET_HOME=/path/to/mallet/

   3. Train and test with the provided Chunking, POS Tagging and Named Entity 
      Recognition data (chunking/, pos-tagging/, ner/); a sketch of the 
      MALLET commands follows the results below.

   4a. (Chunking) ./conlleval < chunking/conlleval.out

   4b. (POS-Tagging) cd pos-tagging && ./verify.py

   4c. (Named Entity Recognition) ./chn-conlleval < ner/conlleval.out

For the chunking task (4a), the results are:

    processed 47377 tokens with 23852 phrases; found: 23682 phrases; correct: 21441.
    accuracy:  93.97%; precision:  90.54%; recall:  89.89%; FB1:  90.21
             ADJP: precision:  72.35%; recall:  63.93%; FB1:  67.88  387
             ADVP: precision:  78.61%; recall:  75.98%; FB1:  77.28  837
            CONJP: precision:  40.00%; recall:  44.44%; FB1:  42.11  10
             INTJ: precision:  50.00%; recall:  50.00%; FB1:  50.00  2
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  2
               NP: precision:  90.05%; recall:  89.57%; FB1:  89.81  12355
               PP: precision:  94.97%; recall:  96.88%; FB1:  95.92  4908
              PRT: precision:  71.84%; recall:  69.81%; FB1:  70.81  103
             SBAR: precision:  89.01%; recall:  78.69%; FB1:  83.53  473
               VP: precision:  91.55%; recall:  90.51%; FB1:  91.03  4605
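
As mentioned in step 3, training and testing use MALLET's CRF sequence
tagger. A minimal, hedged sketch of what this might look like for the
chunking task (the file names chunking/train.txt, chunking/test.txt and
chunking/model are hypothetical; substitute the actual files shipped in
the chunking/ directory):

    # Train a CRF on the chunking training data. SimpleTagger expects
    # one token per line, features first and the label last, with
    # blank lines between sentences.
    java -cp "$MALLET_HOME/class:$MALLET_HOME/lib/mallet-deps.jar" \
        cc.mallet.fst.SimpleTagger --train true \
        --model-file chunking/model chunking/train.txt

    # Tag the test data with the trained model. The predicted labels
    # then have to be merged with the words and the gold tags to build
    # the input that conlleval expects.
    java -cp "$MALLET_HOME/class:$MALLET_HOME/lib/mallet-deps.jar" \
        cc.mallet.fst.SimpleTagger \
        --model-file chunking/model chunking/test.txt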

DATA FORMAT

The data files contain one word per line. Empty lines have been used
for marking sentence boundaries and a line containing the keyword
-DOCSTART- has been added to the beginning of each article in order
to mark article boundaries. Each non-empty line contains the following 
tokens:

   1. the current word
   2. the lemma of the word (German only)
   3. the part-of-speech (POS) tag generated by a tagger
   4. the chunk tag generated by a text chunker
   5. the named entity tag given by human annotators
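
For illustration, a fragment of English data in this format looks as
follows (this is the standard CoNLL-2003 example sentence, not a line
from this distribution; English files have no lemma column):

    U.N.      NNP  I-NP  I-ORG
    official  NN   I-NP  O
    Ekeus     NNP  I-NP  I-PER
    heads     VBZ  I-VP  O
    for       IN   I-PP  O
    Baghdad   NNP  I-NP  I-LOC
    .         .    O     O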

The tagger and chunker for English are roughly similar to the 
ones used in the memory-based shallow parser demo available at 
http://ilk.uvt.nl/. German POS and chunk information has been 
generated by the TreeTagger from the University of Stuttgart:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
In order to simulate a real natural language processing 
environment, the POS tags and chunk tags have not been checked. 
This means that they will contain errors. If you have access to 
annotation software with superior performance, you may replace 
these tags with your own.

The chunk tags and the named entity tags use the IOB1 format. This 
means that, in general, words inside an entity receive the tag I-TYPE
to denote that they are Inside an entity of type TYPE. Whenever
two entities of the same type immediately follow each other, the 
first word of the second entity receives the tag B-TYPE rather than
I-TYPE in order to show that a new entity starts at that word.
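
The boundary rule above is easy to get wrong when scoring phrases, so
here is a minimal Python sketch (our illustration, not a script from
this distribution) that recovers entity spans from a list of IOB1 tags:

    def iob1_spans(tags):
        """Return (start, end, type) entity spans for IOB1 tags.

        In IOB1 an entity normally starts with I-TYPE; B-TYPE appears
        only when an entity of the same type immediately follows
        another one.
        """
        spans, start, etype = [], None, None
        for i, tag in enumerate(tags):
            prefix, _, cur = tag.partition("-")
            # A new entity begins on B-TYPE, or on I-TYPE when we are
            # not already inside an entity of that type.
            begins = prefix == "B" or (prefix == "I" and etype != cur)
            if start is not None and (tag == "O" or begins):
                spans.append((start, i, etype))  # close the open span
                start, etype = None, None
            if begins:
                start, etype = i, cur
        if start is not None:
            spans.append((start, len(tags), etype))
        return spans

    # Two adjacent PER entities: the second starts with B-PER in IOB1.
    print(iob1_spans(["I-PER", "I-PER", "B-PER", "O", "I-LOC"]))
    # -> [(0, 2, 'PER'), (2, 3, 'PER'), (4, 5, 'LOC')]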

The raw data has the same format as the training and test material,
except that the final column has been omitted. There are word lists for 
English (extracted from the training data), German (extracted from 
the training data), and Dutch in the lists directory. You can probably 
use the Dutch person names (PER) for the English data as well. Feel 
free to use any other external data sources that you might have 
access to.


Max Lv <lch@fudan.edu.cn>