biolemmatizer: A Java repository from chanokim

    Colorado Computational Pharmacology, University of Colorado School of Medicine  October 22, 2013



The BioLemmatizer is a lemmatization tool for the morphological analysis of 
biomedical literature. It is tailored to the biological domain through 
integration of several published lexical resources related to molecular 
biology. It focuses on the inflectional morphology of English, including the 
plural forms of nouns, the conjugations of verbs, and the comparative and 
superlative forms of adjectives and adverbs. The BioLemmatizer retrieves lemmas 
using a lexicon that covers an exhaustive list of inflected word forms and 
their corresponding lemmas in both general English and the biomedical domain, 
together with a set of rules that generalize morphological transformations 
to heuristically handle words that are not found in the lexicon.
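
The lexicon-first, rules-second strategy described above can be illustrated 
with a minimal sketch. The class name, the three lexicon entries and the two 
suffix rules are hypothetical stand-ins chosen for illustration only; they are 
not the BioLemmatizer's actual lexicon, rules or API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of lexicon lookup with a rule-based fallback (not the BioLemmatizer API). */
final class ToyLemmatizer {
    // Hypothetical mini-lexicon keyed by "form<TAB>POS".
    private final Map<String, String> lexicon = new HashMap<String, String>();

    ToyLemmatizer() {
        lexicon.put("mice\tNNS", "mouse");
        lexicon.put("catalyses\tNNS", "catalysis");
        lexicon.put("embryos\tNNS", "embryo");
    }

    /** Returns a lemma for a word form and its Penn Treebank POS tag. */
    String lemmatize(String form, String pos) {
        // Step 1: exact lookup in the lexicon.
        String lemma = lexicon.get(form + "\t" + pos);
        if (lemma != null) {
            return lemma;
        }
        // Step 2: fall back to simple, hypothetical suffix-detachment rules.
        if ("NNS".equals(pos) && form.endsWith("s") && form.length() > 2) {
            return form.substring(0, form.length() - 1); // phenotypes -> phenotype
        }
        if ("VBD".equals(pos) && form.endsWith("ated")) {
            return form.substring(0, form.length() - 1); // quantitated -> quantitate
        }
        // Step 3: no rule applies, so the form is returned unchanged.
        return form;
    }

    public static void main(String[] args) {
        ToyLemmatizer lemmatizer = new ToyLemmatizer();
        System.out.println(lemmatizer.lemmatize("mice", "NNS"));
        System.out.println(lemmatizer.lemmatize("phenotypes", "NNS"));
        System.out.println(lemmatizer.lemmatize("quantitated", "VBD"));
    }
}
```

Irregular forms such as "mice" can only be resolved by the lexicon, while 
regular forms missing from it fall through to the rules.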

This directory contains the software developed by Haibin Liu 
<Haibin.Liu@ucdenver.edu>, William A Baumgartner Jr <William.Baumgartner@ucdenver.edu> 
and Karin Verspoor <Karin.Verspoor@ucdenver.edu>. The BioLemmatizer is developed 
in Java and is released as open source software to the NLP and text mining 
research communities to be used for research purposes only (see section 9 below 
for copyright information). It can be downloaded via http://biolemmatizer.sourceforge.net. 
If you make any changes, the authors would appreciate it if you could send them the 
details of what you have done. A Perl module for the BioLemmatizer, 
Lingua::EN::BioLemmatizer, was developed by Tom Christiansen <tchrist@perl.com> and 
released on CPAN at http://search.cpan.org/perldoc?Lingua::EN::BioLemmatizer

Note: The BioLemmatizer code requires Java version 6 or greater.

1. Files and Folders
---------------------

  README.txt                          this file
  
  biolemmatizer-1.2.tar.gz            the source code, resources, and license for the BioLemmatizer
  
  biolemmatizer-core-1.2-jar-with-dependencies.jar
                                      Jar file for the biolemmatizer-core module, including all
                                      required dependencies
  
  lexicon.lex.gz                      contains the full lexicon used by the BioLemmatizer
  
  biolemmatizer-eval-datasets.tar.gz  contains all the experimental datasets (CRAFT, OED, LLL), 
                                      and the gold and silver annotations used for testing the
                                      BioLemmatizer (see section 8 for detailed description)                                                                               
     

2. Usage
--------

Set the MAVEN_OPTS environment variable to provide the JVM enough memory to load 
the lexicon file (this command only needs to be executed once):
  export MAVEN_OPTS="-Xmx1G"
  
Lemmatize one single input string:
  mvn -f biolemmatizer-core/pom.xml exec:java -Dexec.mainClass="edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer" -Dexec.args="<input string> [POS tag]"

Lemmatize input strings in a file, output lemmas to a different file:
  mvn -f biolemmatizer-core/pom.xml exec:java -Dexec.mainClass="edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer" -Dexec.args="-i <input file name> -o <output file name>"

Run the BioLemmatizer in interactive mode, i.e. lemmatize input strings from standard input (exit when an empty line is used as input):
  mvn -f biolemmatizer-core/pom.xml exec:java -Dexec.mainClass="edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer" -Dexec.args="-t"

Input parameter descriptions:
  -f VAL  :    optional path to a lexicon file. If not set, the default lexicon 
               available on the classpath is used
	 
  -l      :    By default, the BioLemmatizer output contains the resulting 
               lemma, the POS tag of the input string and the tagset name of the POS tag. 
               The option -l returns only the lemma and ignores other information.
               
  -a      :    to invoke the Americanization process that normalizes common British English 
               spellings into American English spellings, and retrieves corresponding lemmas.
               This is achieved based on a mapping list and some deterministic rules.
               For instance: the lemma of "haemangioblastoma" will be "hemangioblastoma".        
 
  POS tag :    The POS tag associated with the input string. 
               It is optional and is expected to follow the Penn Treebank tagset. 

  -i VAL  :    to specify the input file name 
  -o VAL  :    to specify the output file name 
  -t      :    to invoke the interactive mode. With this mode, the BioLemmatizer can be easily
               integrated into applications written in other languages, such as Perl. To exit
               the interactive mode enter a blank line.                    

See the following sections for specifications of input and output formats, and examples of usage.
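
When the BioLemmatizer is launched from another Java program, the options 
above translate into a command line. The sketch below only assembles the 
argument list. The class and method names are hypothetical, and the jar name 
is taken from the file list in section 1 and assumed to be in the working 
directory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a BioLemmatizer command line from the documented options. */
final class BioLemmatizerCommand {
    // Jar name as listed in section 1; adjust the path to your installation (assumption).
    static final String JAR = "biolemmatizer-core-1.2-jar-with-dependencies.jar";

    /** Batch mode: -i <input file> -o <output file>, optionally preceded by -l and -a. */
    static List<String> batch(String inputFile, String outputFile,
                              boolean lemmaOnly, boolean americanize) {
        List<String> cmd = base(lemmaOnly, americanize);
        cmd.add("-i");
        cmd.add(inputFile);
        cmd.add("-o");
        cmd.add(outputFile);
        return cmd;
    }

    /** Interactive mode: -t, optionally preceded by -l and -a. */
    static List<String> interactive(boolean lemmaOnly, boolean americanize) {
        List<String> cmd = base(lemmaOnly, americanize);
        cmd.add("-t");
        return cmd;
    }

    // Common prefix: the JVM invocation with 1 GB of heap, then the optional flags.
    private static List<String> base(boolean lemmaOnly, boolean americanize) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-Xmx1G");
        cmd.add("-jar");
        cmd.add(JAR);
        if (lemmaOnly) {
            cmd.add("-l");
        }
        if (americanize) {
            cmd.add("-a");
        }
        return cmd;
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder to start the 
tool as a child process.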



3. BioLemmatizer Input Specification
-------------------------------------
The BioLemmatizer can be run to lemmatize a single input string or a batch 
of strings submitted in an input file.

Character encoding for all input is assumed to be UTF-8.

(a) Each input token is expected to be of the form <input string> [POS tag]. For example:

          roles NNS or quantitated VBD

The POS tag associated with the input string is expected to follow the widely 
used Penn Treebank tagset. The POS information is optional. When it is not 
given in the input, the BioLemmatizer returns lemmas for all possible parts 
of speech, in terms of both POS tagsets (NUPOS and Penn Treebank tagsets) 
represented in the lexicon. Our assumption is that without knowing the word 
context, the lemmatizer should return all possible lemmas and allow the user 
or calling application to resolve the ambiguities. 

(b) Each input file is expected to be in the lemmatization format with or without blank lines.

The lemmatization format requires 2 fields:

* FORM: input string 
* POSTAG: POS tag 

Each field is delimited by a tab character ('\t'). Each sentence is delimited 
by a blank line. The POS tag is expected to follow the Penn Treebank tagset. 
As with single-string input, the POS information is optional. For example:

Bmp7	NN
knockout	NN
mice	NNS
do	VBP
not	RB
show	VB
any	DT
defect	NN
in	IN
limb	NN
polarity	NN
.	.

Bmp2	NN
mutant	NN
embryos	NNS
die	VBP
too	RB
early	RB
to	TO
assess	VB
their	PRP$
limb	NN
phenotypes	NNS
.	.
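
For applications that prepare or inspect such files, the format above can be 
read with a few lines of Java. This is a minimal sketch; the class, the nested 
Token type and the method name are hypothetical and not part of the 
BioLemmatizer distribution.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the tab-delimited lemmatization input format. */
final class LemmatizationInput {
    /** One token: the input string and its optional POS tag (null when absent). */
    static final class Token {
        final String form;
        final String posTag;

        Token(String form, String posTag) {
            this.form = form;
            this.posTag = posTag;
        }
    }

    /** Groups lines into sentences; a blank line closes the current sentence. */
    static List<List<Token>> parse(List<String> lines) {
        List<List<Token>> sentences = new ArrayList<List<Token>>();
        List<Token> current = new ArrayList<Token>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<Token>();
                }
                continue;
            }
            // FORM and POSTAG are separated by a single tab; POSTAG may be missing.
            String[] fields = line.split("\t", -1);
            String pos = (fields.length > 1 && !fields[1].isEmpty()) ? fields[1] : null;
            current.add(new Token(fields[0], pos));
        }
        if (!current.isEmpty()) {
            sentences.add(current); // last sentence without a trailing blank line
        }
        return sentences;
    }
}
```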



4. BioLemmatizer Output Specification
--------------------------------------
By default, the BioLemmatizer output consists of the resulting lemma, the POS 
tag of the input string and the tagset name of the POS tag. For example, for 
the input "quantitated VBD", the BioLemmatizer produces "quantitate VBD 
PennPOS". If the POS information is not provided in the input, the 
BioLemmatizer returns lemmas for all possible parts of speech across all POS 
tagsets, separated by a separator "||". For example, for the input 
"diminished", the output is "diminish VBD PennPOS||diminished JJ PennPOS".

BioLemmatizer output is encoded using UTF-8.

The option -l is provided to have the BioLemmatizer return only the lemma in 
the output. With the option -l, the output for the above examples would be 
"quantitate" and "diminish||diminished".
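
A calling application usually needs to split the default output back into its 
parts. The sketch below does this for the "lemma POS tagset" triples joined by 
"||". The class and field names are hypothetical, and it assumes that lemmas 
contain no whitespace.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the default BioLemmatizer output line. */
final class LemmaOutputParser {
    /** One analysis: lemma, POS tag, and the tagset the tag belongs to. */
    static final class Entry {
        final String lemma;
        final String posTag;
        final String tagset;

        Entry(String lemma, String posTag, String tagset) {
            this.lemma = lemma;
            this.posTag = posTag;
            this.tagset = tagset;
        }
    }

    /** Splits the output on "||" and each part on whitespace into three fields. */
    static List<Entry> parse(String output) {
        List<Entry> entries = new ArrayList<Entry>();
        for (String part : output.trim().split("\\|\\|")) {
            String[] fields = part.trim().split("\\s+");
            if (fields.length != 3) {
                throw new IllegalArgumentException(
                        "Expected 'lemma POS tagset' but got: " + part);
            }
            entries.add(new Entry(fields[0], fields[1], fields[2]));
        }
        return entries;
    }
}
```

Output produced with the option -l contains only lemmas, so splitting on "||" 
alone is sufficient in that case.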

If the input is a file, the resulting lemma is inserted as a new field in the 
output file, delimited by a tab character ('\t'). For example:

Bmp7	NN	Bmp7
knockout	NN	knockout
mice	NNS	mouse
do	VBP	do
not	RB	not
show	VB	show
any	DT	any
defect	NN	defect
in	IN	in
limb	NN	limb
polarity	NN	polarity
.	.	.

Bmp2	NN	Bmp2
mutant	NN	mutant
embryos	NNS	embryo
die	VBP	die
too	RB	too
early	RB	early
to	TO	to
assess	VB	assess
their	PRP$	their
limb	NN	limb
phenotypes	NNS	phenotype
.	.	.
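
The relationship between the input file and the output file can be sketched as 
a simple line-by-line transformation. The LemmaSource interface below is a 
hypothetical stand-in for whatever supplies the lemma, and the handling of 
lines without a POS tag is an assumption, since the example above only shows 
tagged input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of how the lemma column is added to each input line. */
final class LemmaColumnWriter {
    /** Stand-in for the component that produces a lemma for a form and POS tag. */
    interface LemmaSource {
        String lemmaFor(String form, String posTag);
    }

    /** Appends the lemma as a new tab-delimited field; blank lines are copied unchanged. */
    static List<String> appendLemmas(List<String> inputLines, LemmaSource source) {
        List<String> out = new ArrayList<String>();
        for (String line : inputLines) {
            if (line.trim().isEmpty()) {
                out.add(line); // sentence boundary is preserved
                continue;
            }
            String[] fields = line.split("\t", -1);
            // Assumption: a missing POS tag is passed on as an empty string.
            String pos = fields.length > 1 ? fields[1] : "";
            out.add(line + "\t" + source.lemmaFor(fields[0], pos));
        }
        return out;
    }
}
```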



5. Usage Examples (shown using executable jar available in biolemmatizer-core/target/ directory after the project is built)
---------------------------------------------------------------------------------------------------------------------------

(a) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar catalyses NNS

    =>   catalysis NNS PennPOS

(b) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l catalyses NNS

    =>   catalysis

(c) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar running

    =>   run vvg NUPOS||running JJ PennPOS||run j-vvg NUPOS||run n-vvg NUPOS||running NN PennPOS

(d) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l running

    =>   run||running

(e) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -t
    running
    =>   run vvg NUPOS||running JJ PennPOS||run VBG PennPOS||run j-vvg NUPOS||run n-vvg NUPOS||running NN PennPOS
    catalyses NNS
    =>   catalysis NNS PennPOS

(f) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l -t
    running
    =>   run||running
    catalyses NNS
    =>   catalysis

(g) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -i inputfile -o outputfile

(h) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l -i inputfile -o outputfile


See sections 3 ("BioLemmatizer Input Specification") and 4 ("BioLemmatizer 
Output Specification") above for the format of input and output files.



6. Lexical data from the BioLexicon
----------------------------------------------------
The BioLemmatizer integrates lexical resources from three sources: MorphAdorner, 
the GENIA tagger and the BioLexicon database. The BioLexicon morphological data 
used in the BioLemmatizer is part of the publicly available portion of the 
BioLexicon (the EBI term repository), so we are able to redistribute it in the 
public release of the full version of the BioLemmatizer. For the original 
morphological data in the BioLexicon database, please refer to the BioLexicon 
publication and the download links below.

Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, 
Monachini M, Pezik P, Quochi V, Rupp C, Sasaki Y, Venturi G, Rebholz-Schuhmann D, 
Ananiadou S: The BioLexicon: a large-scale terminological resource for biomedical 
text mining. BMC Bioinformatics 2011, 12:397.

Download link of the EBI term repository of the BioLexicon:
http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html

ELRA link of the full version of the BioLexicon:
http://catalog.elra.info/product_info.php?products_id=1113



7. Performance comparison with/without BioLexicon data
-------------------------------------------------------
Please refer to the following publication for a more detailed performance 
comparison.

Haibin Liu, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor
BioLemmatizer: a lemmatization tool for morphological processing of biomedical text
Journal of Biomedical Semantics 2012, 3:3.

After the experiments reported in the publication, we collected all false positive 
lemmas we encountered, and we have fixed nearly all of them, either by adding an 
entry to the BioLemmatizer lexicon or by modifying the rules of detachment, in some 
cases adding the lexicon validation constraint.

Here we provide the lemmatization results on three of our evaluation datasets to 
highlight the performance difference between the BioLemmatizer with and without 
the BioLexicon data. For comparison, each table also lists the tool that achieved 
the second best performance among the 9 lemmatizers we tested.

Evaluation on silver consensus set of CRAFT
                                  Recall                    Precision                F-score
ExcludeBioLexicon                 99.56% (5836/5862)        99.56% (5836/5862)       99.56%
IncludeBioLexicon                 100% (5862/5862)          100% (5862/5862)         100%
Second best (morpha tool)         100% (5862/5862)          100% (5862/5862)         100%

Evaluation on gold difference set of CRAFT
                                  Recall                    Precision                F-score
ExcludeBioLexicon                 94.30% (546/579)          94.30% (546/579)         94.30%
IncludeBioLexicon                 99.65% (577/579)          99.65% (577/579)         99.65%
Second best (MorphAdorner)        81.87% (474/579)          82.29% (474/576)         82.08%

Evaluation on gold OED set
                                  Recall                    Precision                F-score
ExcludeBioLexicon                 82.55% (667/808)          82.55% (667/808)         82.55%
IncludeBioLexicon                 84.65% (684/808)          84.65% (684/808)         84.65%
Second best (morpha tool)         75.74% (612/808)          75.74% (612/808)         75.74%

On biomedical text (the CRAFT set), the overall lemmatization accuracy of the 
current public release of the BioLemmatizer (the full version, including the 
BioLexicon data) is 99.9%. The version of the BioLexicon database used in our 
experiments is that of May 22nd, 2009.
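
The percentages in the tables above can be reproduced from the counts given in 
parentheses. The sketch below assumes that the denominator in the Recall column 
is the number of gold-standard items, that the denominator in the Precision 
column is the number of lemmas the tool returned, and that the F-score is the 
harmonic mean of the two. The class and method names are hypothetical.

```java
/** Hypothetical helper that recomputes the scores from the raw counts in the tables. */
final class LemmaScores {
    /** Recall in percent: correct lemmas divided by the number of gold-standard items. */
    static double recall(int correct, int goldTotal) {
        return 100.0 * correct / goldTotal;
    }

    /** Precision in percent: correct lemmas divided by the number of lemmas returned. */
    static double precision(int correct, int returnedTotal) {
        return 100.0 * correct / returnedTotal;
    }

    /** F-score: harmonic mean of precision and recall. */
    static double fScore(double precision, double recall) {
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Counts for MorphAdorner on the gold difference set of CRAFT.
        double r = recall(474, 579);
        double p = precision(474, 576);
        System.out.printf("Recall %.2f%%  Precision %.2f%%  F-score %.2f%%%n",
                r, p, fScore(p, r));
    }
}
```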



8. Description of contents of biolemmatizer-eval-datasets.tar.gz
----------------------------------------------------------------

  CRAFT_development_data              subset of the CRAFT corpus, containing 7 full-text articles 
  CRAFT_consensus_silver              consensus set of CRAFT_development_data (excluding adverbs), 
                                      representing agreement among 6 lemmatizers, to form a 
                                      "silver lemma standard"
  CRAFT_difference_gold               gold lemma annotation of the set of disagreements among 9 lemmatizers                                   
  OED_gold                            gold lemma annotation of the OED (Oxford English Dictionary) set
  LLL_gold                            gold lemma annotation of the LLL05 set, curated with automatically 
                                      generated POS information
  LLL_gold_updated                    LLL_gold with fixed annotation on incorrect or inconsistent 
                                      instances and task-specific normalizations 



9. Copyright and License
------------------------------------
The software is released under the New BSD license 
(http://www.opensource.org/licenses/bsd-license.php).

Copyright (c) 2012, Regents of the University of Colorado
 All rights reserved.

 Redistribution and use in source and binary forms, with or without modification, 
 are permitted provided that the following conditions are met:

  * Redistributions of source code must retain the above copyright notice, this 
    list of conditions and the following disclaimer.
   
  * Redistributions in binary form must reproduce the above copyright notice, 
    this list of conditions and the following disclaimer in the documentation 
    and/or other materials provided with the distribution.
   
  * Neither the name of the University of Colorado nor the names of its 
    contributors may be used to endorse or promote products derived from this 
    software without specific prior written permission.

 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 
 ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 
 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 
 DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
 ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 
 (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON 
 ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
 (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 
 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Any documentation, advertising materials, publications and other materials 
related to redistribution and use must acknowledge that the software was 
developed by Haibin Liu <Haibin.Liu@ucdenver.edu>, William A Baumgartner Jr 
<William.Baumgartner@ucdenver.edu> and Karin Verspoor <Karin.Verspoor@ucdenver.edu> 
and must refer to the following publication:

Haibin Liu, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor
BioLemmatizer: a lemmatization tool for morphological processing of biomedical text
Journal of Biomedical Semantics 2012, 3:3.



10. Incorporated software and resources
---------------------------------------
This software incorporates the MorphAdorner software (http://morphadorner.northwestern.edu/), 
lexical resources from the BioLexicon database (http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html)
and the GENIA Tagger (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/). 
We redistribute this software and these resources here.

MorphAdorner license:

The MorphAdorner source code and data files fall under the following NCSA style license. 
Some of the incorporated code and data fall under different licenses as noted in the 
section third-party licenses below.
Copyright (c) 2006-2009 by Northwestern University. 
All rights reserved.
Developed by:
Academic and Research Technologies
Northwestern University
http://www.it.northwestern.edu/about/departments/at/

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the "Software"), to deal with the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the
following conditions:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and
the following disclaimers.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimers in the documentation and/or other materials provided with the
distribution.
3. Neither the names of Academic and Research Technologies, Northwestern University, nor the
names of its contributors may be used to endorse or promote products derived from this
Software without specific prior written permission.

BioLexicon database citation:

Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S, 
Monachini M, Pezik P, Quochi V, Rupp C, Sasaki Y, Venturi G, Rebholz-Schuhmann D, 
Ananiadou S: The BioLexicon: a large-scale terminological resource for biomedical 
text mining. BMC Bioinformatics 2011, 12:397.

ELRA link of the full version of the BioLexicon:
http://catalog.elra.info/product_info.php?products_id=1113

Download link of the EBI term repository of the BioLexicon:
http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html

GENIA Tagger license:

Copyright (c) 2005, Tsujii Laboratory, The University of Tokyo
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted for non-commercial purposes provided
that the following conditions are met:

- Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Since the GENIA Tagger uses a dictionary in WordNet for morphological analysis,
the corresponding WordNet license is also included here.

WordNet Release 2.1

This software and database is being provided to you, the LICENSEE, by
Princeton University under the following license.  By obtaining, using
and/or copying this software and database, you agree that you have
read, understood, and will comply with these terms and conditions.:

Permission to use, copy, modify and distribute this software and
database and its documentation for any purpose and without fee or
royalty is hereby granted, provided that you agree to comply with
the following copyright notice and statements, including the disclaimer,
and that the same appear on ALL copies of the software, database and
documentation, including modifications that you make for internal
use or for distribution.

WordNet 2.1 Copyright 2005 by Princeton University.  All rights reserved.

THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
OTHER RIGHTS.

The name of Princeton University or Princeton may not be used in
advertising or publicity pertaining to distribution of the software
and/or database.  Title to copyright in this software, database and
any associated documentation shall at all times remain with
Princeton University and LICENSEE agrees to preserve same.



11. Acknowledgements
------------------------------------
Many thanks to Professor Lawrence Hunter, Helen Johnson, Kevin B. Cohen, 
and other members of the Colorado Computational Pharmacology group for 
providing valuable effort and suggestions related to this work.