Clone the project. Note that the --recursive flag also clones the necessary submodules.
git clone --recursive git@github.com:danlamanna/i2b2tools.git
After cloning, run:
cd i2b2tools
pip install -r requirements.txt
python setup.py install
This will install the dependencies as well.
i2b2tools uses the standard unittest module.
cd i2b2tools/tests
python tests.py
Copyright 2015 Dan LaManna
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Welcome! Make sure the changes you are contributing belong to this codebase and not to a git submodule, particularly the lib/standoff_annotations directory, which is present in this repository.
This is an example of an XML file that can be represented using a StandoffAnnotation object:
<deIdi2b2>
<TEXT><![CDATA[
New York Hospital
123 Main St.
New York, New York
Date: 2/20/2015
Patient John Doe presented on Friday with chest pains, etc.
John Smith, M.D.
]]></TEXT>
<TAGS>
<LOCATION id="P1" start="3" end="20" text="New York Hospital" TYPE="HOSPITAL" comment="" />
<LOCATION id="P2" start="23" end="34" text="123 Main St" TYPE="STREET" comment="" />
<LOCATION id="P3" start="38" end="46" text="New York" TYPE="CITY" comment="" />
<LOCATION id="P4" start="48" end="56" text="New York" TYPE="STATE" comment="" />
<DATE id="P5" start="68" end="77" text="2/20/2015" TYPE="DATE" comment="" />
<NAME id="P6" start="89" end="97" text="John Doe" TYPE="PATIENT" comment="" />
<NAME id="P7" start="144" end="154" text="John Smith" TYPE="DOCTOR" comment="" />
</TAGS>
</deIdi2b2>
StandoffAnnotations provide many helpers, such as access to a tokenizer for the document, altering the document text or PHI, and re-saving the file to disk. Mainly, however, they are what the evaluation scripts operate on, which makes it easy to determine the precision, recall, and F1 measure of a set of annotations.
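To make the offset convention concrete, here is a minimal sketch that reads a small standoff document using only the Python standard library. It does not use i2b2tools itself, and the helper name read_standoff is hypothetical. It assumes that start and end are character offsets into the TEXT element, with end exclusive.

```python
import xml.etree.ElementTree as ET

DOC = """<deIdi2b2>
<TEXT><![CDATA[Oh hey there Jeff. How are you doing today, 2/21/2015?]]></TEXT>
<TAGS>
<NAME id="P1" start="13" end="17" text="Jeff" TYPE="NAME" comment=""/>
<DATE id="P2" start="44" end="53" text="2/21/2015" TYPE="DATE" comment=""/>
</TAGS>
</deIdi2b2>"""


def read_standoff(xml_string):
    """Return (text, tags); tags is a list of attribute dicts (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    text = root.find("TEXT").text
    tags = []
    for element in root.find("TAGS"):
        attrs = dict(element.attrib)
        attrs["name"] = element.tag          # e.g. NAME or DATE
        attrs["start"] = int(attrs["start"])
        attrs["end"] = int(attrs["end"])
        tags.append(attrs)
    return text, tags


text, tags = read_standoff(DOC)
for tag in tags:
    # Each tag's offsets should slice out exactly its recorded text.
    assert text[tag["start"]:tag["end"]] == tag["text"]
```

Because the tags live apart from the text, any change to the text, including whitespace, shifts every offset that follows it.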
This toolset offers a series of libraries which can help with evaluating performance, tokenizing text, and altering StandoffAnnotations in a systematic way.
In addition there are many helpers which assist in working with StandoffAnnotations by finding specific PHI, editing PHI, and looking at your documents more closely.
Finally, converters help in converting a StandoffAnnotation to an InlineAnnotation, and vice versa.
All of the libraries live under the i2b2tools.lib submodule, so for example:
from i2b2tools.lib import StandoffAnnotation
will import the StandoffAnnotation object.
For additional information, see the documentation on the author’s page, i2b2_evaluation_scripts.
By default, the regular expression for tokenizing is (\w+). If you want “/” not to break up a token, you can change the tokenizer regular expression like so:
from i2b2tools.lib import TokenSequence
import re
TokenSequence.tokenizer_re = re.compile(r'([\w/]+)')
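The effect of that change can be checked with the standard re module alone. The sketch below does not touch TokenSequence; it only compares what the two patterns match on a sample line.

```python
import re

default_re = re.compile(r'(\w+)')
slash_re = re.compile(r'([\w/]+)')

line = "Date: 2/20/2015"
default_tokens = default_re.findall(line)
slash_tokens = slash_re.findall(line)

print(default_tokens)  # ['Date', '2', '20', '2015']
print(slash_tokens)    # ['Date', '2/20/2015']
```

With the default pattern the date is split into three tokens; allowing “/” keeps it as one.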
Rules are the backbone of postprocessors. A postprocessor applies rules to a group of StandoffAnnotations so you can compare the F1 measure before and after.
Ultimately a rule gets access to the StandoffAnnotation it needs to alter in some way, such as deleting PHI, editing PHI, etc. It does so by way of an action, and the action gets access to a target. See Built-in rules.
Every rule has a function which supplies a list of targets. For example, if you wanted to create a rule that could mark every token matching a regular expression as PHI, your targets function would probably return the output of re.findall.
The action looks at a single target and does something to it. In the example of marking a token matching a regular expression as PHI, you would delete any PHI presently at the point of the target, and re-create it. (There is already a built-in RegexRule which does exactly that.)
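The split between targets and action can be sketched in plain Python. The class below is a toy stand-in, not the i2b2tools rule base class: the name ToyRegexRule, the method signatures, and the use of dicts for PHI are all assumptions made for illustration.

```python
import re


class ToyRegexRule:
    """Hypothetical rule: mark every regex match as PHI with a given name."""

    def __init__(self, pattern, name):
        self.pattern = re.compile(pattern)
        self.name = name

    def targets(self, text):
        # One target per match: the (start, end) span of the first group.
        return [m.span(1) for m in self.pattern.finditer(text)]

    def action(self, phi, target):
        start, end = target
        # Drop any PHI overlapping the target, then re-create it.
        kept = [p for p in phi if p["end"] <= start or p["start"] >= end]
        kept.append({"name": self.name, "start": start, "end": end})
        return sorted(kept, key=lambda p: p["start"])


text = "Patient John Doe saw Dr. John Smith."
phi = [{"name": "DATE", "start": 8, "end": 12}]  # a wrong tag over "John"
rule = ToyRegexRule(r"(John)", "NAME")
for target in rule.targets(text):
    phi = rule.action(phi, target)
```

Here targets returns spans rather than the strings from re.findall, so that the action knows where in the document to act.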
The base PostProcessor can be used as is, so let’s see an example.
We want to mark all instances of John as a Person, and see how it improves our score.
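from i2b2tools.lib import PostProcessor
# RegexRule and NameTag must be imported as well; system_sas and gold_sas are
# assumed to hold the system and gold StandoffAnnotations being compared.
p = PostProcessor(system_sas, gold_sas, [(RegexRule, ["(John)", "NAME", "PERSON", NameTag])])
# run our rule(s)...
p.process()
# see how the F1 measure changed
p.summary() # .59 -> .71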
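For readers unfamiliar with the measures being compared, here is a minimal sketch of precision, recall, and F1 over sets of annotations. It assumes exact matching on (start, end, name) tuples; the evaluation scripts used by i2b2tools may match differently, and the function name is hypothetical.

```python
def precision_recall_f1(system, gold):
    """Score two sets of (start, end, name) tuples by exact match."""
    true_positives = float(len(system & gold))
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(13, 17, "NAME"), (44, 53, "DATE")}
before = {(44, 53, "DATE"), (0, 2, "NAME")}    # one correct, one spurious
after = {(44, 53, "DATE"), (13, 17, "NAME")}   # both correct

print(precision_recall_f1(before, gold))  # (0.5, 0.5, 0.5)
print(precision_recall_f1(after, gold))   # (1.0, 1.0, 1.0)
```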
This takes a regular expression and the tag its matches should be marked with. For example, to mark all instances of John/john as a person:
RegexRule, ["([Jj]ohn)", "NAME", "PERSON", NameTag]
The regex needs to conform to match_group, meaning the part of the regex that needs to be marked corresponds to a matching group in the regex.
For example, suppose we have dates such as this:
<DATE>10/5/2015</DATE>
But in fact, we only want our PHI to match “10/5”, so we can trim it using a RemoveRegexRule as follows:
RemoveRegexRule, [r"\d{1,2}\/\d{1,2}(/\d{2,4})"], 0
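To see which part of the date that pattern singles out, the capturing group can be inspected with the standard re module. This sketch only illustrates the regular expression; it does not show how RemoveRegexRule itself uses the group or the trailing 0.

```python
import re

pattern = re.compile(r"\d{1,2}\/\d{1,2}(/\d{2,4})")
text = "10/5/2015"

match = pattern.search(text)
group_start, group_end = match.span(1)
# Cutting the captured group out of the text leaves the part we want to keep.
trimmed = text[:group_start] + text[group_end:]

print(match.group(1))  # /2015
print(trimmed)         # 10/5
```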
This merges multiple PHI into one based on a predicate function.
A good example is using helpers.predicates._trigram_name_predicate to solve an issue such as:
<NAME>Edgar</NAME> Allan <NAME>Poe</NAME>
This could be rectified as:
<NAME>Edgar Allan Poe</NAME>
Using a merge rule such as:
MergeRule, [3, "NAME", "POET", NameTag, _trigram_name_predicate]
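The idea of merging on a predicate can be sketched without i2b2tools. Everything below is hypothetical: the predicate one_word_gap merely stands in for _trigram_name_predicate, whose actual logic is not shown here, and PHI are represented as plain dicts.

```python
import re


def one_word_gap(text, left, right):
    """Hypothetical predicate: exactly one capitalized word lies between."""
    gap = text[left["end"]:right["start"]]
    return re.fullmatch(r"\s+[A-Z][a-z]+\s+", gap) is not None


def merge_adjacent(text, phi, predicate):
    """Merge neighbouring PHI of the same name when the predicate allows it."""
    merged = []
    for current in sorted(phi, key=lambda p: p["start"]):
        previous = merged[-1] if merged else None
        if (previous and previous["name"] == current["name"]
                and predicate(text, previous, current)):
            previous["end"] = current["end"]   # extend the earlier PHI
        else:
            merged.append(dict(current))
    return merged


text = "Edgar Allan Poe"
phi = [{"name": "NAME", "start": 0, "end": 5},
       {"name": "NAME", "start": 12, "end": 15}]
merged = merge_adjacent(text, phi, one_word_gap)
print(merged)  # [{'name': 'NAME', 'start': 0, 'end': 15}]
```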
Determines if a given file would constitute a valid StandoffAnnotation. It returns False if the file doesn’t exist or if it contains invalid XML.
Determines if a given StandoffAnnotation has any PHI that overlap.
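An overlap check of this kind can be sketched over plain (start, end) pairs. This is an illustration of the idea rather than the helper's implementation; the function name has_overlap is hypothetical, and end offsets are assumed to be exclusive.

```python
def has_overlap(spans):
    """Return True if any two (start, end) spans share a character."""
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return True
    return False


print(has_overlap([(13, 17), (44, 53)]))  # False
print(has_overlap([(13, 17), (15, 20)]))  # True
```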
Returns a dictionary in the format of:
{"id": <StandoffAnnotation>}
This is determined by finding all filenames within dirname that pass is_valid_sa_file.
Returns a list of PHI that are present at a given offset in a StandoffAnnotation.
For example, given the following document, denoted as sa:
<deIdi2b2>
<TEXT><![CDATA[Oh hey there Jeff. How are you doing today, 2/21/2015?]]></TEXT>
<TAGS>
<NAME id="P1" start="13" end="17" text="Jeff" TYPE="NAME" comment=""/>
<DATE id="P2" start="44" end="53" text="2/21/2015" TYPE="DATE" comment=""/>
</TAGS>
</deIdi2b2>
phi_at_offset(sa, 14)
would yield the following:
[<NameTag: NAME, 13, 17, NAME s:13 e:17>]
Using our sa above, we can find all PHI within a range.
phi_within_range(sa, 17, 44)
would yield:
[<NameTag: NAME, 13, 17, NAME s:13 e:17>,
<DateTag: DATE, 44, 53, DATE s:44 e:53>]
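Both lookups can be approximated over plain tuples. The sketch below is not the i2b2tools implementation, and the function names are hypothetical. The boundary handling is an assumption inferred from the output above, where a range of 17 to 44 returns a tag ending at 17 and a tag starting at 44, so both ends are treated as inclusive.

```python
def spans_at_offset(spans, offset):
    """Spans (start, end, name) that cover the given offset."""
    return [s for s in spans if s[0] <= offset <= s[1]]


def spans_within_range(spans, start, end):
    """Spans that touch the closed range [start, end]."""
    return [s for s in spans if s[0] <= end and s[1] >= start]


spans = [(13, 17, "NAME"), (44, 53, "DATE")]

print(spans_at_offset(spans, 14))         # [(13, 17, 'NAME')]
print(spans_within_range(spans, 17, 44))  # [(13, 17, 'NAME'), (44, 53, 'DATE')]
print(spans_within_range(spans, 20, 40))  # []
```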
Allows filtering of PHI on a StandoffAnnotation based on a dictionary of attributes.
For example:
sa_filter_by_phi_attrs(sa, {"name": "DATE", "TYPE": "YEAR"})
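As a rough picture of attribute filtering, the sketch below filters a list of PHI represented as dicts. It assumes that every key in the attribute dictionary must match; whether sa_filter_by_phi_attrs behaves exactly this way is not confirmed here, and filter_by_attrs is a hypothetical name.

```python
def filter_by_attrs(phi, attrs):
    """Keep PHI dicts whose attributes match every key/value in attrs."""
    return [p for p in phi
            if all(p.get(key) == value for key, value in attrs.items())]


phi = [
    {"name": "DATE", "TYPE": "YEAR", "text": "2015"},
    {"name": "DATE", "TYPE": "DATE", "text": "2/21/2015"},
    {"name": "NAME", "TYPE": "DOCTOR", "text": "John Smith"},
]
years = filter_by_attrs(phi, {"name": "DATE", "TYPE": "YEAR"})
print(years)  # [{'name': 'DATE', 'TYPE': 'YEAR', 'text': '2015'}]
```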
Provides a “sliding window” of n tokens from a token sequence.
For instance, if your token sequence were:
foo bar baz.
n_tokens with an n value of 2 would yield:
[(<Token ''>, <Token 'foo'>),
(<Token 'foo'>, <Token 'bar'>),
(<Token 'bar'>, <Token 'baz'>)]
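The same windowing can be reproduced over plain strings. This sketch assumes, based on the output above, that the sequence is padded at the front with empty tokens so that every real token ends exactly one window; sliding_windows is a hypothetical name and strings stand in for Token objects.

```python
def sliding_windows(tokens, n, pad=""):
    """Return one n-tuple per token, padded at the front with `pad`."""
    tokens = list(tokens)
    padded = [pad] * (n - 1) + tokens
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]


windows = sliding_windows(["foo", "bar", "baz"], 2)
print(windows)  # [('', 'foo'), ('foo', 'bar'), ('bar', 'baz')]
```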
Returns a list of tuples containing each token in a token sequence of the document, and the PHI tag associated with that token, if any. This does not support StandoffAnnotations with overlapping PHI.
This function mutates its input: it calls StandoffAnnotation.save, which will attempt to overwrite the file on disk.
So if PHI that had a name of DATE were actually supposed to have a name of PHONE, you could perform this operation on a StandoffAnnotation:
remap_sa_attributes(sa, {"name": "DATE"}, {"name": "PHONE"})
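The remapping itself amounts to a match-and-update pass. The sketch below works on PHI represented as dicts and never writes to disk, unlike the real function; remap_attrs is a hypothetical name.

```python
def remap_attrs(phi, match, replacement):
    """Update, in place, every PHI dict matching all of `match`."""
    changed = 0
    for p in phi:
        if all(p.get(key) == value for key, value in match.items()):
            p.update(replacement)
            changed += 1
    return changed


phi = [
    {"name": "DATE", "text": "555-1234"},
    {"name": "NAME", "text": "Jeff"},
]
count = remap_attrs(phi, {"name": "DATE"}, {"name": "PHONE"})
print(count)           # 1
print(phi[0]["name"])  # PHONE
```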
Converters are one of the most helpful parts of i2b2tools. It is imperative that each format can be converted back and forth without anything being lost in translation (especially whitespace), because character offsets are vital to the format.
If you create a converter, submit a pull request to get it added.
Looking at our initial document, this is what it would look like after being converted to an inline document:
<ROOT>
<HOSPITAL>New York Hospital</HOSPITAL>
<STREET>123 Main St</STREET>.
<CITY>New York</CITY>, <STATE>New York</STATE>
Date: <DATE>2/20/2015</DATE>
Patient <PATIENT>John Doe</PATIENT> presented on Friday with chest pains, etc.
<DOCTOR>John Smith</DOCTOR>, M.D.
</ROOT>
This is useful because it resembles the output of certain classifiers, namely Carafe and Stanford NER.
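The standoff-to-inline direction can be sketched in a few lines. This is not the i2b2tools converter: to_inline is a hypothetical name, and the sketch assumes non-overlapping spans and performs no XML escaping. The label in each tuple is whatever should become the element name.

```python
def to_inline(text, tags):
    """Wrap each (start, end, label) span of text in <label>...</label>."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(tags):
        pieces.append(text[cursor:start])   # untouched text before the span
        pieces.append("<%s>%s</%s>" % (label, text[start:end], label))
        cursor = end
    pieces.append(text[cursor:])            # untouched text after the last span
    return "<ROOT>" + "".join(pieces) + "</ROOT>"


text = "Oh hey there Jeff. How are you doing today, 2/21/2015?"
tags = [(13, 17, "NAME"), (44, 53, "DATE")]
inline = to_inline(text, tags)
print(inline)
```

All text outside the spans is copied through unchanged, which is what keeps whitespace intact.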
For completeness’ sake, this is an example of an inline annotation converted to a standoff annotation:
<ROOT>
Record date: <DATE>2013-08-19</DATE>
Patient Name: <PATIENT>GOLDBERG, RUBE</PATIENT> [MRN: <MEDICALRECORD>12345</MEDICALRECORD>]
The <AGE>44</AGE> year old presented with things, and stuff.
<DOCTOR>Foo J. Bar</DOCTOR>
</ROOT>
<deIdi2b2>
<TEXT><![CDATA[
Record date: 2013-08-19
Patient Name: GOLDBERG, RUBE [MRN: 12345]
The 44 year old presented with things, and stuff.
Foo J. Bar
]]></TEXT>
<TAGS>
<DATE TYPE="DATE" comment="" end="26" id="P0" start="16" text="2013-08-19"/>
<NAME TYPE="PATIENT" comment="" end="58" id="P1" start="44" text="GOLDBERG, RUBE"/>
<ID TYPE="MEDICALRECORD" comment="" end="70" id="P2" start="65" text="12345"/>
<AGE TYPE="AGE" comment="" end="81" id="P3" start="79" text="44"/>
<NAME TYPE="DOCTOR" comment="" end="138" id="P4" start="128" text="Foo J. Bar"/>
</TAGS>
</deIdi2b2>
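The inline-to-standoff direction comes down to stripping the tags while tracking how long the plain text has grown. The sketch below is an illustration and not the i2b2tools converter: from_inline is a hypothetical name, and it assumes flat tags with no nesting, no attributes, and no escaped entities.

```python
import re

TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def from_inline(inline):
    """Return (text, tags) with tags as (start, end, label) offsets."""
    body = re.sub(r"^<ROOT>|</ROOT>$", "", inline)
    pieces, tags = [], []
    cursor = 0   # position in the inline string
    length = 0   # length of the plain text built so far
    for match in TAG_RE.finditer(body):
        before = body[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        label, content = match.group(1), match.group(2)
        tags.append((length, length + len(content), label))
        pieces.append(content)
        length += len(content)
        cursor = match.end()
    pieces.append(body[cursor:])
    return "".join(pieces), tags


inline = "<ROOT>Record date: <DATE>2013-08-19</DATE>\nThe <AGE>44</AGE> year old.</ROOT>"
text, tags = from_inline(inline)
print(tags)  # [(13, 23, 'DATE'), (28, 30, 'AGE')]
```

The offsets differ from those in the example above because this sample omits the leading whitespace of the original document.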