Install package using pip:
pip install text2term
text2term supports mapping strings specified in multiple input formats. In the first example, we map strings in a list to an ontology specified by its URL:
import text2term
dfl = text2term.map_terms(source_terms=["asthma", "acute bronchitis"],
target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
There is also support for file-based input, for example a file containing a list of strings:
dff = text2term.map_terms(source_terms="test/unstruct_terms.txt",
target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
or a table where we can specify the column of terms to map and the table value separator:
dff = text2term.map_terms(source_terms="test/some_table.tsv",
csv_columns=('diseases','optional_ids'), separator="\t",
target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
Finally, it is possible to map strings in a dictionary with associated tags that are preserved in the output:
dfd = text2term.map_terms(source_terms={"asthma":"disease", "acute bronchitis":["disease", "lung"]},
target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
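The dictionary above mixes a single tag given as a string with several tags given as a list. As a minimal sketch, both forms could be brought into one shape before mapping. `normalize_tags` is a hypothetical helper written for this illustration and is not part of text2term:

```python
def normalize_tags(tagged_terms):
    """Return a copy of the mapping in which every value is a list of tags."""
    normalized = {}
    for term, tags in tagged_terms.items():
        if isinstance(tags, str):
            # A bare string is treated as a single tag
            normalized[term] = [tags]
        else:
            normalized[term] = list(tags)
    return normalized


terms = {"asthma": "disease", "acute bronchitis": ["disease", "lung"]}
print(normalize_tags(terms))
```

The result has the same keys as the input, so it can be passed as source_terms in the same way as the original dictionary.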
text2term supports caching an ontology for repeated use. Here we cache an ontology and give it a name:
mondo = text2term.cache_ontology(ontology_url="http://purl.obolibrary.org/obo/mondo.owl",
ontology_acronym="MONDO")
The given name acts as a reference. We can now map strings to the cached ontology by passing that name as target_ontology and setting the flag use_cache=True:
dfc = text2term.map_terms(source_terms=["asthma", "acute bronchitis"],
target_ontology="MONDO", use_cache=True)
More succinctly, we can use the returned OntologyCache object mondo as such:
dfo = mondo.map_terms(source_terms=["asthma", "acute bronchitis"])
To show a help message describing all arguments, type into a terminal:
python text2term --help
The basic use of text2term requires a source file containing the terms to map to a given target ontology:
python text2term -s test/unstruct_terms.txt -t http://purl.obolibrary.org/obo/mondo.owl
Map to a local ontology and specify an output file where the mappings should be saved using -o:
python text2term -s test/unstruct_terms.txt -t test/mondo.owl -o test/mymappings.csv
Set the minimum acceptable similarity score for mapping each given term to an ontology term using -min:
python text2term -s test/unstruct_terms.txt -t test/mondo.owl -min 0.8
Each mapped term returned will have a similarity score of at least 0.8 to its source term, on a scale from 0 to 1.
Exclude deprecated ontology terms (declared as such via owl:deprecated true) using -d:
python text2term -s test/unstruct_terms.txt -t test/mondo.owl -d
Limit search to only terms whose IRIs start with any IRI given in a list specified using -iris:
python text2term -s test/unstruct_terms.txt -t test/mondo.owl -iris http://purl.obolibrary.org/obo/mondo,http://identifiers.org/hgnc
While MONDO uses terms from other ontologies such as CHEBI and Uberon, the tool only considers terms whose IRIs start either with "http://purl.obolibrary.org/obo/mondo" or "http://identifiers.org/hgnc".
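The prefix restriction described above can be sketched in plain Python. This is only an illustration under stated assumptions: the IRIs are hypothetical, the comparison is a case-sensitive `str.startswith` (how text2term itself compares IRIs is not specified here), and `filter_by_base_iris` is a made-up helper name:

```python
def filter_by_base_iris(term_iris, base_iris):
    """Keep only IRIs that start with at least one of the given base IRIs."""
    prefixes = tuple(base_iris)
    if not prefixes:
        # Assumption: no base IRIs means no restriction
        return list(term_iris)
    # str.startswith accepts a tuple and matches any of its members
    return [iri for iri in term_iris if iri.startswith(prefixes)]


# Hypothetical IRIs, for illustration only
iris = [
    "http://example.org/onto_a/0001",
    "http://example.org/onto_b/0002",
    "http://example.org/onto_c/0003",
]
kept = filter_by_base_iris(
    iris, ("http://example.org/onto_a", "http://example.org/onto_b")
)
print(kept)
```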
Cache an ontology for repeated use by running the tool while instructing it to cache the ontology via -c <name>:
python text2term -s test/unstruct_terms.txt -t http://purl.obolibrary.org/obo/mondo.owl -c MONDO
Now the ontology is cached and we can refer to it as the target ontology using the name given beforehand:
python text2term -s test/unstruct_terms.txt -t MONDO
After installing text2term and importing it into a Python environment, the main function is map_terms():
text2term.map_terms(source_terms, # strings to map or pointer to file
target_ontology='mondo', # ontology to map to (URL or name)
max_mappings=3, # maximum mappings per string
mapper=Mapper.TFIDF, # mapping method to be used
min_score=0.3, # minimum mapping score
base_iris=(), # map to terms with given base IRIs
excl_deprecated=False, # exclude ontology deprecated terms
term_type='class', # ontology term type(s) to map to
save_graphs=False, # save vis.js file with term graphs
save_mappings=False, # save mappings to file or not
output_file='', # filepath of output mappings file
csv_columns=(), # table columns with strings and IDs
separator=',', # column separator of input table
use_cache=False, # use a locally cached ontology
incl_unmapped=False, # include unmapped strings in output
bioportal_apikey='') # API key to use the BioPortal mapper
The function returns a pandas DataFrame containing the generated ontology mappings.
source_terms
—Strings to be mapped to an ontology, which can be specified as a:
- list of strings
- string containing a file path
- dictionary of terms and associated tags, where each key is a term and the value is a list of tags
- list of TaggedTerm objects
  - Tags do not affect the mapping; they are simply added to the output dataframe
  - If a term is tagged with "Ignore", text2term will not map it
  - Unmapped terms can still be included in the output if incl_unmapped is True
target_ontology
—Path, URL or name of 'target' ontology to map the source terms to. Ontology names can be given as values to target_ontology
e.g. "EFO" or "CL"—text2term uses bioregistry to get URLs for such names. Similarly, when the target ontology has been cached, enter the name used upon caching.
When using the BioPortal or Zooma interfaces, the value for target_ontology
should be a comma-separated list of ontology acronyms (eg 'EFO,HPO') or 'all' to search all ontologies.
base_iris
—Map only to ontology terms whose IRIs start with one of the strings given in this tuple
excl_deprecated
—Exclude ontology terms stated as deprecated via owl:deprecated true
source_terms_ids
—Collection of identifiers for the given source terms
csv_columns
—Specify the name of the column containing the terms to map, when the input file is a table. Optionally provide a second column name, containing the respective term identifiers
separator
—Character that separates columns when input is a table (eg '\t' for TSV)
mapper
—Method used to compare source terms with ontology terms. One of levenshtein, jaro, jarowinkler, jaccard, fuzzy, tfidf, zooma, bioportal
(see Supported Mappers)
max_mappings
—Maximum number of top-ranked mappings returned per source term
min_score
—Minimum similarity score [0,1] for the mappings (1=exact match)
save_mappings
—Save the generated mappings to a file (specified by output_file
)
output_file
—Path to desired output file for the mappings dataframe
save_graphs
—Save vis.js graphs representing the neighborhood of each ontology term
use_cache
—Use the cache for the ontology
term_type
—Specifies whether to map to ontology classes, properties or both. One of class, property, any
incl_unmapped
—Include unmapped terms in the output. If a term has been tagged 'Ignore' or has less than the min_score
, it is included in the output data frame
bioportal_apikey
—BioPortal API Key to use along with the BioPortal mapper option
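The min_score and max_mappings parameters together bound what is returned for each source term. The sketch below shows one plausible reading: candidates scoring below min_score are dropped, and the highest-scoring max_mappings of the rest are kept. The inclusive comparison, the helper name `select_mappings`, and the candidate labels and scores are all assumptions made for illustration; the default values are taken from the signature above.

```python
def select_mappings(candidates, min_score=0.3, max_mappings=3):
    """Pick the top-ranked candidates for one source term.

    candidates is a list of (ontology_term_label, score) pairs.
    """
    # Assumption: a score equal to min_score is accepted
    accepted = [c for c in candidates if c[1] >= min_score]
    accepted.sort(key=lambda c: c[1], reverse=True)
    return accepted[:max_mappings]


# Made-up labels and scores, for illustration only
candidates = [
    ("term A", 0.91),
    ("term B", 0.25),
    ("term C", 0.55),
    ("term D", 0.42),
    ("term E", 0.38),
]
print(select_mappings(candidates))
print(select_mappings(candidates, min_score=0.5))
```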
text2term supports caching ontologies for faster or repeated mapping to the same ontology. An ontology can be cached using the function:
text2term.cache_ontology(ontology_url, ontology_acronym="", base_iris=())
This caches a single ontology from a URL or file path, and takes an optional acronym that will be used to reference the cached ontology later. If no acronym is given, the URL is used as the name.
It is also possible to cache multiple ontologies, whose names and URLs are specified in a table formatted as acronym,version,url. An example is provided in resources/ontologies.csv:
text2term.cache_ontology_set(ontology_registry_path)
Once an ontology has been cached by either function, it is stored locally in a cache folder and can therefore be referenced even from different Python instances. Users can leverage the cache by using the assigned acronym as the value for the target_ontology argument and setting the use_cache argument to True.
To clear the ontology cache, the following function can be used:
text2term.clear_cache(ontology_acronym='')
If no arguments are specified, the entire cache will be cleared. Otherwise, only the ontology with the given acronym will be cleared.
Finally, cache_exists(ontology_acronym='') is a simple function that returns True if the given acronym exists in the cache, and False otherwise.
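The caching functions above can be combined so that an ontology is cached only when its acronym is not already present. The following is a minimal sketch, not part of text2term: `ensure_cached` is a hypothetical helper, and it receives the imported text2term module (or any object offering cache_exists and cache_ontology with the signatures shown above) as its first argument.

```python
def ensure_cached(t2t, ontology_url, ontology_acronym):
    """Cache the ontology under the given acronym unless it is already cached.

    Returns True if caching was triggered, False if the acronym already existed.
    """
    if t2t.cache_exists(ontology_acronym):
        return False  # already cached, nothing to do
    t2t.cache_ontology(ontology_url, ontology_acronym=ontology_acronym)
    return True
```

With text2term installed and imported, this could be called as ensure_cached(text2term, "http://purl.obolibrary.org/obo/mondo.owl", "MONDO"), after which "MONDO" can be used as target_ontology together with use_cache=True.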
Note
The cache_ontology function returns an object that can be used to directly call the map_terms function, as well as clear_cache and cache_exists. These have the same arguments, except that target_ontology is no longer specified and there is no use_cache option, since it is always True.
Caution
While ontology URLs can be repeatedly used, acronyms must be distinct in a given environment.
text2term includes a module preprocess.py that supports regular expression-based preprocessing of input terms. Its functions take the input terms and a collection of (user-defined) regular expressions, then match each term against each regular expression to simplify the input term.
preprocess.preprocess_terms(terms, template_path, output_file='', blocklist_path='',
blocklist_char='', rem_duplicates=False)
This returns a dictionary where the keys are the original terms and the values are the preprocessed terms.
preprocess.preprocess_tagged_terms(file_path, template_path='', blocklist_path='',
blocklist_char='', rem_duplicates=False, separator=';:;')
This returns a list of TaggedTerm objects.
The regex templates file template_path and the blocklist blocklist_path must each be a newline-separated file. If an output file is specified, the preprocessed strings are written to that file.
The blocklist functionality allows specifying another file of regular expressions; any term that matches a regex in the blocklist is removed from the list of terms to map. Alternatively, if a blocklist character is specified, the matching input is replaced with that character instead.
The rem_duplicates option, if set to True, removes all duplicate terms after processing.
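To make the template, blocklist and duplicate-removal behavior described above more concrete, here is a rough, self-contained sketch. It does not call preprocess.py and may differ from it in detail. Assumptions made for this illustration: each template contains exactly one capture group whose match becomes the simplified term, patterns must match the whole term, and the first matching template wins. The function name and sample terms are invented.

```python
import re


def preprocess_terms_sketch(terms, templates, blocklist=(),
                            blocklist_char="", rem_duplicates=False):
    """Return {original term: simplified term} using regex templates and a blocklist."""
    template_res = [re.compile(t) for t in templates]
    blocklist_res = [re.compile(b) for b in blocklist]
    processed = {}
    seen = set()
    for term in terms:
        if any(b.fullmatch(term) for b in blocklist_res):
            if not blocklist_char:
                continue  # blocked terms are dropped
            simplified = blocklist_char  # or replaced by the given character
        else:
            simplified = term
            for template in template_res:
                match = template.fullmatch(term)
                if match:
                    # Assumption: the first capture group holds the simplified term
                    simplified = match.group(1)
                    break
        if rem_duplicates and simplified in seen:
            continue  # skip terms whose simplified form was already produced
        seen.add(simplified)
        processed[term] = simplified
    return processed


# Invented sample input, for illustration only
terms = ["history of asthma", "asthma", "unknown", "age: 45"]
templates = [r"history of (.+)"]
blocklist = [r"unknown", r"age: \d+"]
result = preprocess_terms_sketch(terms, templates, blocklist, rem_duplicates=True)
print(result)
```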
When the input to text2term is a table, any rows that contain NA values in the specified term column, or in the term ID column (if provided), will be ignored.
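The row-skipping rule can be illustrated with the standard csv module. This is a sketch under assumptions, not the tool's actual reader: which cell values count as NA is assumed here to be the empty string and the literal "NA", and the column names, identifiers and helper name `read_terms` are made up for the example.

```python
import csv
import io

NA_VALUES = {"", "NA"}  # assumption: what counts as a missing value


def read_terms(table_text, term_column, id_column=None, separator=","):
    """Return (term, term_id) pairs, skipping rows with a missing term or ID."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=separator)
    rows = []
    for row in reader:
        term = (row.get(term_column) or "").strip()
        term_id = (row.get(id_column) or "").strip() if id_column else None
        if term in NA_VALUES or (id_column and term_id in NA_VALUES):
            continue  # ignore rows with NA in the term or term ID column
        rows.append((term, term_id))
    return rows


# Made-up table content, for illustration only
table = "diseases\toptional_ids\nasthma\tID:1\nNA\tID:2\nacute bronchitis\t\n"
print(read_terms(table, "diseases", "optional_ids", separator="\t"))
```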
If an ignore tag, "ignore" or "Ignore", is added to a term, that term will not be mapped to any terms in the ontology. It will only be included in the output if the incl_unmapped argument is True.
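As a small illustration of the ignore tags, the sketch below separates tagged terms into those that would be mapped and those carrying an ignore tag. `split_ignored` is a hypothetical helper, and the sample terms are invented:

```python
IGNORE_TAGS = {"ignore", "Ignore"}


def split_ignored(tagged_terms):
    """Split {term: tags} into terms to map and terms carrying an ignore tag."""
    to_map, ignored = {}, {}
    for term, tags in tagged_terms.items():
        # Accept a single tag given as a string as well as a list of tags
        tag_list = [tags] if isinstance(tags, str) else list(tags)
        target = ignored if IGNORE_TAGS.intersection(tag_list) else to_map
        target[term] = tag_list
    return to_map, ignored


to_map, ignored = split_ignored(
    {"asthma": ["disease"], "not recorded": ["Ignore"], "n/a": "ignore"}
)
print(to_map)
print(ignored)
```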
After installing, execute the tool from a command line as follows:
python text2term [-h] -s SOURCE -t TARGET [-o OUTPUT] [-m MAPPER] [-csv CSV_INPUT] [-sep SEPARATOR] [-top TOP_MAPPINGS] [-min MIN_SCORE] [-iris BASE_IRIS] [-d] [-g] [-c STORE_IN_CACHE] [-type TERM_TYPE] [-u] [-bp BIOPORTAL_APIKEY]
To display a help message with descriptions of tool arguments, run:
python text2term -h
or python text2term --help
-s SOURCE
Input file containing 'source' terms to map to ontology terms (list of terms or CSV file)
-t TARGET
Path or URL of 'target' ontology to map source terms to. When the chosen mapper is BioPortal or Zooma, provide a comma-separated list of acronyms (eg 'EFO,HPO') or write 'all' to search all ontologies
-o OUTPUT
Path to desired output file for the mappings
-m MAPPER
Method used to compare source terms with ontology terms. One of: levenshtein, jaro, jarowinkler, jaccard, indel, fuzzy, tfidf, zooma, bioportal
-csv CSV_INPUT
Indicates a CSV format input—follow with the name of the column containing terms to map, optionally followed by the name of the column containing identifiers for the terms (eg 'my terms,my term ids')
-sep SEPARATOR
Specifies the cell separator to be used when reading a table
-top TOP_MAPPINGS
Maximum number of top-ranked mappings returned per source term
-min MIN_SCORE
Minimum similarity score [0,1] for the mappings (1=exact match)
-iris BASE_IRIS
Map only to ontology terms whose IRIs start with a value given in this comma-separated list (eg 'http://www.ebi.ac.uk/efo,http://purl.obolibrary.org/obo/HP')
-d
Exclude ontology terms stated as deprecated via owl:deprecated true
-g
Save vis.js graphs representing the neighborhood of each ontology term
-c STORE_IN_CACHE
Cache the target ontology using the name given here
-type TERM_TYPE
Specify whether to map to ontology classes, properties, or both
-u
Include all unmapped terms in the output
-bp BIOPORTAL_APIKEY
BioPortal API Key to use along with the BioPortal mapper option
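When the tool is driven from a script, the flags listed above can be assembled into an argument list. The following sketch covers only a few of the flags (-s, -t, -o, -min, -iris, -d); `build_command` is a hypothetical helper, not something shipped with text2term.

```python
import shlex


def build_command(source, target, output=None, min_score=None,
                  base_iris=(), excl_deprecated=False):
    """Assemble the argument list for a text2term command-line run."""
    cmd = ["python", "text2term", "-s", source, "-t", target]
    if output:
        cmd += ["-o", output]
    if min_score is not None:
        cmd += ["-min", str(min_score)]
    if base_iris:
        # -iris takes a single comma-separated value
        cmd += ["-iris", ",".join(base_iris)]
    if excl_deprecated:
        cmd.append("-d")
    return cmd


cmd = build_command("test/unstruct_terms.txt", "test/mondo.owl",
                    output="test/mymappings.csv", min_score=0.8,
                    excl_deprecated=True)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting issues with paths or IRIs.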
The mapping score of each mapping indicates how similar an input term is to an ontology term (via its labels or synonyms). The mapping scores generated by text2term are the result of applying one of the following mappers:
TF-IDF-based mapper—TF-IDF is a statistical measure often used in information retrieval that measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the cosine similarity between vectors to determine how similar a source term is to a target term (label or synonym).
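The TF-IDF and cosine similarity steps described above can be sketched in plain Python. This is a simplified illustration, not the implementation used by text2term: the character trigram tokenization, the smoothed IDF formula and all function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams (assumed tokenization)."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Build one sparse TF-IDF vector (a dict) per document."""
    counts_per_doc = [Counter(char_ngrams(d, n)) for d in documents]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    total = len(documents)
    vectors = []
    for counts in counts_per_doc:
        # Smoothed inverse document frequency; one of several common variants
        vectors.append({
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


labels = ["asthma", "acute bronchitis", "chronic bronchitis"]
source = "bronchitis acute"
vectors = tfidf_vectors(labels + [source])
scores = {label: cosine(vectors[-1], vec) for label, vec in zip(labels, vectors)}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy input the source term shares the most trigrams with "acute bronchitis", so that label receives the highest score, while "asthma" shares none and scores 0.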
Syntactic distance-based mappers—text2term provides support for commonly used and popular syntactic (edit) distance metrics: Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel. We use the nltk package to compute Jaccard distances and rapidfuzz to compute all others.
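For intuition about the edit distance family, here is a plain Python sketch of Levenshtein distance and a similarity score derived from it. It is independent of rapidfuzz and nltk, and the normalization used (one minus the distance divided by the longer string's length) is an assumption for illustration rather than a statement of how text2term computes its scores.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to a score in [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("asthma", "asthma"))
```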
BioPortal Web API-based mapper—uses an interface to the BioPortal Annotator that we built to allow mapping terms in bulk to ontologies in the BioPortal repository.
Warning
There are no scores associated with BioPortal annotations, so the score of all mappings is always 1.
Zooma Web API-based mapper—uses a Zooma interface that we built to allow mapping terms in bulk to ontologies in the Ontology Lookup Service (OLS) repository.
Important
When using the BioPortal or Zooma interfaces, make sure to specify the target ontology name(s) as they appear in BioPortal or OLS, respectively.
Note
Syntactic distance-based mappers and Web API-based mappers are much slower than the TF-IDF mapper. The former are slow because they perform pairwise comparisons between each input string and each ontology term label or synonym; the latter incur networking and API load overheads.