The NCBO Resource Index project processes a variety of biomedical resources (i.e., collections of documents or data) and generates annotations using classes from BioPortal.
In addition to annotations, the Resource Index is capable of generating a co-occurrence matrix for labels and classes found in BioPortal.
- Ruby 2.x
- Elasticsearch 1.x
- mgrep (available on the NCBO Virtual Appliance)
- MySQL (used for defining individual resources, their metadata, and normalized versions of their documents/data)
git clone https://github.com/ncbo/resource_index.git
cd resource_index
bundle
The Resource Index consists of two parts: 1) Population and 2) Deployment.
For populating, a ResourceIndex::Population::Manager object can be instantiated by passing in a resource object along with configuration options. Configuration options and their defaults are detailed in lib/ncbo_resource_index/population/population.rb.
Here is a basic sample script that configures and runs a population job:
require 'resource_index'
# this mysql db should contain resources and their data
RI.config(username: "user", password: "pass", host: "localhost")
res = RI::Resource.find("PM")
populator = RI::Population::Manager.new(res,
  { annotator_redis_host: "redis1",
    mgrep_host: "mgrep1",
    goo_host: "4store1",
    goo_port: 8080,
    population_threads: 4,
    mail_recipients: "user@example.org" })
populator.populate(delete_old: true)
During the population process, the addresses listed in mail_recipients will receive an email if the process encounters an error or finishes.
Resources come from a variety of sources and are processed through ETL using a Resource Access Tool.
- Introspect resource layout (column data) and create corresponding Elasticsearch document mapping
- Annotate documents using mgrep and the annotator cache
- Store documents and annotations in Elasticsearch
- Annotations are stored as follows:
  - Direct (set of int)
  - Ancestors (set of int)
  - Class ids are hashes of the ontology acronym and class URI created with xxhash
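The class id scheme described in the list above can be sketched roughly as follows. This is only an illustration: SHA-256 from the Ruby standard library stands in for xxhash so the snippet runs without extra gems, and the helper name `class_id`, the way the acronym and URI are joined, and the example.org URIs are all assumptions rather than the project's actual code.

```ruby
require 'digest'

# Hypothetical helper: derive a signed 64-bit integer id from an ontology
# acronym and a class URI. SHA-256 is a stand-in here (the Resource Index
# uses xxhash), and the plain concatenation of the two strings is an assumption.
def class_id(acronym, class_uri)
  digest = Digest::SHA256.digest("#{acronym}#{class_uri}")
  # Read the first 8 bytes as a big-endian signed integer so the value
  # fits an Elasticsearch "long" field.
  digest[0, 8].unpack("q>").first
end

# One annotation entry: ids of directly matched classes and of their ancestors.
direct     = [class_id("EX", "http://example.org/onto#Melanoma")]
ancestors  = [class_id("EX", "http://example.org/onto#Neoplasm")]
annotation = { direct: direct.uniq, ancestors: ancestors.uniq }
```

Storing integers instead of full URIs keeps the nested annotation fields compact, at the cost of needing the acronym and URI to recompute an id when querying.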
Currently, updates are not supported.
In the future, it might be possible to update the Resource Index incrementally: take the new labels from the dictionary generated by the annotator, search Elasticsearch for them, and update only the documents that contain hits. The theory is that the dictionary changes little between runs, so looking for hits on the new labels would be cheaper than re-annotating the entire corpus.
The following is an example mapping for a PubMed abstract. The annotations are nested and excluded from _source, so only their indexed values are kept, not the original JSON. Elasticsearch indexes nested objects as separate hidden documents, which allows you to search just the annotations very quickly.
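Since updates are not implemented, the following is only a sketch of how the idea might look. The dictionary shape (label mapped to class id), the helper names `new_labels` and `label_queries`, and the searched field names are assumptions; the query bodies use the standard `bool` and `match_phrase` clauses.

```ruby
require 'set'

# Hypothetical dictionary snapshots: label => class id.
old_dictionary = { "heart attack" => 1, "melanoma" => 2 }
new_dictionary = { "heart attack" => 1, "melanoma" => 2,
                   "myocardial infarction" => 1 }

# Labels present in the new dictionary but absent from the old one.
def new_labels(old_dict, new_dict)
  new_dict.keys.to_set - old_dict.keys.to_set
end

# One Elasticsearch query body per new label, matching the label as a
# phrase in any of the given text fields.
def label_queries(labels, fields)
  labels.sort.map do |label|
    clauses = fields.map { |field| { match_phrase: { field => label } } }
    { query: { bool: { should: clauses } } }
  end
end

queries = label_queries(new_labels(old_dictionary, new_dictionary),
                        ["title", "abstract"])
```

Only the documents returned by these queries would then need to be re-annotated and re-indexed.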
{
  "mappings": {
    "citation": {
      "_source": {
        "includes": ["title", "abstract"],
        "excludes": ["annotations"]
      },
      "properties": {
        "title": {
          "type": "string"
        },
        "abstract": {
          "type": "string"
        },
        "annotations": {
          "type": "nested",
          "properties": {
            "direct": {"type": "long", "store": false, "include_in_all": false},
            "ancestors": {"type": "long", "store": false, "include_in_all": false}
          }
        }
      }
    }
  }
}
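To show how the nested annotations in this mapping might be searched, here is a minimal sketch that builds a nested query body as a Ruby hash. The helper name `annotation_query` and the sample class id are assumptions; the field names come from the mapping above, and the body would be sent to the index's `_search` endpoint with whatever client is in use.

```ruby
require 'json'

# Build a query body matching documents annotated with class_id, either
# directly or (optionally) through one of its ancestors.
def annotation_query(class_id, include_ancestors: true)
  clauses = [{ term: { "annotations.direct" => class_id } }]
  clauses << { term: { "annotations.ancestors" => class_id } } if include_ancestors
  {
    query: {
      nested: {
        path: "annotations",
        query: { bool: { should: clauses } }
      }
    }
  }
end

# Hypothetical class id, for illustration only.
puts JSON.pretty_generate(annotation_query(1234567890))
```

Because annotations are excluded from _source, hits return only the title and abstract; the annotation ids are usable for matching but are not returned with the document.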
The default MRI implementation of Ruby does not run threads in parallel: its global interpreter lock allows only one thread to execute Ruby code at a time. JRuby, a version of Ruby that runs on the JVM, does not have this limitation, and switching to it provides a nice performance boost, especially as we can run the population process in threads to take advantage of a shared in-memory cache.
The code is written to work in both MRI and JRuby, but population should always be done using JRuby.
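The threading pattern described above can be illustrated with a small worker pool that shares one in-memory cache. This is a generic sketch, not the project's population code: the `SharedCache` class, the `annotate_all` method, and the toy "annotation" step (lowercasing words) are all hypothetical.

```ruby
require 'thread'

# Hypothetical cache shared by all worker threads, guarded by a mutex.
class SharedCache
  def initialize
    @store = {}
    @lock  = Mutex.new
  end

  # Return the cached value for key, computing and storing it on a miss.
  # The block runs while the lock is held, so misses are serialized.
  def fetch(key)
    @lock.synchronize do
      @store.fetch(key) { @store[key] = yield }
    end
  end
end

# Process documents on thread_count threads; results arrive in no fixed order.
def annotate_all(documents, thread_count)
  cache   = SharedCache.new
  queue   = Queue.new
  results = Queue.new
  documents.each { |doc| queue << doc }

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        doc = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        # Stand-in for real annotation: normalize each word via the cache.
        results << doc.split.map { |word| cache.fetch(word) { word.downcase } }
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end
```

The same code runs on both implementations; on MRI the workers take turns, while on JRuby they can execute on separate cores and still share the single cache.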
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request