biotea-annotation

Refactoring of the annotation code at https://github.com/alexgarciac/biotea. It produces RDF annotations for PubMed and PMC articles using entity recognition tools such as the NCBO Annotator (http://www.bioontology.org/annotator-service) and CMA (http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/810/664). CMA is not a public service, so this documentation covers only annotation with the NCBO Annotator.

Dependencies

Most of the dependencies are configured with Maven. There are, however, a few local dependencies: biotea-utilities, biotea-ao, and one jar located in the lib directory provided with this project.

This project uses the NCBO Annotator, so annotations obtained for the same file at different times can vary due to changes in the ontologies and in the responses returned by the annotator.

How to run this project using the batch option

Clone biotea-utilities
Clone biotea-ao
Clone this repository
In your IDE, add biotea-utilities, biotea-ao, and the jars in the lib directory as dependencies of this project
Modify the configuration file config.properties in the biotea-utilities resources folder (path-to-biotea-utilities/src/main/resources/config.properties). If you are generating annotations for articles RDFized with biotea-rdfization, make sure you use the same configuration there. Most of the time you only need to change the following properties:
- biotea.dataset.prefix: Either pmc or pubmed
- biotea.dataset: For instance dataset/pmc or dataset/pubmed or bio2rdf_dataset:bio2rdf-pmc-vrX or bio2rdf_dataset:bio2rdf-pubmed-vrX. This will be used in the VOiD properties of the generated dataset.
- biotea.base: For instance biotea.ws or bio2rdf.org. This will be used to generate the URIs of resources. bio2rdf.org will generate URIs compatible with the Bio2RDF URI style.
- ncbo.annotator.exclude: Aliases for those ontologies that should not be used by the NCBO Annotator. All the aliases are defined as properties at path-to-biotea-utilities/src/main/resources/ontologies.properties.
Specify a valid API key for the NCBO Annotator or the AgroPortal annotator in path-to-biotea-utilities/src/main/resources/apikey.properties
Make sure you include the biotea-utilities resources folder in your classpath
The main class is ws.biotea.ld2rdf.annotation.batch.BatchApplication. It takes the following parameters:
- -in (mandatory): should point to a directory containing all the files to be annotated
- -out (mandatory)
- -annotator (optional): either ncbo (default value) or agroportal
- -extension (mandatory): only files with this extension will be processed; nxml or rdf is recommended
- -inStyle (optional): either jats_file (default value) or rdf_file
- -onto (optional): either ao (default value) for the Annotation Ontology or oa for Open Annotation; this defines the annotation ontology used to serialize the annotations
- -format (optional): either XML (default value) or JSON-LD
- -onlyTA (optional): if present, only the title and abstract will be annotated
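
The way the flags and their defaults fit together can be sketched as follows. This is a hypothetical illustration, not the parsing code used by BatchApplication: the class name BatchArgsSketch, its parse method, and the sample directory names are all assumptions, and only the flag names and default values come from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of biotea-annotation: it only mirrors the
// flags and default values documented in the parameter list.
final class BatchArgsSketch {

    /** Parses "-flag value" pairs; -onlyTA is a switch that takes no value. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        // Defaults for the optional parameters.
        options.put("-annotator", "ncbo");
        options.put("-inStyle", "jats_file");
        options.put("-onto", "ao");
        options.put("-format", "XML");
        options.put("-onlyTA", "false");

        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            if (flag.equals("-onlyTA")) {
                options.put(flag, "true");
            } else if (i + 1 < args.length) {
                // Any other flag consumes the next token as its value.
                options.put(flag, args[++i]);
            } else {
                throw new IllegalArgumentException("Missing value for " + flag);
            }
        }
        // The three mandatory parameters have no default.
        for (String required : new String[] {"-in", "-out", "-extension"}) {
            if (!options.containsKey(required)) {
                throw new IllegalArgumentException("Missing mandatory parameter " + required);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        // Sample values only; "articles" and "annotations" are made-up directory names.
        String[] sample = {"-in", "articles", "-out", "annotations", "-extension", "nxml"};
        System.out.println(parse(sample));
    }
}
```

The sketch does not validate the values themselves (for example, that -inStyle is jats_file or rdf_file); the real application may behave differently for unknown or malformed input.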

Input

If jats_file is used as inStyle option:

Input files should follow the JATS DTDs
To process input articles in batch, place them together in one folder, such as the provided nxmlInputToProcess.
A valid JATS input file is provided at nxmlInputToProcess/DELETE_ME_PMC3879346.nxml
The corresponding output following the AO notation is provided at output/PMC3879346_ncbo_annotations_AO_JATS.rdf
The corresponding output following the OA notation is provided at output/PMC3879346_ncbo_annotations_OA_JATS.rdf

If rdf_file is used as inStyle option:

Input files should correspond to output files generated by biotea-rdfization
Only RDF files corresponding to sections should be used
A valid RDF input file is provided at rdfInputToProcess/PMC3879346_sections.rdf
The corresponding output following the AO notation is provided at output/PMC3879346_ncbo_annotations_AO_RDF.rdf
The corresponding output following the OA notation is provided at output/PMC3879346_ncbo_annotations_OA_RDF.rdf

Output

One RDF file per input file
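
Because only files with the requested extension are processed, and each one yields a single RDF file, the selection step can be pictured roughly as below. This is a minimal sketch under assumptions: InputSelectionSketch and filesWithExtension are hypothetical names, and the actual file selection in BatchApplication is not documented here. It only lists regular files directly inside the input directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of biotea-annotation.
final class InputSelectionSketch {

    /** Returns the regular files directly under dir whose name ends with ".extension". */
    static List<Path> filesWithExtension(Path dir, String extension) throws IOException {
        String suffix = "." + extension;
        // Files.list is not recursive; the stream must be closed after use.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(path -> path.getFileName().toString().endsWith(suffix))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses the first argument as input directory, or the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path input : filesWithExtension(dir, "nxml")) {
            System.out.println(input);
        }
    }
}
```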

Examples

For instance, if you want to annotate PMC articles following the Bio2RDF URI style, you need this configuration:

biotea.dataset.prefix=pmc
biotea.dataset=bio2rdf_dataset:bio2rdf-pmc-vr2
biotea.base=bio2rdf.org

Remember to specify a valid API key in apikey.properties.
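
The three properties above use the standard Java properties format, so they can be read with java.util.Properties. The following is only a sketch: ConfigSketch, load, and checkPrefix are hypothetical names, the text is parsed from a string to keep the example self-contained, and how biotea-utilities actually loads config.properties is not shown in this document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch, not code from biotea-utilities.
final class ConfigSketch {

    /** Parses text written in the java.util.Properties format. */
    static Properties load(String text) throws IOException {
        Properties properties = new Properties();
        properties.load(new StringReader(text));
        return properties;
    }

    /** Rejects any prefix other than the two values this documentation lists. */
    static void checkPrefix(Properties properties) {
        String prefix = properties.getProperty("biotea.dataset.prefix");
        if (!"pmc".equals(prefix) && !"pubmed".equals(prefix)) {
            throw new IllegalArgumentException(
                "biotea.dataset.prefix must be pmc or pubmed, got: " + prefix);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties properties = load(
            "biotea.dataset.prefix=pmc\n"
            + "biotea.dataset=bio2rdf_dataset:bio2rdf-pmc-vr2\n"
            + "biotea.base=bio2rdf.org\n");
        checkPrefix(properties);
        // The ':' inside the value is preserved because the key already ended at '='.
        System.out.println(properties.getProperty("biotea.dataset"));
    }
}
```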

If you want to annotate JATS files with extension nxml and get RDF/XML files following the AO model, use:

java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in <input-dir> -out <output-dir> -extension nxml

This is equivalent to the following command, which also specifies all parameters that have default values:

java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in <input-dir> -out <output-dir> -extension nxml -inStyle jats_file -annotator ncbo -onto ao -format XML

If you want to annotate RDF files with extension rdf and get RDF/XML files following the AO model, use:

java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in <input-dir> -out <output-dir> -extension rdf -inStyle rdf_file

This is equivalent to the following command, which also specifies all parameters that have default values:

java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in <input-dir> -out <output-dir> -extension rdf -inStyle rdf_file -annotator ncbo -onto ao -format XML

If you want to annotate JATS files with extension nxml and get RDF/XML files following the OA model, use:

java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in <input-dir> -out <output-dir> -extension nxml -onto OA

This is equivalent to the following command, which also specifies all parameters that have default values:

java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in <input-dir> -out <output-dir> -extension nxml -inStyle jats_file -annotator ncbo -onto OA -format XML
