biotea-annotation
A refactoring of the annotation code at https://github.com/alexgarciac/biotea. It produces RDF annotations for PubMed and PMC articles using entity recognition tools such as the NCBO Annotator (http://www.bioontology.org/annotator-service) and CMA (http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/810/664). CMA is not a public service, so this documentation covers annotation with the NCBO Annotator.
Dependencies
Most of the dependencies are managed with Maven. There are, however, a few local dependencies: biotea-utilities, biotea-ao, and one jar located in the lib directory provided with this project.
This project uses the NCBO Annotator, so annotations obtained for the same file at different times can vary as the ontologies and the responses retrieved from the annotator change.
How to run this project using the batch option
- Clone biotea-utilities
- Clone biotea-ao
- Clone this repository
- In your IDE, add biotea-utilities, biotea-ao, and the jars in the lib directory as dependencies of this project
- Modify the configuration file config.properties in the biotea-utilities resources folder (path-to-biotea-utilities/src/main/resources/config.properties). If you are generating annotations for articles RDFized with biotea-rdfization, make sure you use the same configuration there. Most of the time you only need to change the following properties:
- biotea.dataset.prefix: Either pmc or pubmed
- biotea.dataset: For instance dataset/pmc or dataset/pubmed or bio2rdf_dataset:bio2rdf-pmc-vrX or bio2rdf_dataset:bio2rdf-pubmed-vrX. This will be used in the VOiD properties of the generated dataset.
- biotea.base: For instance biotea.ws or bio2rdf.org. This will be used to generate the URI to resources. bio2rdf will generate URIs compatible with Bio2RDF URI style.
- ncbo.annotator.exclude: Aliases for those ontologies that should not be used by the NCBO Annotator. All the aliases are defined as properties at path-to-biotea-utilities/src/main/resources/ontologies.properties.
- Specify a valid API-KEY to use the NCBO Annotator or the AgroPortal annotator at path-to-biotea-utilities/src/main/resources/apikey.properties
- Make sure you include the biotea-utilities resources folder in your classpath
- The main class is ws.biotea.ld2rdf.annotation.batch.BatchApplication. It accepts the following parameters:
- -in --mandatory, should point to a directory with all the files to be annotated
- -out --mandatory, should point to the directory where the output files will be written
- -annotator --optional, use ncbo (default value) or agroportal.
- -extension --mandatory, only files with this extension will be processed; we recommend either nxml or rdf
- -inStyle --optional, either jats_file (default value) or rdf_file
- -onto --optional, either ao (default value) for the Annotation Ontology or oa for Open Annotation; this defines the annotation ontology used to serialize the annotations
- -format --optional, either XML (default value) or JSON-LD
- -onlyTA --optional, if present, only the title and abstract will be annotated
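
As a rough illustration of how the parameters above combine, the shell sketch below assembles a BatchApplication command line and prints it instead of running it. The function name build_batch_command is hypothetical, the ao default for -onto is inferred from the examples in this README, and the sketch assumes the classpath has already been set up as described in the previous steps.

```shell
# Dry-run sketch: print the BatchApplication command for a given set of options.
# build_batch_command is a hypothetical helper, not part of the project.
MAIN_CLASS="ws.biotea.ld2rdf.annotation.batch.BatchApplication"

build_batch_command() {
  # Mandatory: input directory, output directory, file extension.
  in_dir="$1"
  out_dir="$2"
  extension="$3"
  # Optional, falling back to the documented default values.
  in_style="${4:-jats_file}"
  onto="${5:-ao}"          # assumed default, inferred from the examples
  format="${6:-XML}"
  annotator="${7:-ncbo}"

  echo "java $MAIN_CLASS -in $in_dir -out $out_dir -extension $extension" \
       "-inStyle $in_style -annotator $annotator -onto $onto -format $format"
}

# Annotate the sample JATS folder shipped with the project, using all defaults.
build_batch_command nxmlInputToProcess output nxml
```

Calling `build_batch_command rdfInputToProcess output rdf rdf_file` prints the corresponding command for RDF input. Paths containing spaces would need quoting before the printed command is executed.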
Input
If jats_file is used as inStyle option:
- Input files should follow the JATS DTDs
- To process input articles in batch, place them in a single folder such as the one provided at nxmlInputToProcess.
- A valid JATS input file is provided at nxmlInputToProcess/DELETE_ME_PMC3879346.nxml
- The corresponding output following the AO notation is provided at output/PMC3879346_ncbo_annotations_AO_JATS.rdf
- The corresponding output following the OA notation is provided at output/PMC3879346_ncbo_annotations_OA_JATS.rdf
If rdf_file is used as inStyle option:
- Input files should correspond to output files generated by biotea-rdfization
- Only RDF files corresponding to sections should be used
- A valid RDF input file is provided at rdfInputToProcess/PMC3879346_sections.rdf
- The corresponding output following the AO notation is provided at output/PMC3879346_ncbo_annotations_AO_RDF.rdf
- The corresponding output following the OA notation is provided at output/PMC3879346_ncbo_annotations_OA_RDF.rdf
Output
- One RDF file per input file
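
Since each input file should yield one RDF file, a quick sanity check after a batch run is to compare file counts. The sketch below is only an illustration: count_files is a hypothetical helper, it assumes that outputs carry the .rdf extension (as the sample outputs do) and that -out pointed at an initially empty directory, and it builds a scratch directory so that it can run anywhere.

```shell
# Sketch: compare the number of input files with the number of RDF outputs.
# count_files is a hypothetical helper, not part of the project.
count_files() {
  # Count regular files directly inside directory $1 whose names end in .$2
  find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Scratch layout standing in for the real -in and -out directories.
work_dir="$(mktemp -d)"
mkdir "$work_dir/in" "$work_dir/out"
touch "$work_dir/in/a.nxml" "$work_dir/in/b.nxml" "$work_dir/out/a.rdf"

inputs="$(count_files "$work_dir/in" nxml)"
outputs="$(count_files "$work_dir/out" rdf)"

if [ "$inputs" -eq "$outputs" ]; then
  status="complete"
else
  status="incomplete: $outputs of $inputs files annotated"
fi
echo "$status"

rm -rf "$work_dir"
```

For a real run, replace the scratch directories with the values passed to -in and -out and the extension passed to -extension.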
### Examples
For instance, if you want to annotate PMC articles following the Bio2RDF URL model, you need this configuration:
- biotea.dataset.prefix=pmc
- biotea.dataset=bio2rdf_dataset:bio2rdf-pmc-vr2
- biotea.base=bio2rdf.org

Remember to specify a valid API key in apikey.properties.
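
The three properties above can also be set from the command line. The sketch below is one possible approach, not part of the project: set_prop is a hypothetical helper that assumes plain key=value lines without spaces around the equals sign, and it works on a scratch file rather than on the real path-to-biotea-utilities/src/main/resources/config.properties.

```shell
# Sketch: set key=value pairs in a Java properties file.
# set_prop is a hypothetical helper, not part of the project.
set_prop() {
  # set_prop FILE KEY VALUE: replace the line for KEY, or append it if absent.
  prop_file="$1"
  prop_tmp="$(mktemp)"
  awk -F= -v key="$2" -v value="$3" '
    $1 == key { print key "=" value; found = 1; next }
    { print }
    END { if (!found) print key "=" value }
  ' "$prop_file" > "$prop_tmp" \
    && cat "$prop_tmp" > "$prop_file" \
    && rm -f "$prop_tmp"
}

# Scratch file standing in for config.properties; the initial values are made up.
config="$(mktemp)"
printf 'biotea.dataset.prefix=pubmed\nbiotea.base=biotea.ws\n' > "$config"

set_prop "$config" biotea.dataset.prefix pmc
set_prop "$config" biotea.dataset bio2rdf_dataset:bio2rdf-pmc-vr2
set_prop "$config" biotea.base bio2rdf.org

cat "$config"
```

Existing keys are rewritten in place and missing keys are appended, so the helper can be re-run safely with the same values.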
If you want to annotate JATS files with extension nxml and get RDF/XML files following the AO model, use:
- java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in -out -extension nxml (equivalent to the following command, which also specifies all parameters that have default values)
- java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in -out -extension nxml -inStyle jats_file -annotator ncbo -onto ao -format XML
If you want to annotate RDF files with extension rdf and get RDF/XML files following the AO model, use:
- java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in -out -extension rdf -inStyle rdf_file (equivalent to the following command, which also specifies all parameters that have default values)
- java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in -out -extension rdf -inStyle rdf_file -annotator ncbo -onto ao -format XML
If you want to annotate JATS files with extension nxml and get RDF/XML files following the OA model, use:
- java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in -out -extension nxml -onto OA (equivalent to the following command, which also specifies all parameters that have default values)
- java ws.biotea.ld2rdf.annotation.batch.BatchApplication -in -out -extension nxml -inStyle jats_file -annotator ncbo -onto OA -format XML