Primary language: Python. License: MIT.

bconv: Python library for converting between BioNLP formats

bconv offers format conversion and manipulation of documents with text and annotations. It supports various popular formats used in natural-language processing for biomedical texts.

Supported formats

The following formats are currently supported:

Name                            I  O  T  A  Description
bioc_xml, bioc_json                         BioC
bionlp                                      BioNLP stand-off
brat                                        brat stand-off
conll                                       CoNLL
europepmc, europepmc.zip                    Europe-PMC JSON
pubtator, pubtator_fbk                      PubTator
pubmed, pxml                                PubMed abstracts
pmc, nxml                                   PMC full-text
pubanno_json, pubanno_json.tgz              PubAnnotation JSON
csv, tsv                                    comma/tab-separated values
text_csv, text_tsv                          comma/tab-separated values
txt                                         plain text
txt.json                                    collection of plain-text documents

I: input format; O: output format; T: can represent text; A: can represent annotations (entities).

Installation

bconv is hosted on PyPI, so you can use pip to install it:

$ pip install bconv

Usage

Load an annotated collection in BioC XML format:

>>> import bconv
>>> coll = bconv.load('path/to/example.xml', fmt='bioc_xml')
>>> coll
<Collection with 37 documents at 0x7f1966e4b3c8>

A Collection is a sequence of Document objects:

>>> coll[0]
<Document with 12 sections at 0x7f1966e2f6d8>

Documents contain Sections, which contain Sentences:

>>> sent = coll[0][3][5]
>>> sent.text
'A Live cell imaging reveals that expression of GFP‐KSHV‐TK, but not GFP induces contraction of HeLa cells.'
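The nesting shown above (collection, document, section, sentence) can be walked with plain loops. The following is a minimal sketch, not part of bconv: `iter_sentences` is a hypothetical helper, and it assumes that documents and sections can be iterated in the same way they are indexed.

```python
def iter_sentences(collection):
    """Yield (doc_index, section_index, sentence) for every sentence.

    Hypothetical helper: it relies only on the nesting
    collection -> document -> section -> sentence and assumes
    each level is iterable.
    """
    for doc_index, document in enumerate(collection):
        for section_index, section in enumerate(document):
            for sentence in section:
                yield doc_index, section_index, sentence
```

With the collection loaded earlier, `for d, s, sent in iter_sentences(coll): print(d, s, sent.text)` would print every sentence together with its position in the collection.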

Find the first annotation for this sentence:

>>> e = next(sent.iter_entities())
>>> e.start, e.end, e.text
(571, 578, 'KSHV‐TK')
>>> e.metadata
{'type': 'gene/protein', 'ui': 'Uniprot:F5HB62'}
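Because every entity carries a metadata dictionary, annotations can be summarised with a few lines of standard-library code. The sketch below is illustrative only: `count_entity_types` is a hypothetical helper, and it assumes that the metadata has a 'type' key as in the example above (other collections may use different keys).

```python
from collections import Counter


def count_entity_types(sentences):
    """Count annotations per 'type' metadata value.

    Hypothetical helper: uses only iter_entities() and the
    metadata mapping shown in the example session. Entities
    without a 'type' key are counted as 'unknown'.
    """
    counts = Counter()
    for sentence in sentences:
        for entity in sentence.iter_entities():
            counts[entity.metadata.get('type', 'unknown')] += 1
    return counts
```

Calling `count_entity_types([sent])` on the sentence above would return a Counter keyed by entity type.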

Write the whole collection to a new file in CoNLL format:

>>> with open('path/to/example.conll', 'w', encoding='utf8') as f:
...     bconv.dump(coll, f, fmt='conll', tagset='IOBES', include_offsets=True)
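Loading and dumping can be combined into a single conversion step. The following is a sketch under stated assumptions, not a bconv function: `convert` is a hypothetical wrapper that uses `bconv.load` and `bconv.dump` exactly as in the session above, and it opens the target in text mode, so it is only meant for text-based output formats.

```python
def convert(src_path, dst_path, in_fmt, out_fmt, **dump_options):
    """Read src_path in in_fmt and write it to dst_path in out_fmt.

    Hypothetical wrapper around the load/dump calls shown above.
    Extra keyword arguments are passed through to bconv.dump
    unchanged. Assumes a text-based output format.
    """
    # Imported here so the helper can be defined even where
    # bconv is not installed.
    import bconv

    coll = bconv.load(src_path, fmt=in_fmt)
    with open(dst_path, 'w', encoding='utf8') as f:
        bconv.dump(coll, f, fmt=out_fmt, **dump_options)
    return coll
```

The conversion from the example would then read `convert('path/to/example.xml', 'path/to/example.conll', 'bioc_xml', 'conll', tagset='IOBES', include_offsets=True)`.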

Documentation

bconv is documented in the GitHub wiki.