bconv offers format conversion and manipulation of documents with text and annotations.
It supports various popular formats used in natural-language processing for biomedical texts.
The following formats are currently supported:
Name | I | O | T | A | Description |
---|---|---|---|---|---|
bioc_xml, bioc_json | ✓ | ✓ | ✓ | ✓ | BioC |
bionlp |   | ✓ |   | ✓ | BioNLP stand-off |
brat |   | ✓ |   | ✓ | brat stand-off |
conll | ✓ | ✓ | ✓ | ✓ | CoNLL |
europepmc, europepmc.zip |   | ✓ |   | ✓ | Europe-PMC JSON |
pubtator, pubtator_fbk | ✓ | ✓ | ✓ | ✓ | PubTator |
pubmed, pxml | ✓ |   | ✓ |   | PubMed abstracts |
pmc, nxml | ✓ |   | ✓ |   | PMC full-text |
pubanno_json, pubanno_json.tgz | ✓ | ✓ | ✓ | ✓ | PubAnnotation JSON |
csv, tsv |   | ✓ |   | ✓ | comma/tab-separated values |
text_csv, text_tsv |   | ✓ | ✓ | ✓ | comma/tab-separated values |
txt | ✓ | ✓ | ✓ |   | plain text |
txt.json | ✓ | ✓ | ✓ |   | collection of plain-text documents |
I: input format; O: output format; T: can represent text; A: can represent annotations (entities).
bconv is hosted on PyPI, so you can use pip to install it:
$ pip install bconv
Load an annotated collection in BioC XML format:
>>> import bconv
>>> coll = bconv.load('path/to/example.xml', fmt='bioc_xml')
>>> coll
<Collection with 37 documents at 0x7f1966e4b3c8>
A Collection is a sequence of Document objects:
>>> coll[0]
<Document with 12 sections at 0x7f1966e2f6d8>
Documents contain Sections, which contain Sentences:
>>> sent = coll[0][3][5]
>>> sent.text
'A Live cell imaging reveals that expression of GFP‐KSHV‐TK, but not GFP induces contraction of HeLa cells.'
Find the first annotation for this sentence:
>>> e = next(sent.iter_entities())
>>> e.start, e.end, e.text
(571, 578, 'KSHV‐TK')
>>> e.metadata
{'type': 'gene/protein', 'ui': 'Uniprot:F5HB62'}
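The offsets above (571, 578) are larger than the sentence itself, which is only about a hundred characters long, so they appear to count from the start of the document rather than from the start of the sentence. The following is a minimal sketch of how such offsets could be shifted to index into the sentence text. It does not use bconv: `Span` and `to_local` are hypothetical stand-ins, and the sentence's starting offset is inferred from the example values rather than read from the library.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an annotation (not a bconv class)."""
    start: int
    end: int
    text: str


def to_local(span, container_start):
    """Shift document-level offsets so they index into the container's text."""
    return span.start - container_start, span.end - container_start


# Sentence and entity values copied from the example above.
# '\u2010' is the Unicode hyphen that appears in the original text.
sentence = ('A Live cell imaging reveals that expression of GFP\u2010KSHV\u2010TK, '
            'but not GFP induces contraction of HeLa cells.')
entity = Span(571, 578, 'KSHV\u2010TK')

# Assumption: the sentence's document offset is inferred from where the
# entity text occurs inside the sentence.
sentence_start = entity.start - sentence.index(entity.text)

local_start, local_end = to_local(entity, sentence_start)
recovered = sentence[local_start:local_end]
```

With these values, slicing the sentence with the shifted offsets gives back the entity text, which is a quick consistency check when mixing sentence-level text with document-level offsets.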
Write the whole collection to a new file in CoNLL format:
>>> with open('path/to/example.conll', 'w', encoding='utf8') as f:
...     bconv.dump(coll, f, fmt='conll', tagset='IOBES', include_offsets=True)
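For readers unfamiliar with the IOBES tag set named in the call above, here is a rough sketch of the scheme: each token is labelled B (begin), I (inside), E (end), S (single-token entity) or O (outside). This is an illustration only and not bconv's implementation. The function `iobes_tags`, the whitespace tokenization, and the sample text and labels are all assumptions made for the example.

```python
import re


def iobes_tags(text, entities):
    """Label whitespace-delimited tokens with IOBES tags.

    entities: non-overlapping (start, end, label) character spans.
    Returns a list of (token, tag) pairs.
    """
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r'\S+', text)]
    tags = ['O'] * len(tokens)
    for start, end, label in entities:
        # Indices of all tokens that overlap the entity span.
        covered = [i for i, (ts, te, _) in enumerate(tokens)
                   if ts < end and te > start]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = f'S-{label}'
        else:
            tags[covered[0]] = f'B-{label}'
            for i in covered[1:-1]:
                tags[i] = f'I-{label}'
            tags[covered[-1]] = f'E-{label}'
    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]


# Made-up sample: a two-token entity and a single-token entity.
sample = 'KSHV TK induces contraction of HeLa cells'
tagged = iobes_tags(sample, [(0, 7, 'gene'), (31, 35, 'cell')])
```

Compared with plain IOB, the extra E and S tags mark entity ends and single-token entities explicitly, so entity boundaries can be read directly from the tags.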
bconv is documented in the GitHub wiki.