Python 3 Library for Reading the Text Content and Metadata of TEI P5 (Lite) Files
The library focuses on extracting the main text content from a file and providing the available metadata about the text.
It was originally created to support importing TEI formatted corpora into GrETEL, using the corpus2alpino library.
pip install tei-reader
from tei_reader import TeiReader
reader = TeiReader()
corpora = reader.read_file('example-tei.xml') # or read_string
print(corpora.text)
# show element attributes before the actual element text
print(corpora.tostring(lambda x, text: str(list(a.key + '=' + a.text for a in x.attributes)) + text))
A reader can be opened using TeiReader()
. It is then possible to either call read_file(file_name)
or read_string(str)
. Both will return a Corpora
object containing the following properties:
Property | Description |
---|---|
corpora[] |
A corpora can contain sub-corpora. |
documents[] |
The Document objects directly part of this corpora. |
Corpora
and Document
all inherit from Element
. In all objects deriving from this it is possible to call:
Property | Description |
---|---|
attributes{} |
Contain attributes applicable to this element. If an attribute contains attributes these are also returned. (e.g. encodingDesc::editorialDecl::normalization ) |
text |
Get the entire text content as str |
divisions[] |
Recursively get all the text divisions in document order. If an element contains parts or text without tag. Those will be returned in order and wrapped with a PlaceholderDivision . |
all_parts[] |
Recursively get the parts in document order constituting the entire text e.g. if something has emphasis, a footnote or is marked as foreign. Text without a container element will be returned in order and wrapped with a PlaceholderPart . |
parts[] |
Get the parts in document order directly below the current element. |
Attribute
, PlaceholderDivision
and PlaceholderPart
all support the same properties as Element
.
python setup.py sdist
twine upload dist/*