tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip
tei2korapxml is a script to convert TEI P5 and I5 based documents to the KorAP-XML format.
This program is usually called from inside another script.
TEI P5 formatted input with certain restrictions:
mandatory: text-header with integrated textsigle (or convertible identifier), text-body
optional: corp-header with integrated corpsigle, doc-header with integrated docsigle
Tokens inside the primary text must not be separated by newlines, because newlines are removed (see KorAP::XML::TEI::Data) and converting newlines between two tokens into blanks could introduce additional blanks where there should be none (e.g. punctuation characters like , or . should not be separated from their predecessor token). See also the code section ~ whitespace handling ~ in script/tei2korapxml.
Header types, like <idsHeader [...] type="document" [...] >, need to be defined in the same line as the header tag.
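The same-line restriction on header types can be checked before conversion. The following is a minimal sketch, not part of tei2korapxml: the helper name check_header_types is hypothetical, and it assumes that each idsHeader start tag begins on its own line and writes its attribute as type="...".

```sh
# Hypothetical pre-flight check (not part of tei2korapxml):
# print every line that opens an <idsHeader> tag without a
# type="..." attribute on that same line.
check_header_types() {
  # grep -n prefixes each hit with its line number;
  # "|| true" keeps the exit status 0 when nothing is reported.
  grep -n '<idsHeader' "$1" | grep -v 'type="' || true
}

# Usage: check_header_types corpus.i5.xml
```

Empty output means that no offending header line was found under these assumptions. It does not validate the document in any other respect.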
zip file output (default on stdout) with utf8 encoded entries (which together form the KorAP-XML format)
tei2korapxml requires libxml2-dev bindings and File::ShareDir::Install to be installed. When these requirements are met, the preferred way to install the script is to use cpanm.
$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
In case everything went well, the tei2korapxml tool will be available on your command line immediately.
Minimum requirement for KorAP::XML::TEI is Perl 5.36.
- --input|-i
-
The input file to process. If no specific input is defined and a single dash - is passed as an argument, data is read from STDIN.
- --root|-r
-
The root directory for output. Defaults to . (the current directory).
- --help|-h
-
Print help information.
- --version|-v
-
Print version information.
- --tokenizer-korap|-tk
-
Use the standard KorAP/DeReKo tokenizer.
- --tokenizer-internal|-ti
-
Tokenize the data using two embedded tokenizers that take an aggressive and a conservative approach, respectively.
- --tokenizer-call|-tc
-
Call an external tokenizer process that tokenizes the data from STDIN and outputs the offsets of all tokens.
Texts are separated using \x04\n. The external process should add a new line per text.
If the "--use-tokenizer-sentence-splits" option is activated, sentences are marked by offset as well in new lines.
To use Datok including sentence splitting, call tei2korapxml as follows:
$ cat corpus.i5.xml | tei2korapxml -s \
    -tc 'datok tokenize \
    -t ./tokenizer.matok \
    -p --newline-after-eot --no-sentences \
    --no-tokens --sentence-positions -' - \
    > corpus.korapxml.zip
- --skip-inline-tokens
-
Boolean flag indicating that inline tokens should not be processed. Defaults to false (meaning inline tokens will be processed).
- --skip-inline-token-annotations
-
Boolean flag indicating that inline token annotations should not be processed. Defaults to true (meaning inline token annotations won't be processed).
-
Expects a comma-separated list of tags to be ignored when the structure is parsed. The content of these tags, however, will still be processed.
- --xmlid-to-textsigle <from-regex>@<to-c/to-d/to-t>
-
Expects a regular replacement expression (separated by @ between the search and the replacement) to convert text id attributes to text sigles with three parts (separated by /).
Example:
tei2korapxml \
  --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  -tk - < t/data/icc_german_sample.p5.xml
Converts text id ICC.German.DeReKo.WPD17.G11.00238 to sigle ICCGER/DeReKo.WPD17/G11.00238.
- --inline-tokens <foundry>#[<file>]
-
Define the foundry and file (without extension) to store inline token information in. Unless --skip-inline-token-annotations is set, this will contain annotations as well. Defaults to tokens and morpho.
The inline token data will also be stored in the inline structures file (see --inline-structures), unless the inline token foundry is prepended by an exclamation mark (!), indicating that inline tokens are stored exclusively in the inline tokens file.
Example:
tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip
- --inline-structures <foundry>#[<file>]
-
Define the foundry and file (without extension) to store inline structure information in. Defaults to struct and structures.
- --base-foundry <foundry>
-
Define the base foundry to store newly generated token information in. Defaults to base.
- --data-file <file>
-
Define the file (without extension) to store primary data information in. Defaults to data.
- --header-file <file>
-
Define the file name (without extension) to store header information on the corpus, document, and text level in. Defaults to header.
- --use-tokenizer-sentence-splits|-s
-
Replace existing sentence boundary information with, or add, the sentence boundaries provided by the tokenizer. Currently the KorAP tokenizer and certain external tokenizers support these boundaries.
- --tokens-file <file>
-
Define the file (without extension) to store generated token information in (either from the KorAP tokenizer or an externally called tokenizer). Defaults to tokens.
- --log|-l
-
Loglevel for Log::Any. Defaults to notice.
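To preview what a replacement expression for --xmlid-to-textsigle does before running a full conversion, the documented example can be approximated with sed. This is only a sketch: the helper name xmlid_to_sigle is hypothetical, tei2korapxml itself evaluates the expression in Perl (so regex features may differ from sed's extended syntax), and the Perl placeholders $1 and $2 are written as \1 and \2 here.

```sh
# Hypothetical helper (not part of tei2korapxml): apply the example
# search/replacement pair from the --xmlid-to-textsigle description
# to a single text id, using sed -E instead of Perl.
xmlid_to_sigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

xmlid_to_sigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints: ICCGER/DeReKo.WPD17/G11.00238
```

The @ sign happens to serve as the sed delimiter here, which mirrors the way the option separates the search part from the replacement part.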
- KORAPXMLTEI_DEBUG
-
Activate minimal debugging. Defaults to false.
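The protocol described for --tokenizer-call (texts arrive on STDIN separated by \x04\n, and the process answers with one line of token offsets per text) can be illustrated with a toy whitespace tokenizer. This is a sketch under assumptions that the text above does not confirm: each text is assumed to arrive as a single line ending in \x04, offsets are assumed to be written as space-separated start and end positions, and positions are counted in awk's character units (bytes in some awk implementations). The name toy_tokenize is hypothetical.

```sh
# Toy whitespace tokenizer (illustration only, output format assumed):
# for every EOT-terminated input line, print one line with the
# start and end offset of each whitespace-delimited token.
toy_tokenize() {
  awk '{
    sub(/\004$/, "")                 # drop the end-of-text marker
    out = ""; pos = 0; rest = $0
    while (match(rest, /[^ \t]+/)) {
      start = pos + RSTART - 1       # zero-based start offset
      end = start + RLENGTH          # exclusive end offset
      out = out (out == "" ? "" : " ") start " " end
      pos = end
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out
  }'
}

printf 'Hello world\004\nfoo bar baz\004\n' | toy_tokenize
# prints:
# 0 5 6 11
# 0 3 4 7 8 11
```

A real tokenizer such as Datok handles punctuation, abbreviations, and sentence boundaries. The sketch only shows the shape of the exchange, one output line per input text.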
Copyright (C) 2021-2024, IDS Mannheim
Author: Peter Harders
Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
KorAP::XML::TEI is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS), member of the Leibniz-Gemeinschaft.
This program is free software published under the BSD-2 License.