Tool package to convert between the KorAP XML format and the CoNLL-U format, as well as other simple formats, including token boundary information.
The state of the package is very preliminary. Currently, two scripts are provided:
korapxml2conllu
converts KorAP XML zip "base" and "morpho" (with POS and lemma annotations) files to corresponding CoNLL-U (or word2vec input) files with foundry information, text IDs and token offsets in comments
conllu2korapxml
converts CoNLL-U files that follow KorAP-specific comment conventions and contain morphosyntactic and/or dependency annotations to corresponding KorAP-XML zip files
cpanm https://github.com/KorAP/KorAP-XML-CoNLL-U.git
perl Makefile.PL
make
make test TEST_VERBOSE=1
make install
$ korapxml2conllu wpd17.tree_tagger.zip | head -42
# foundry = tree_tagger
# filename = WPD17/A00/00001/tree_tagger/morpho.xml
# text_id = WPD17_A00.00001
# start_offsets = 0 0 5 13 19 23 33 37 43 52 61 63 67 73 85 87 91 97 101 113 123 130 136 142 146 150 155 158 169 178 184 190
# end_offsets = 191 4 12 18 22 32 36 42 51 61 62 66 72 85 86 90 96 100 112 122 129 135 141 145 149 154 157 168 177 183 190 191
1 Alan Alan NE NE _ _ _ _ 1.000000
2 Smithee -- NE NE _ _ _ _ 1.000000
3 steht stehen VVFIN VVFIN _ _ _ _ 1.000000
4 als als KOKOM KOKOM _ _ _ _ 0.995658
5 Pseudonym Pseudonym NN NN _ _ _ _ 1.000000
6 für für APPR APPR _ _ _ _ 1.000000
7 einen eine ART ART _ _ _ _ 0.998238
8 fiktiven fiktiv ADJA ADJA _ _ _ _ 1.000000
9 Regisseur Regisseur NN NN _ _ _ _ 1.000000
10 , , $, $, _ _ _ _ 1.000000
11 der die ART ART _ _ _ _ 0.954604
12 Filme Film NN NN _ _ _ _ 1.000000
13 verantwortet verantworten VVPP VVPP _ _ _ _ 0.753983
14 , , $, $, _ _ _ _ 1.000000
15 bei bei APPR APPR _ _ _ _ 0.999325
16 denen die PDS PDS _ _ _ _ 0.906725
17 der die ART ART _ _ _ _ 0.998927
18 eigentliche eigentlich ADJA ADJA _ _ _ _ 1.000000
19 Regisseur Regisseur NN NN _ _ _ _ 1.000000
20 seinen sein PPOSAT PPOSAT _ _ _ _ 1.000000
21 Namen Name NN NN _ _ _ _ 1.000000
22 nicht nicht PTKNEG PTKNEG _ _ _ _ 1.000000
23 mit mit APPR APPR _ _ _ _ 0.999012
24 dem die ART ART _ _ _ _ 0.999949
25 Werk Werk NN NN _ _ _ _ 1.000000
26 in in APPR APPR _ _ _ _ 1.000000
27 Verbindung Verbindung NN NN _ _ _ _ 1.000000
28 gebracht bringen VVPP VVPP _ _ _ _ 0.999331
29 haben haben VAINF VAINF _ _ _ _ 0.999987
30 möchte mögen VMFIN VMFIN _ _ _ _ 1.000000
31 . . $. $. _ _ _ _ 1.000000
# start_offsets = 192 192 196 201 205 210 216 219 223 227 237 243 246 254 255 258 260 264 271 283 292 294 302 306 309 316 319
# end_offsets = 320 195 200 204 209 215 218 222 226 236 242 245 253 255 258 259 263 270 282 292 293 301 305 308 315 319 320
1 Von von APPR APPR _ _ _ _ 0.999214
2 1968 1968 CARD CARD _ _ _ _ 1.000000
3 bis bis APPR APPR _ _ _ _ 0.861721
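The token lines above have the ten columns of a CoNLL-U record, and the last column holds what appears to be the tagger's probability for the assigned tag. As a minimal sketch, assuming the real output is tab-separated as CoNLL-U prescribes (the listing above shows the columns with plain spaces), the following Python snippet collects tokens whose value in that column falls below a threshold. The names `read_tokens` and `low_confidence` and the threshold of 0.9 are illustrative and not part of this package.

```python
# Column names follow the CoNLL-U layout; the last column carries the score here.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_tokens(lines):
    """Yield one dict per token line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # not a regular ten-column token line
        yield dict(zip(FIELDS, cols))


def low_confidence(tokens, threshold=0.9):
    """Return the tokens whose last column is a number below the threshold."""
    result = []
    for tok in tokens:
        try:
            score = float(tok["misc"])
        except ValueError:
            continue  # "_" or another non-numeric value
        if score < threshold:
            result.append(tok)
    return result
```

Applied to the output of the command above (for example via `read_tokens(sys.stdin)`), this would pick out tokens such as "verantwortet" (0.753983) and "bis" (0.861721).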
$ ./script/korapxml2conllu t/data/goe.zip | head -20
# foundry = base
# filename = GOE/AGA/00000/base/tokens.xml
# text_id = GOE_AGA.00000
# start_offsets = 0 0 9 12
# end_offsets = 22 8 11 22
1 Campagne _ _ _ _ _ _ _ _
2 in _ _ _ _ _ _ _ _
3 Frankreich _ _ _ _ _ _ _ _
# start_offsets = 23 23
# end_offsets = 27 27
1 1792 _ _ _ _ _ _ _ _
# start_offsets = 28 28 33 37 40 44 53
# end_offsets = 54 32 36 39 43 53 54
1 auch _ _ _ _ _ _ _ _
2 ich _ _ _ _ _ _ _ _
3 in _ _ _ _ _ _ _ _
4 der _ _ _ _ _ _ _ _
5 Champagne _ _ _ _ _ _ _ _
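In the examples above, the first value of each `# start_offsets` and `# end_offsets` comment appears to give the span of the whole sentence, followed by one value per token (for instance `0 0 9 12` and `22 8 11 22` for the three tokens of "Campagne in Frankreich"). This reading is inferred from the sample output, not taken from a format specification. Under that assumption, and again assuming tab-separated token lines, a short Python sketch that pairs each token with its character offsets could look like this; `read_sentences` is an illustrative name, not part of this package.

```python
def read_sentences(lines):
    """Yield ((sent_start, sent_end), [(form, start, end), ...]) per sentence."""
    starts, ends, forms = [], [], []

    def flush():
        # Assumption: the first offset is the sentence span, the rest are tokens.
        if not forms:
            return None
        n = len(forms)
        if len(starts) != n + 1 or len(ends) != n + 1:
            raise ValueError("offset comments do not match the token count")
        return (starts[0], ends[0]), list(zip(forms, starts[1:], ends[1:]))

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# start_offsets ="):
            # A new start_offsets comment opens the next sentence.
            done = flush()
            if done:
                yield done
            starts = [int(v) for v in line.split("=", 1)[1].split()]
            ends, forms = [], []
        elif line.startswith("# end_offsets ="):
            ends = [int(v) for v in line.split("=", 1)[1].split()]
        elif line and not line.startswith("#"):
            forms.append(line.split("\t")[1])  # second column is the word form

    done = flush()
    if done:
        yield done
```

The function raises a `ValueError` when the number of offsets does not fit the number of tokens, which makes a wrong assumption about the comment layout visible early.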
$ ./script/korapxml2conllu --word2vec t/data/wdf19.zip
Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .
…
$ ./script/korapxml2conllu -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' --word2vec t/data/wdf19.zip
WDF19/A0000.10894 2014.08.28 Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894 2014.08.28 Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894 2014.08.28 la bd belge et touts les auteurs européens ..
WDF19/A0000.10894 2014.08.28 on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894 2014.08.28 wikipedia ce prete parfaitement à ce genre de decryptage .
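With `-m`, each output line starts with one column per pattern, followed by the tokenized sentence. A minimal Python sketch for splitting such lines is given below. It assumes that the extracted metadata values contain no whitespace, which holds for the text sigle and creation date above but not necessarily for other patterns. `split_line` is an illustrative helper, not part of this package.

```python
def split_line(line, n_meta):
    """Split one output line into (metadata values, sentence tokens).

    n_meta is the number of -m patterns given on the command line.
    """
    # Split on any whitespace, at most n_meta times; the rest is the sentence.
    parts = line.split(None, n_meta)
    meta, rest = parts[:n_meta], parts[n_meta:]
    tokens = rest[0].split() if rest else []
    return meta, tokens
```

For the two patterns used above, `split_line(line, 2)` separates the sigle and the date from the tokens of the sentence.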
$ ./script/conllu2korapxml < t/data/goe.morpho.conllu > goe.morpho.zip
Author:
Copyright (c) 2024, Leibniz Institute for the German Language, Mannheim, Germany
This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS).
It is published under the BSD 2-clause "Simplified" license.
Contributions are very welcome!
Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see "Gerrit Code Review - A Quick Introduction" if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.