This script converts the Annotated Corpus for Occitan (Bras ea 2018; CC BY-SA 4.0) to UPOS by splitting contractions. It was used for the paper "Does manipulating tokenization aid cross-lingual transfer? A study on POS tagging for non-standardized languages" (Blaschke ea, VarDial 2023).
```shell
# Get the data
wget https://zenodo.org/record/1182949/files/CorpusRestaureOccitan.zip
unzip CorpusRestaureOccitan.zip
rm CorpusRestaureOccitan.zip
rm -r __MACOSX
# Convert the corpus files to UPOS
python3 convert.py --glob "CorpusRestaureOccitan/*" --out "test_ROci_UPOS.tsv"
```
The original corpus is annotated with a slightly modified version of UPOS as well as with a custom tagset (Bernhard ea 2018).
We're only concerned with the former.
While the corpus annotation also mentions the non-UPOS tags `EPE` and `MOD`, these are not actually in the RESTAURE Occitan corpus.
Thus, the only tag we're concerned with is `ADP+DET`.
Analogously to the other Romance UD treebanks, we split the relevant word forms into one `ADP` token and one `DET` token (e.g. `dau ADP+DET` -> `de ADP` + `lo DET`).
The `convert.py` file contains the full mapping of contractions to split lemmas.
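
The splitting step can be sketched roughly as follows. This is a minimal illustration and not the code in `convert.py`: the names `SPLITS` and `split_contraction` are hypothetical, the mapping holds only the single example given above, and leaving unknown contractions untouched is an assumption about how such cases might be handled.

```python
# Illustrative mapping only; convert.py contains the real, complete one.
SPLITS = {
    "dau": ("de", "lo"),  # the example from this README
}


def split_contraction(form, tag):
    """Return a list of (form, tag) pairs, splitting ADP+DET tokens."""
    if tag != "ADP+DET":
        # Every other tag is passed through unchanged.
        return [(form, tag)]
    parts = SPLITS.get(form.lower())
    if parts is None:
        # Assumption: unknown contractions are kept as-is for inspection.
        return [(form, tag)]
    adp, det = parts
    return [(adp, "ADP"), (det, "DET")]


print(split_contraction("dau", "ADP+DET"))
# [('de', 'ADP'), ('lo', 'DET')]
```

Keeping the mapping in a plain dictionary makes it easy to see which contractions are covered and to spot any `ADP+DET` token that falls through unsplit.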