PT: wrong/no sentence segmentation
matyaskopp commented
The source transcriptions of the PT debates do not seem to contain paragraphs, yet in the corpus the text is somehow segmented into paragraphs. My guess is that a paragraph (<seg>) ends wherever sentence-final punctuation (such as . or ?) appears at the end of a line.
https://debates.parlamento.pt/catalogo/r3/dar/01/13/04/035/2019-01-04?sft=true#p5
"paragraphs" are framed:
The TEI:
<seg xml:id="ParlaMint-PT_2019-01-04.seg21">Em primeiro <!--
--> privada. A segurança <!--
--> complementar.</seg>
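To make the guess about the paragraph splitting concrete, here is a minimal Python sketch of such a rule. It is only an assumption about what the conversion might be doing, not the actual ParlaMint-PT pipeline; the function name `guess_segs` and the punctuation set are hypothetical.

```python
import re

# Hypothetical reconstruction of the guessed rule (not the real converter):
# a source line that ends in sentence-final punctuation closes the current <seg>.
SENTENCE_FINAL = re.compile(r"[.?]\s*$")


def guess_segs(lines):
    """Group raw source lines into paragraph-like segments."""
    segs, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        current.append(line)
        if SENTENCE_FINAL.search(line):
            # Punctuation at end of line: assume the paragraph ends here.
            segs.append(" ".join(current))
            current = []
    if current:
        # Trailing lines without final punctuation form the last segment.
        segs.append(" ".join(current))
    return segs
```

If the real converter works this way, a <seg> boundary depends only on where the source happens to break its lines, which would explain why one seg can hold several sentences.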
The TEI.ana (the whole seg is annotated as a single <s>, although it contains at least two sentences):
<seg xml:id="ParlaMint-PT_2019-01-04.seg21">
<s xml:id="ParlaMint-PT_2019-01-04.seg21.s">
<w xml:id="ParlaMint-PT_2019-01-04.seg21.s.1" msd="UPosTag=ADP" lemma="em">Em</w>
<w xml:id="ParlaMint-PT_2019-01-04.seg21.s.2" msd="UPosTag=ADJ|Gender=Masc|Number=Sing" lemma="primeiro">primeiro</w>
<!-- -->
<w xml:id="ParlaMint-PT_2019-01-04.seg21.s.14" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="privar,privado" join="right">privada</w>
<pc xml:id="ParlaMint-PT_2019-01-04.seg21.s.15" msd="UPosTag=PUNCT">.</pc>
<w xml:id="ParlaMint-PT_2019-01-04.seg21.s.16" msd="UPosTag=DET|Gender=Fem|Number=Sing" lemma="a">A</w>
<w xml:id="ParlaMint-PT_2019-01-04.seg21.s.17" msd="UPosTag=NOUN|Gender=Fem|Number=Sing" lemma="segurança">segurança</w>
<!-- -->
<w xml:id="ParlaMint-PT_2019-01-04.seg21.s.47" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="complementar" join="right">complementar</w>
<pc xml:id="ParlaMint-PT_2019-01-04.seg21.s.48" msd="UPosTag=PUNCT">.</pc>
<linkGrp targFunc="head argument" type="UD-SYN"><!-- --> </linkGrp>
</s>
</seg>
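A rough way to find other affected places is to look for <s> elements that contain sentence-final punctuation before their last token, as in the seg above. The Python sketch below uses only the standard library; the function name `suspicious_sentences` and the set of punctuation marks are assumptions, and tokens such as abbreviations may produce false positives.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Strip an optional namespace, e.g. '{http://www.tei-c.org/ns/1.0}s' -> 's'."""
    return tag.rsplit("}", 1)[-1]


def suspicious_sentences(xml_text, finals=(".", "?")):
    """Return (xml:id, count) for each <s> with sentence-final <pc> before its last token."""
    root = ET.fromstring(xml_text)
    hits = []
    for s in root.iter():
        if local(s.tag) != "s":
            continue
        # All word and punctuation tokens of the sentence, in document order.
        tokens = [t for t in s.iter() if local(t.tag) in ("w", "pc")]
        # Sentence-final punctuation anywhere except the very last token.
        inner = [
            t for t in tokens[:-1]
            if local(t.tag) == "pc" and (t.text or "").strip() in finals
        ]
        if inner:
            hits.append((s.get(XML_ID), len(inner)))
    return hits
```

Run on a fragment like the one above, it would report ParlaMint-PT_2019-01-04.seg21.s because of the full stop after "privada".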