clarin-eric/ParlaMint

PT: wrong/no sentence segmentation


The source transcriptions of the PT debates do not seem to contain paragraphs, yet in the corpus the text is somehow segmented into paragraphs. My guess is that a paragraph (<seg>) ends whenever the punctuation ./?/ appears at the end of a line.

https://debates.parlamento.pt/catalogo/r3/dar/01/13/04/035/2019-01-04?sft=true#p5
(Screenshot of the page above, with the "paragraphs" framed.)
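
To make the guess concrete, here is a minimal Python sketch of the heuristic I suspect was applied. It is not the actual conversion code, which I have not seen; the function name `guess_paragraphs` and the set of closing characters are my own assumptions.

```python
def guess_paragraphs(lines, closers=(".", "?")):
    """Group source lines into paragraphs, closing a paragraph whenever
    a line ends in one of `closers` (assumed heuristic, not confirmed)."""
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines
        current.append(text)
        if text.endswith(closers):
            # Line-final punctuation: treat this as the end of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Trailing lines without closing punctuation form a last paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs
```

Under this heuristic, a full stop in the middle of a line (as in "privada. A segurança") never splits anything, which would match the <seg> shown in this issue.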

The TEI:

<seg xml:id="ParlaMint-PT_2019-01-04.seg21">Em primeiro <!-- 
--> privada. A segurança <!--
--> complementar.</seg>

The TEI.ana (the whole <seg> is annotated as a single <s>, although it contains two sentences):

<seg xml:id="ParlaMint-PT_2019-01-04.seg21">
  <s xml:id="ParlaMint-PT_2019-01-04.seg21.s">
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.1" msd="UPosTag=ADP" lemma="em">Em</w>
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.2" msd="UPosTag=ADJ|Gender=Masc|Number=Sing" lemma="primeiro">primeiro</w>
    <!-- -->
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.14" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="privar,privado" join="right">privada</w>
    <pc xml:id="ParlaMint-PT_2019-01-04.seg21.s.15" msd="UPosTag=PUNCT">.</pc>
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.16" msd="UPosTag=DET|Gender=Fem|Number=Sing" lemma="a">A</w>
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.17" msd="UPosTag=NOUN|Gender=Fem|Number=Sing" lemma="segurança">segurança</w>
    <!-- -->
    <w xml:id="ParlaMint-PT_2019-01-04.seg21.s.47" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="complementar" join="right">complementar</w>
    <pc xml:id="ParlaMint-PT_2019-01-04.seg21.s.48" msd="UPosTag=PUNCT">.</pc>
    <linkGrp targFunc="head argument" type="UD-SYN"><!-- --> </linkGrp>
  </s>
</seg>
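
To find other affected places, a rough check along these lines could be run over the TEI.ana files. This is only a sketch: the helper names (`local`, `unsplit_sentences`) are hypothetical, the set of sentence-final marks is an assumption, and the test "punctuation followed by a capitalised word" will produce false positives (e.g. abbreviations).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Return the tag name without a namespace prefix such as '{...}'."""
    return tag.rsplit("}", 1)[-1]


def unsplit_sentences(root, closers=(".", "?")):
    """Yield (sentence id, punctuation id) for every <pc> in `closers`
    that is followed, inside the same <s>, by a capitalised <w>."""
    for s in root.iter():
        if local(s.tag) != "s":
            continue
        # Tokens in document order; comments are dropped by the default parser.
        tokens = [el for el in s.iter() if local(el.tag) in ("w", "pc")]
        for tok, nxt in zip(tokens, tokens[1:]):
            is_closer = local(tok.tag) == "pc" and (tok.text or "").strip() in closers
            starts_upper = local(nxt.tag) == "w" and (nxt.text or "")[:1].isupper()
            if is_closer and starts_upper:
                yield s.get(XML_ID), tok.get(XML_ID)
```

Usage would be something like `list(unsplit_sentences(ET.parse(path).getroot()))` for each file; on the example above it should report the `.` with id `ParlaMint-PT_2019-01-04.seg21.s.15`.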