LanguageMachines/frog

Frog creates invalid FoLiA

kosloot opened this issue · 1 comments

Consider the following document:

<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="strbug" generator="libfolia-v1.14" version="1.5.0">
  <metadata type="native">
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="strbug.text">
    <p xml:id="p.1">
      <t>Chipssnĳden</t>
      <str xml:id="str.1">
        <t>Chipssnĳden</t>
      </str>
    </p>
  </text>
</FoLiA>

It contains the obsolete Dutch ĳ ligature (U+0133).

Frog processes this file and replaces the ĳ ligature with the two separate letters ij, which is wrong.
The replacement should only be done when an '--outputclass' is specified that differs from the '--inputclass' (which is "current" here).

In this case '--inputclass' and '--outputclass' are not specified, so both are "current", but that is apparently interpreted incorrectly.
This yields the erroneous document:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="strbug" generator="libfolia-v1.14" version="1.5.0">
  <metadata type="native">
    <annotations>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2018-10-18T11:11:43" set="tokconfig-nld"/>
      <pos-annotation annotator="frog-mbpos-1.0" annotatortype="auto" datetime="2018-10-18T11:11:43" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn"/>
      <lemma-annotation annotator="frog-mblem-1.1" annotatortype="auto" datetime="2018-10-18T11:11:43" set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl"/>
    </annotations>
  </metadata>
  <text xml:id="strbug.text">
    <p xml:id="p.1">
      <t>Chipssnĳden</t>
      <str xml:id="str.1">
        <t>Chipssnĳden</t>
      </str>
      <s xml:id="p.1.s.1">
        <w xml:id="p.1.s.1.w.1" class="WORD">
          <t>Chipssnijden</t>
          <pos class="N(soort,mv,basis)" confidence="0.942748" head="N">
            <feat class="soort" subset="ntype"/>
            <feat class="mv" subset="getal"/>
            <feat class="basis" subset="graad"/>
          </pos>
          <lemma class="chipssnijden"/>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

In this document the 'deeper' text Chipssnijden (two letters, i and j) from the Word does not match the text Chipssnĳden (ĳ ligature) from the Paragraph, as folialint points out:

inconsistent text: node p(p.1) has a mismatch for the text in set:current
the element text ='Chipssnĳden'
the deeper text ='Chipssnijden'
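
The two strings look alike but differ at the code-point level, which is why a strict text-consistency check fails. The following Python sketch only illustrates that difference. It is not folialint's or frog's code, and the helper names are hypothetical. Python's NFKC normalization is used purely as a stand-in because it happens to map U+0133 to the letters "ij".

```python
import unicodedata

# Paragraph-level text keeps the ligature (U+0133); the word-level text
# produced by the tokenizer uses the two letters "i" + "j" instead.
paragraph_text = "Chipssn\u0133den"
word_text = "Chipssnijden"


def texts_match(element_text: str, deeper_text: str) -> bool:
    """Strict code-point comparison (hypothetical consistency check)."""
    return element_text == deeper_text


def fold_ligature(text: str) -> str:
    """Stand-in for the replacement: NFKC maps U+0133 to 'ij'."""
    return unicodedata.normalize("NFKC", text)


# The raw texts differ, and they differ in length as well.
print(texts_match(paragraph_text, word_text))
print(len(paragraph_text), len(word_text))

# Only after folding the ligature do the two texts agree.
print(texts_match(fold_ligature(paragraph_text), word_text))
```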

OK, the problem was quite obscure:
frog -x filename.xml -X out.xml
DID work correctly

BUT
frog -X out.xml filename.xml
DIDN'T

The reason is that when frog detected an XML file by its extension, it didn't check whether the inputclass was the same as the outputclass.
When -x was used to force XML input, this WAS checked.
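
The intended rule can be sketched as follows. This is a minimal Python illustration, not frog's actual (C++) implementation: the function names and the "normalized" class name are hypothetical. The point is that the same check has to run regardless of how the XML input was detected.

```python
LIGATURE = "\u0133"  # the Dutch ij ligature


def needs_text_conversion(input_class: str = "current",
                          output_class: str = "current") -> bool:
    """Text may only be rewritten when the two text classes differ."""
    return input_class != output_class


def word_text(text: str,
              input_class: str = "current",
              output_class: str = "current") -> str:
    """Return the text to store on the Word (hypothetical helper)."""
    if needs_text_conversion(input_class, output_class):
        # Different classes: the rewritten text goes into its own class,
        # so it cannot conflict with the original paragraph text.
        return text.replace(LIGATURE, "ij")
    # Same class: keep the text untouched so it matches the parent's text.
    return text


source = "Chipssn" + LIGATURE + "den"

# Default classes (both "current"): the ligature is preserved.
print(word_text(source) == source)

# An explicitly different output class: the ligature may be replaced.
print(word_text(source, output_class="normalized"))
```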

Fixed now.