evt-project/evt-viewer-angular

<pb/> elements parsed incorrectly

Opened this issue · 6 comments

Apparently EVT 3 is currently parsing <pb/> elements not as simple milestones, gathering marked up content between one and the next as correctly done in EVT 2, but as part of the text structure, expecting them to be children of sibling containers:

<div>
   <pb n="1"/>
</div>
<div>
   <pb n="2"/>
</div>

When this structure is not present in the TEI document, i.e.

  1. when <pb/>s follow each other
<div>
   <pb n="1"/>
   <pb n="2"/>
   <pb n="3"/>
   <pb n="4"/>
</div>
  1. when they are nested within other containers, e.g. <p> or <lg> elements
<lg n="13">
...
  <l n="84">non quella vita dentro
    <mod change="#strato-2"><add place="above">a</add></mod> quelle grandi
  </l>
  <l n="85">case, quei sogni, quella verità;</l>              
  <pb facs="52.jpg" n="52"/>
  <l n="86">nel mio cuore una sola verità</l>
  <l n="87">c'era, <app>
...
</lg>

text is parsed and visualized incorrectly, with repeated text blocks in the relevant frame, see screenshots for page 5 and 6 of Saba's manuscript text of the Canzoniere. This bug has been discovered while working on the authorial philology view for the Saba 2021 project, but as you can see from the screenshots it was already present in EVT 3 (screenshots of the alpha version with Saba's transcripted text).

Schermata del 2023-12-31 12-16-56
Schermata del 2023-12-31 12-17-24

After some analysis it seems that the page is correctly parsed until this line of the function parsePageContent of the structure-xml-parser.service.ts (line 102) which I debugged with a couple of crude console.log (sorry)

Screenshot 2024-04-30 181139
Screenshot 2024-04-30 182526

Until that line the pb are correctly parsed but that function getEditionOrigNode is invoked and reverts the node at its original state and then before the removing of the exceeding content (= which is to say all the content after the following pb).
It's important to note that this mentioned exceeding content is redundant because already and correctly contained in the next page output.

Here an xml to easily test the issue: saba_pb_bug.zip
You can verify that the page output in app for page number 4 displays more content (all the "p" element to be precise) after the next pb (number 6) but it should stop with it instead.

Dump of the two console.log before and after line 102, with the pb n="6" marked in light blue.
pre_nt
post_nt

Until that line the pb are correctly parsed but that function getEditionOrigNode is invoked and reverts the node at its original state and then before the removing of the exceeding content (= which is to say all the content after the following pb).
It's important to note that this mentioned exceeding content is redundant because already and correctly contained in the next page output.

If that getEditionOriginNode isn't needed and adds again redundancy which only cause problems, wouldn't be easier to solve this issue by just deleting it?

Until that line the pb are correctly parsed but that function getEditionOrigNode is invoked and reverts the node at its original state and then before the removing of the exceeding content (= which is to say all the content after the following pb).
It's important to note that this mentioned exceeding content is redundant because already and correctly contained in the next page output.

If that getEditionOriginNode isn't needed and adds again redundancy which only cause problems, wouldn't be easier to solve this issue by just deleting it?

You got a point imho. Since we can't afford to wait longer to fix this pb issue, for now we are commenting that line and thus leaving the node as it was before the getEditionOrigNode invocation.
I wonder why that function is invoked though.
Also I'm troubled by if it's needed in some particular cases or exceptions, I look forward for a better solution from whom managed that part or who has ideas to investigate further.

I had to analyse the structure-xml-parser service to implement the DEPA. From my understanding, the problem seems to be that we don't preserve the XML tree when parsing a page.

Basically, the problem arises when we call getElementsBetweenTreeNode that uses the Range API and clones the nodes inserted and returns it. Here's the function:

export function getElementsBetweenTreeNode(start: any, end: any): XMLElement[] {
  const range = document.createRange();
  range.setStart(start, 0);
  range.setEnd(end, end.length || end.childNodes.length);
  const commonAncestorChild = Array.from((range.commonAncestorContainer as XMLElement).children);
  const startIdx = commonAncestorChild.indexOf(start);
  const endIdx = commonAncestorChild.indexOf(end);
  const rangeNodes = commonAncestorChild.slice(startIdx, endIdx).filter((c) => c !== start);
  rangeNodes.forEach((c: XMLElement) => c.setAttribute('xpath', xpath(c).replace(/-/g, '/')));
  const fragment = range.cloneContents();
  const nodes = Array.from(fragment.childNodes);

  return nodes as XMLElement[];
}

As you can see, before returning it, the range's contents are cloned, which detach them from the document. The new parent is now a document fragment (as stated in the cloneContents() API).

Hence, the returned nodes aren't part of the edition XML tree anymore, that's why the getEditionOrigElem is needed. Otherwise, we wouldn't have any way of climbing the tree to find their parents. Why's that needed? Because we need to be able to distinguish <front> content to <body> content.

parsePageContent(doc: Document, pageContent: OriginalEncodingNodeType[]): Array<ParseResult<GenericElement>> {
    return pageContent
      .map((node) => {

        const origEl = getEditionOrigNode(node, doc);

        if (origEl.nodeName === this.frontTagName || isNestedInElem(origEl, this.frontTagName)) {
          ...
        }

        ...
      })
      .reduce((x, y) => x.concat(y), []);
  }

The focus here is the isNestedInElem function, which climbs up the tree to see if origEl is nested in a <front> tag.

I can't check myself the code now but there must be a better way to distinguish between front or body content, like for example two different invocations for the two cases in the parent(s) and/or a parameter in the invoker parent function.

I remember there was already a distinction between front and body pages on a superior level of that service, maybe we can pass down that information?

The elements end up being xml elements without node referrals anyway (Array<ParseResult<GenericElement>>), so there's no need for such an expansive operation like retrieve the node and navigating its ancestors to do such a simple check, in my opinion.

I remember there was already a distinction between front and body pages on a superior level of that service, maybe we can pass down that information?

Yes, we could split the functions into two different one, we'd create a little bit of duplication but in my opinion that would be fine for the purpose of avoid doing that.

However, I don't agree that we shouldn't care about node referrals only because in the result there won't be any. This is one of the main issues I'm having with DEPA.

If it's critical to resolve this issue right now, we can try dampening it by commenting getEditionOrigElement and changing the way to distinguish between front and body content, but the structure-xml-parser.service.ts needs a proper refactoring that includes the preservation of node referrals and the calling of a class that implements the parse method