When iterating over a layoutList, the page iterator moves on to layoutText after iterating over each individual list item
rsanaie opened this issue · 5 comments
When I'm traversing a layout object, after visiting a layout list, the iterator then moves on to individual list items, causing duplication of nodes visited. Is this intended?
This happens in .html() function as well. I know that you can use the childBlockIds of LayoutList to mark down which nodes have been visited and skip them later, but wanted to confirm this behaviour.
Thank you, you guys are awesome!
const ignoredBlocks: ApiBlockType[] = [
ApiBlockType.LayoutHeader,
ApiBlockType.LayoutFooter,
ApiBlockType.LayoutPageNumber
]
for (const page of doc.iterPages()) {
for (const layItem of page.layout.iterItems()) {
if (!ignoredBlocks.includes(layItem.blockType)) {
console.log(layItem.blockType)
console.log(layItem.text + "\n")
}
}
}
Hi, thanks for raising this!
Need to double-check I understand correctly here, but at first pass this is sounding like something we should fix...
It seems like the current state is:
page.layout.iterItems()
(and the correspondinglistItems()
) includes both theLAYOUT_LIST
and anyLAYOUT_TEXT
children it links to (because they all seem to appear asCHILD
blocks of the parentPAGE
, which is where this list is derived from)- This is confusing for anybody iterating through layout items, because they'd probably assume their code should deal with the
LAYOUT_LIST
all at once when it's seen - like it's iterating the top level of a tree and not a flat collection of all content nodes - It's also causing
page.html()
to duplicate content because the.html()
representation of theLAYOUT_LIST
already includes theLAYOUT_TEXT
items that the iterator picks up next.
Is that matching what you're seeing @rsanaie?
IMO 3/ would definitely be a bug. Agreeing on how to better handle 2/ might involve a few more edge cases...
This single-level List->Text tree is the only structure I'm aware of in current layout results, but at the service API level it seems like there's nothing to prevent adding more in future or defining deep nesting or more general graphs. I'd be really surprised if the service ever started returning circular relationships, but probably wouldn't completely rule out the possibility of shared children one day.
As a general rule, I've also been trying to drive as much parsing and iteration behaviour as possible directly from block Relationship
lists (minimizing internal state like LayoutGeneric._items
) to unblock future mutability of loaded documents at the expense of some runtime performance for certain repeated read operations.
...So I'm initially thinking yes we should change the behaviour of the page.layout.iterItems()
iterator, and probably even make the new behaviour the default, but long-term might need to be a bit more formal in defining e.g. single-level & recursive layout child iterators across the different kinds of LayoutItem
? For e.g. maybe every layout item could have child iterator methods like iterLayoutItems({ deep = false, skipBlockIds = {}})
or something - which would today always return empty except for LAYOUT_LIST
, and which the top-level page.layout.iterItems()
could use (in deep mode) to prevent exposing any nested children
Hi @athewsey yes that's the correct behavior I'm seeing.
I agree with you that #3 is a bug that needs to be fixed as the .html() is supposed to output usable data, but currently it's duplicating unnecessarily.
Using a tree analogy makes sense, maybe we should have an iterator that iterates through all nodes in some specific order, and then another iterator that visits children only (and leverage recursion)?
Hi @rsanaie
We have a draft bugfix release available at v0.4.1-alpha.1 to fix the serialization bug and provide some basic ability to:
- toggle
page.layout.{list/iter}Items()
between recursive and top-level-only behaviour - access
{list/iter}LayoutChildren()
from anyLayoutItem
(although in practice today onlyLAYOUT_LIST
should have anything, I think - rest will always be empty lists.
Maybe you'd be able to try it out and share thoughts?
Originally I was thinking we should make deep: false
the default behaviour of page.layout.{iter/list}Items()
, but maybe it's a breaking change for some folks... Might be worth revisiting for a v0.5 down the road?
It's working great as expected, thank you! It'd be ideal to specify which blocks should be skipped from html such as page number, header/footer) I'll create a new issue for that!
🙇 Thanks for the feedback!
The fix is now released to mainline version v0.4.1 on NPM with the merge of #178, so closing this issue & will refer to #179 for the other ask 👍