[BUG] Duplicated nodes in output from `to_html()` and `to_text()`
jinkjonks opened this issue · 4 comments
Block json:
[
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 0,
"level": 0,
"page_idx": 0,
"sentences": ["History"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 1,
"level": 1,
"page_idx": 0,
"sentences": ["Jenny"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 2,
"level": 2,
"page_idx": 0,
"sentences": ["Birth"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 3,
"level": 3,
"page_idx": 0,
"sentences": [
"Ship of Fools is a satirical allegory in German verse published in 1949 in Basel, Switzerland by the humanist and theologian Segbastian Brant.",
"An infant Jenny was discovered on board in a wicker basket and tossed overboard to make room for Lord Nelson’s barrel.",
"She miraculously floated to an island which was later named Australia."
],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 4,
"level": 1,
"page_idx": 0,
"sentences": ["Cat in the Hat "],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 5,
"level": 2,
"page_idx": 0,
"sentences": ["Dr Seuss Novel"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 6,
"level": 3,
"page_idx": 0,
"sentences": [
"“The Cat in the Hat” was well-received however, future historians agree that this was not Dr Seuss’s best work.",
"Seuss tried again a year later to garner better affections from his audience by publishing its sequel “The Cat in the Hat Comes Back” to no avail.",
"Seuss would go on to publish “Green Eggs and Ham” three years later in 1960 which garnered critical acclaim."
],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 7,
"level": 2,
"page_idx": 0,
"sentences": ["2003 Live Movie Adaptation"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 8,
"level": 3,
"page_idx": 0,
"sentences": ["Widely regarded as a cult classic."],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 9,
"level": 1,
"page_idx": 0,
"sentences": ["Hoy’s Self-Defenestration of 1993"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 10,
"level": 2,
"page_idx": 0,
"sentences": ["This is an event."],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 11,
"level": 1,
"page_idx": 0,
"sentences": ["Other Notable Events"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 12,
"level": 2,
"page_idx": 0,
"sentences": ["1. The Hyatt Regency Walkway Collapse"],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 13,
"level": 2,
"page_idx": 0,
"sentences": ["2. George Miller’s Happy Feet"],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 14,
"level": 2,
"page_idx": 0,
"sentences": ["3. Birth of Shadow Wong circa 2012"],
"tag": "para"
},
{
"bbox": [],
"block_class": "nlm-text-header",
"block_idx": 15,
"level": 1,
"page_idx": 0,
"sentences": ["Conclusion"],
"tag": "header"
},
{
"bbox": [],
"block_class": "nlm-text-body",
"block_idx": 16,
"level": 2,
"page_idx": 0,
"sentences": ["Good times."],
"tag": "para"
}
]
Expected:
<html>
<body>
<h1>History</h1>
<h2>Jenny</h2>
<h3>Birth</h3>
<p>
Ship of Fools is a satirical allegory in German verse published in 1949 in
Basel, Switzerland by the humanist and theologian Segbastian Brant. An
infant Jenny was discovered on board in a wicker basket and tossed
overboard to make room for Lord Nelson’s barrel. She miraculously
floated to an island which was later named Australia.
</p>
<h2>Cat in the Hat</h2>
<h3>Dr Seuss Novel</h3>
<p>
“The Cat in the Hat” was well-received however, future
historians agree that this was not Dr Seuss’s best work. Seuss tried
again a year later to garner better affections from his audience by
publishing its sequel “The Cat in the Hat Comes Back” to no
avail. Seuss would go on to publish “Green Eggs and Ham” three
years later in 1960 which garnered critical acclaim.
</p>
<h3>2003 Live Movie Adaptation</h3>
<p>Widely regarded as a cult classic.</p>
<h2>Hoy’s Self-Defenestration of 1993</h2>
<p>This is an event.</p>
<h2>Other Notable Events</h2>
<p>1. The Hyatt Regency Walkway Collapse</p>
<p>2. George Miller’s Happy Feet</p>
<p>3. Birth of Shadow Wong circa 2012</p>
<h2>Conclusion</h2>
<p>Good times.</p>
</body>
</html>
However actual has h2 and all its children repeated
<html>
<body>
<h1>History</h1>
<h2>Jenny</h2>
<h3>Birth</h3>
<p>
Ship of Fools is a satirical allegory in German verse published in 1949 in
Basel, Switzerland by the humanist and theologian Segbastian Brant. An
infant Jenny was discovered on board in a wicker basket and tossed
overboard to make room for Lord Nelson’s barrel. She miraculously
floated to an island which was later named Australia.
</p>
<h2>Cat in the Hat</h2>
<h3>Dr Seuss Novel</h3>
<p>
“The Cat in the Hat” was well-received however, future
historians agree that this was not Dr Seuss’s best work. Seuss tried
again a year later to garner better affections from his audience by
publishing its sequel “The Cat in the Hat Comes Back” to no
avail. Seuss would go on to publish “Green Eggs and Ham” three
years later in 1960 which garnered critical acclaim.
</p>
<h3>2003 Live Movie Adaptation</h3>
<p>Widely regarded as a cult classic.</p>
<h2>Hoy’s Self-Defenestration of 1993</h2>
<p>This is an event.</p>
<h2>Other Notable Events</h2>
<p>1. The Hyatt Regency Walkway Collapse</p>
<p>2. George Miller’s Happy Feet</p>
<p>3. Birth of Shadow Wong circa 2012</p>
<h2>Conclusion</h2>
<p>Good times.</p>
<h2>Jenny</h2>
<h3>Birth</h3>
<p>
Ship of Fools is a satirical allegory in German verse published in 1949 in
Basel, Switzerland by the humanist and theologian Segbastian Brant. An
infant Jenny was discovered on board in a wicker basket and tossed
overboard to make room for Lord Nelson’s barrel. She miraculously
floated to an island which was later named Australia.
</p>
<h3>Birth</h3>
<p>
Ship of Fools is a satirical allegory in German verse published in 1949 in
Basel, Switzerland by the humanist and theologian Segbastian Brant. An
infant Jenny was discovered on board in a wicker basket and tossed
overboard to make room for Lord Nelson’s barrel. She miraculously
floated to an island which was later named Australia.
</p>
<h2>Cat in the Hat</h2>
<h3>Dr Seuss Novel</h3>
<p>
“The Cat in the Hat” was well-received however, future
historians agree that this was not Dr Seuss’s best work. Seuss tried
again a year later to garner better affections from his audience by
publishing its sequel “The Cat in the Hat Comes Back” to no
avail. Seuss would go on to publish “Green Eggs and Ham” three
years later in 1960 which garnered critical acclaim.
</p>
<h3>2003 Live Movie Adaptation</h3>
<p>Widely regarded as a cult classic.</p>
<h3>Dr Seuss Novel</h3>
<p>
“The Cat in the Hat” was well-received however, future
historians agree that this was not Dr Seuss’s best work. Seuss tried
again a year later to garner better affections from his audience by
publishing its sequel “The Cat in the Hat Comes Back” to no
avail. Seuss would go on to publish “Green Eggs and Ham” three
years later in 1960 which garnered critical acclaim.
</p>
<h3>2003 Live Movie Adaptation</h3>
<p>Widely regarded as a cult classic.</p>
<h2>Hoy’s Self-Defenestration of 1993</h2>
<p>This is an event.</p>
<h2>Other Notable Events</h2>
<p>1. The Hyatt Regency Walkway Collapse</p>
<p>2. George Miller’s Happy Feet</p>
<p>3. Birth of Shadow Wong circa 2012</p>
<h2>Conclusion</h2>
<p>Good times.</p>
</body>
</html>
Similar output for to_text()
Debug from LayoutReader
which does not have the same problem:
header (5) History
- header (1) Jenny
-- header (1) Birth
--- para (0) Ship of Fools is a satirical allegory in German verse published in 1949 in Basel, Switzerland by the humanist and theologian Segbastian Brant.
An infant Jenny was discovered on board in a wicker basket and tossed overboard to make room for Lord Nelson’s barrel.
She miraculously floated to an island which was later named Australia.
- header (2) Cat in the Hat
-- header (1) Dr Seuss Novel
--- para (0) “The Cat in the Hat” was well-received however, future historians agree that this was not Dr Seuss’s best work.
Seuss tried again a year later to garner better affections from his audience by publishing its sequel “The Cat in the Hat Comes Back” to no avail.
Seuss would go on to publish “Green Eggs and Ham” three years later in 1960 which garnered critical acclaim.
-- header (1) 2003 Live Movie Adaptation
--- para (0) Widely regarded as a cult classic.
- header (1) Hoy’s Self-Defenestration of 1993
-- para (0) This is an event.
- header (3) Other Notable Events
-- para (0) 1. The Hyatt Regency Walkway Collapse
-- para (0) 2. George Miller’s Happy Feet
-- para (0) 3. Birth of Shadow Wong circa 2012
- header (1) Conclusion
-- para (0) Good times.
Additional notes:
- No duplication occurs over
chunks()
. - Duplication occurs over
sections()
I wrote this workaround which extends Document
I also found the same behaviour. I was trying to parse this pdf and noticed that some of the text appears twice in the output of Document.to_text()
method.
The issue occurs when the document tree has the following (or similar) structure:
Section-1
├── Section-2
├── Section-3
Here is the current implementation of the to_text
method:
def to_text(self):
"""
Returns text of a document by iterating through all the sections '\n'
"""
text = ""
for section in self.sections():
text = text + section.to_text(include_children=True, recurse=True) + "\n"
return text
Therefore, the text of Section-2
is included in the output when the to_text
method is called (recursively) for Section-1
as well as forSection-2
.
Similarly, the text of Section-3
is also duplicated in the output.
Hi there, I noticed the latest release failed, and the fix for this issue doesn't seem to have been published yet. I would appreciate if you could take a look when you have a chance, please.