mozilla/readability

Telegraph articles failing

Closed this issue · 3 comments

Repro URL: http://www.telegraph.co.uk/news/2017/11/16/zimbabwes-robert-mugabe-wife-grace-insisting-finishes-term-priest/

Problem: missing paragraphs from the article

Repro code:

var read = require('node-readability');

read('http://www.telegraph.co.uk/news/2017/11/16/zimbabwes-robert-mugabe-wife-grace-insisting-finishes-term-priest/', function(err, article, meta) {
    // Bail out on fetch or parse errors before touching the article
    if (err) {
        console.error(err);
        return;
    }

    // Main Article
    console.log(article.textBody);

    // Close article to clean up jsdom and prevent leaks
    article.close();
});

After some investigation, I found that one of the missing paragraphs is the following:

<div class="component-content">
<p><span class="m_first-letter m_first-letter--flagged">Z</span>imbabwean intelligence reports seen by Reuters suggest that former security chief Emmerson Mnangagwa, who was ousted as vice-president this month, has been mapping out a post-Mugabe vision with the military and opposition for more than a year.</p>
</div>

The <span> surrounding the Z seems to make this node skip the hasSinglePInsideElement check, so it is never converted to a P node. The node is left with a score of 0 and is never added to the article when the siblings of the topCandidate are scanned.

Not sure what a possible fix would be.

By the way, about half of the paragraphs are missing, not just one.
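
To put a number on "about half", a small helper can compare the extracted text against paragraphs copied by hand from the page. This is only a sketch: `normalize` and `findMissingParagraphs` are hypothetical names, not part of node-readability, and the strings below are made-up placeholders rather than real article text.

```javascript
// Hypothetical helper (not part of node-readability): report which expected
// paragraphs are absent from the extracted text.
function normalize(text) {
  // Collapse runs of whitespace so line wrapping does not cause false misses.
  return text.replace(/\s+/g, ' ').trim();
}

function findMissingParagraphs(extractedText, expectedParagraphs) {
  var haystack = normalize(extractedText);
  // Keep only the paragraphs that cannot be found in the extracted text.
  return expectedParagraphs.filter(function (paragraph) {
    return haystack.indexOf(normalize(paragraph)) === -1;
  });
}

// Example with made-up strings standing in for real article text.
var extracted = 'First paragraph.\n\nThird   paragraph.';
var expected = ['First paragraph.', 'Second paragraph.', 'Third paragraph.'];
var missing = findMissingParagraphs(extracted, expected);
console.log(missing.length + ' of ' + expected.length + ' paragraphs missing');
```

Inside the repro callback, `article.textBody` would be passed as the first argument, with the expected paragraphs pasted in from the live page.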

gijsk commented

Going to close this per #408. :-)