Telegraph articles failing
Closed this issue · 3 comments
NinoSkopac commented
Problem: missing paragraphs from the article
Repro code:
var read = require('node-readability');
read('http://www.telegraph.co.uk/news/2017/11/16/zimbabwes-robert-mugabe-wife-grace-insisting-finishes-term-priest/', function(err, article, meta) {
  // Bail out if the fetch or parse failed
  if (err) {
    console.error(err);
    return;
  }
  // Main article text
  console.log(article.textBody);
  // Close article to clean up jsdom and prevent leaks
  article.close();
});
andreskrey commented
After some investigation, I found that the missing paragraph is the following one:
<div class="component-content">
<p><span class="m_first-letter m_first-letter--flagged">Z</span>imbabwean intelligence reports seen by Reuters suggest that former security chief Emmerson Mnangagwa, who was ousted as vice-president this month, has been mapping out a post-Mugabe vision with the military and opposition for more than a year.</p>
</div>
The <span> surrounding the Z causes the node to skip the hasSinglePInsideElement check and the conversion to a P node. This leaves the node with a score of 0, so it never gets added to the article when the siblings of the topCandidate are scanned.
Not sure what a possible fix would be.
NinoSkopac commented
By the way, about half of the paragraphs are missing, not just one.
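To put a number on "about half", a small helper along these lines could compare the paragraphs in the page source with the extracted text. This is only a sketch: `findMissingParagraphs` and `normalize` are hypothetical names, not part of node-readability, and the sample strings are placeholders rather than real article text. It assumes the source paragraphs have already been collected as plain strings and that the extracted text comes from something like `article.textBody`.

```javascript
// Hypothetical helper, not part of node-readability.
// Collapses runs of whitespace so formatting differences do not cause false misses.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Returns the source paragraphs whose normalized text does not appear in the extracted text.
function findMissingParagraphs(sourceParagraphs, extractedText) {
  const haystack = normalize(extractedText);
  return sourceParagraphs
    .map(normalize)
    .filter((p) => p.length > 0 && !haystack.includes(p));
}

// Placeholder data standing in for the page's paragraphs and the extractor output.
const source = ['First paragraph.', 'Second   paragraph.'];
const extracted = 'Second paragraph.';
const missing = findMissingParagraphs(source, extracted);
console.log(missing.length + ' of ' + source.length + ' paragraphs missing');
```

A plain substring match is a deliberate simplification; it will miss paragraphs that the extractor rewrites or truncates, so treat the count as a rough estimate.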