ageitgey/node-unfluff

try to take div itemprop="articleBody" into account

hecmec opened this issue · 0 comments

Hello,
thanks for your module, it is working nicely.

I've run into one small issue with text extraction.
Your calculateBestNode() function doesn't consider div or article elements, and it doesn't check for the schema.org attribute itemprop="articleBody". Nodes marked with this itemprop are usually very good candidates for the main text.

Example:
http://www.lemonde.fr/election-presidentielle-2017/article/2016/12/02/et-hollande-renonca-a-se-representer_5042285_4854003.html
For this page, your module picks the parent.parent of the article node, so the content menu ends up in the extracted text.
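
A rough sketch of the idea, in case it helps. This is not the module's actual code: the nodes here are plain objects of the form { tag, attrs, children, text } rather than the objects the module works with, and findArticleBody(), pickBestNode() and the fallback parameter are hypothetical names. The fallback stands in for whatever calculateBestNode() does today.

```javascript
// Hypothetical sketch: nodes are plain objects { tag, attrs, children, text },
// used only to illustrate the lookup order.

// Depth-first search for the first node whose itemprop includes "articleBody".
// itemprop may hold several space-separated values, so split before comparing.
function findArticleBody(node) {
  const itemprop = (node.attrs && node.attrs.itemprop) || '';
  if (itemprop.split(/\s+/).includes('articleBody')) {
    return node;
  }
  for (const child of node.children || []) {
    const found = findArticleBody(child);
    if (found) {
      return found;
    }
  }
  return null;
}

// Prefer the explicitly marked node; otherwise defer to the existing
// heuristic, passed in as `fallback` (an assumed stand-in).
function pickBestNode(root, fallback) {
  return findArticleBody(root) || fallback(root);
}
```

With this order, a page without the itemprop behaves exactly as before, and a page that has it skips the scoring step, so sibling menus would not be pulled in.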

Thanks
Hector