Text missing
akreienbring opened this issue · 5 comments
Hi!
When I look at the page source of an HTML page, I can see that there is text in some span or p tags. But this text doesn't show up in the result that unfluff returns when scraping.
Question: Why not? What can I do to extract all the text from the HTML document? Is there any configuration for the filtering that is applied?
Hi, the goal of unfluff is to find the "main content" of a page. This means the biggest piece of text (such as the main article body on a news site).
It's not a perfect algorithm, but it works well in most cases for news sites, etc. If you find a page where it doesn't work correctly, I can try to take a look but I can't promise it will work for every page.
If you just want all the text in an HTML page with no filtering, you could do that in jQuery with something like this:
$("body").text()
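If jQuery or a DOM is not available (for example in a plain Node script), a rough approximation of the same idea could look like the sketch below. The function name `allText` is made up for this example, and the regex-based tag stripping is an assumption that only holds for simple markup: it is not a real HTML parser, it does not decode entities, and unlike `.text()` it deliberately skips the contents of script and style elements.

```javascript
// Hypothetical helper: approximate "all visible text" without a DOM.
// Assumption: regex stripping is acceptable for a quick look at simple markup.
function allText(html) {
  return html
    // Drop script and style blocks together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}

const sample = "<body><p>Hello <span>world</span></p><script>var x = 1;</script></body>";
console.log(allText(sample));
```

For anything beyond a quick check, a proper HTML parser is the safer choice.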
Thank you! I was looking for something like this. The question is whether the important part of the text is extracted. Is there any article or documentation about the algorithm that unfluff uses?
I'm planning to mash up tools that can measure the relatedness / similarity of websites. Extracting the relevant text is therefore the essential first step in the process.
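To illustrate what a later similarity step might look like once the text is extracted, here is a minimal sketch using cosine similarity over word counts. This is only one possible approach and not necessarily what the final tool uses; the function names `wordCounts` and `cosineSimilarity` are made up for this example, and the tokenizer assumes plain English text (ASCII letters and digits only).

```javascript
// Hypothetical helper: count how often each lowercase word occurs in a text.
// Assumption: words are runs of ASCII letters and digits.
function wordCounts(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Cosine similarity of the two word-count vectors: 0 means no shared words,
// values near 1 mean nearly identical word distributions.
function cosineSimilarity(textA, textB) {
  const a = wordCounts(textA);
  const b = wordCounts(textB);
  let dot = 0;
  for (const [word, n] of a) {
    dot += n * (b.get(word) || 0);
  }
  const norm = (m) => Math.sqrt([...m.values()].reduce((sum, n) => sum + n * n, 0));
  const denom = norm(a) * norm(b);
  // An empty text has no direction, so treat its similarity as 0.
  return denom === 0 ? 0 : dot / denom;
}

console.log(cosineSimilarity("the cat sat on the mat", "the cat lay on the rug"));
```

The two inputs would be the extracted main text of each page, for example the `text` field of the result that unfluff returns.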
Unfluff was originally based on Goose and the algorithm it uses, so the best description of the algorithm is in their docs here: https://github.com/jiminoc/goose/wiki#workflow
Hi,
just to let you know: I created a server that uses unfluff for the text extraction part. If you'd like to have a look: http://b-semantic.elasticbeanstalk.com/public
Cool! Thanks!