chimbori/crux

Hidden popup chosen as article

GomiGuchi opened this issue · 5 comments

Hi there.

First let me say that Crux is a very useful framework, and that it is the only one I have found so far that can deal with CJK.

I have found one website so far where the main article content is in a div with the class "article", which in turn contains an article element that also has "article" in its class, and yet a hidden div containing the text for creating a profile is always chosen as the article content instead. I think perhaps the weight of the "article" class/tag is not high enough, or the child elements of something that is hidden on the page don't have their scores lowered correctly.

An example article that always fails to be extracted is: https://www.news24.com/World/News/watch-indonesia-frees-bali-nine-drug-smuggler-lawrence-from-prison-20181121

I have been looking but I can't seem to find a way to customise the scoring without forking the project. Is there a way to do it that I just haven't found?

There’s no way to tweak the output without modifying the code. The idea is that any improvements that can be made should be made to the core repo, so everyone can get them.

I’ve added golden file tests; please feel free to fork the repo and tweak the code until the content is properly parsed, then send a PR.

Thank you very much for your response. I have cloned the repo and created a golden file test for that news article, but despite a few hours of tweaking the extractor, I still have not had any luck.

I have a feeling that something about the main content of the article is giving it a low score, rather than other elements being scored too highly. Is there anything you can see about the elements the article is nested in that might cause them to be ignored or scored too low?

I'll still keep plugging away in the meantime. Thanks again!

So for some reason the good people at News 24 wrap the entire article inside a form, which gets removed in preprocessing. 🤦‍♂️

The fix in this case is to stop removing all forms in removeScriptsStylesForms. All of the tests still pass. Would you be opposed to this change?
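To illustrate why dropping forms loses the article, here is a minimal, language-neutral sketch in Python using only the standard library. It is not Crux's code: the sample page, the `TextAfterStripping` class, and the `extract_text` helper are all hypothetical, and the real preprocessing step works on a parsed DOM rather than a streaming parser. The sketch only shows that text nested inside a stripped tag disappears along with it.

```python
from html.parser import HTMLParser


class TextAfterStripping(HTMLParser):
    """Collects text, skipping everything nested inside any of `strip_tags`."""

    def __init__(self, strip_tags):
        super().__init__()
        self.strip_tags = set(strip_tags)
        self.depth = 0  # number of stripped elements currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.strip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.strip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a stripped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, strip_tags):
    """Hypothetical helper: return the text left after stripping `strip_tags`."""
    parser = TextAfterStripping(strip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page shaped like the one in this issue:
# the whole article is wrapped in a form, and the profile prompt is outside it.
PAGE = """
<html><body>
  <form>
    <div class="article"><article class="article"><p>Article body text.</p></article></div>
    <script>var tracking = 1;</script>
  </form>
  <div style="display:none"><p>Create a profile to comment.</p></div>
</body></html>
"""

# Stripping forms takes the article with them; only the hidden prompt is left.
before = extract_text(PAGE, {"script", "style", "form"})

# Leaving forms alone keeps the article while scripts are still stripped.
after = extract_text(PAGE, {"script", "style"})

print(before)
print(after)
```

With forms in the strip set, the only candidate text left on such a page is the hidden profile prompt, which matches the behaviour reported above.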

Not at all, feel free to send a PR! As long as all existing tests pass, and all code follows the style guide, this is good to go!

Fixed in pull request #8