dankito/Readability4J

Keep specific parts of page

zjamshidi opened this issue · 6 comments

Hi
Is there any way to force specific tags/classes to be kept in the output? Some way to ignore certain tags?
To be clearer, I want to keep the table of contents after converting this link

I tried specifying the class "lwptoc_i" in additionalClassesToPreserve, but it doesn't help.

Thanks again for your great library.

No, sadly additionalClassesToPreserve does not do that.

Actually, it should have been named better (I took the name from Readability.js).

What it does: By default, Readability removes all class names from all elements in a post-processing step; only "readability-styled" and "page" are kept.

If you want more class names to be preserved on elements than just the default ones, add them to additionalClassesToPreserve.

So it preserves only the class names on the elements, not the elements that carry those classes.
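To illustrate the distinction, here is a minimal sketch of what such a class clean-up does to a single class attribute. It is a simplified model and not Readability4J's actual code; the function name cleanClassAttribute is made up for this example.

```kotlin
// Simplified model of the class clean-up step; NOT Readability4J's actual code.
// The two default names are the ones mentioned above.
val DEFAULT_CLASSES_TO_PRESERVE = setOf("readability-styled", "page")

/** Returns the class attribute value that is left after the clean-up. */
fun cleanClassAttribute(
    classAttribute: String,
    additionalClassesToPreserve: Collection<String> = emptyList()
): String {
    val preserve = DEFAULT_CLASSES_TO_PRESERVE + additionalClassesToPreserve
    return classAttribute
        .split(Regex("\\s+"))          // a class attribute is a whitespace-separated list
        .filter { it in preserve }     // drop every name that is not in the preserve set
        .joinToString(" ")
}
```

With "lwptoc_i" in additionalClassesToPreserve, an element with class "lwptoc_i foo" would end up with class "lwptoc_i". Whether the element itself survives is decided elsewhere, which is why the setting does not help here.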

Your problem here is that the individual TOC elements have too short a text length (contentLength < 25), so each TOC element gets removed step by step, and in the end the whole TOC.
See ArticleGrabber.prepArticle() -> this.cleanConditionally(articleContent, "div", options) -> val haveToRemove = ...
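The cascade described above can be shown with a small, heavily simplified model. It assumes a text-length threshold of 25 as the only criterion; the real cleanConditionally checks more than that, and the Node class and cleanShortNodes function are hypothetical names used only for this sketch.

```kotlin
// Hypothetical, heavily simplified model of conditional cleaning.
// The real cleanConditionally() uses more criteria than text length.
class Node(val ownText: String = "", val children: MutableList<Node> = mutableListOf()) {
    /** Text of this node plus the text of all remaining descendants. */
    fun text(): String = ownText + children.joinToString("") { it.text() }
}

val MIN_CONTENT_LENGTH = 25

/** Removes, bottom-up, every descendant whose remaining text is shorter than the threshold. */
fun cleanShortNodes(node: Node) {
    // Clean the subtrees first, so a parent is judged by what is left of its children.
    node.children.forEach { cleanShortNodes(it) }
    node.children.removeAll { it.text().length < MIN_CONTENT_LENGTH }
}
```

In this model a TOC container whose entries are each shorter than 25 characters first loses every entry and is then left with no text of its own, so it is removed as well, even if the entries together were longer than 25 characters.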

If you can suggest a clean, generic solution, I could fix that (or, preferably, send me a pull request).

I see.
Could we pass a list of classes to the prepArticle method and, if the tag's class attribute is in the list, set haveToRemove to false, and otherwise check the other criteria?
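A minimal sketch of that idea, assuming the decision can be reduced to a single function. The parameter names classesToKeep and failsOtherCriteria are hypothetical and not part of Readability4J's API.

```kotlin
// Hypothetical sketch of the proposal; names are assumptions, not Readability4J API.
fun haveToRemove(
    classNames: Set<String>,        // class names found on the element
    classesToKeep: Set<String>,     // caller-supplied list of classes to protect
    failsOtherCriteria: Boolean     // result of the existing checks (e.g. content length)
): Boolean {
    // An element carrying any protected class is never removed.
    if (classNames.any { it in classesToKeep }) return false
    // Otherwise fall back to the existing criteria.
    return failsOtherCriteria
}
```

For example, haveToRemove(setOf("lwptoc_i"), setOf("lwptoc_i"), true) would return false, so the TOC element would stay even though its text is short.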

@dankito any update on this issue? Now the problem is more critical for us. None of the H1 heading tags are shown in the output.

Oh! I just saw that all h1 tags are removed in the prepArticle method. How can we keep them?

This has now also been fixed upstream in Readability.js on the 3rd of December (see mozilla/readability@11093f0): h1 tags are now also kept there.

Their whole commit was too complex to grasp at a glance for a quick fix, but I have now at least removed the part that causes h1 tags to get removed from the output.

See version 1.0.6; it should be visible on Maven Central in a few hours. Please tell me if everything works correctly for you.

Thank you. It seems the H1 tags are back with the recent version. It helps us a lot. You are awesome.