dankito/Readability4J

Substack link gives java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.attr(java.lang.String)' on a null object reference

Pranoy1c opened this issue · 7 comments

Testing with this URL:

https://leratofoods.substack.com

Raw HTML available at:
https://pastebin.com/raw/ejQXETcV

When using:

Readability4JExtended(baseURL, rawHTML).parse().apply {
//.....
}

Throws error:

java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.attr(java.lang.String)' on a null object reference

at .......ArticleGrabber.getTextDirection(ArticleGrabber.kt:1123)
at .......ArticleGrabber.grabArticle(ArticleGrabber.kt:167)
at .......ArticleGrabber.grabArticle$default(ArticleGrabber.kt:57)
at .......Readability4J.parse(Readability4J.kt:95)
at .......CustomWebView$tryParsingForReader$1$1.invokeSuspend(WebViewActivity.kt:191)

Error is coming from here:

protected open fun getTextDirection(topCandidate: Element, doc: Document) {
    val ancestors = Arrays.asList<Element>(topCandidate.parent(), topCandidate).toMutableSet()
    ancestors.addAll(getNodeAncestors(topCandidate.parent()))
    ancestors.add(doc.body())
    ancestors.add(doc.selectFirst("html")) // needed as dir is often set on html tag

    ancestors.forEach { ancestor ->
        val articleDir = ancestor.attr("dir")
        if(articleDir.isNotBlank()) {
            this.articleDir = articleDir
            return
        }
    }
}

For some reason the ancestor at index 2 of the forEach is null, which seems to be because the doc parameter ends up empty. The doc appears to become empty in the caller function grabArticle at this line:

val elementsToScore = prepareNodes(doc, options)

prepareNodes does a lot of work, so it's a bit beyond my understanding to figure out why the document ends up empty.

For now I have added a workaround by changing that line to the following, which makes parsing succeed:

val articleDir = ancestor?.attr("dir") ?: ""
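In context, the workaround makes the loop skip null entries instead of crashing. A minimal sketch of the pattern, using a hypothetical stand-in class rather than Jsoup's Element so the idea stands on its own:

```kotlin
// Stand-in for org.jsoup.nodes.Element, just enough to show the pattern;
// the real workaround applies the same safe-call to Jsoup elements.
class Element(private val attrs: Map<String, String> = emptyMap()) {
    fun attr(key: String): String = attrs[key] ?: ""
}

// Returns the first non-blank "dir" attribute among the ancestors, or null.
// Null ancestors (e.g. a missing <html> element) are tolerated via `?.`.
fun firstTextDirection(ancestors: List<Element?>): String? {
    ancestors.forEach { ancestor ->
        val articleDir = ancestor?.attr("dir") ?: ""
        if (articleDir.isNotBlank()) return articleDir
    }
    return null
}
```

This only masks the symptom, of course; the null ancestor is still in the set.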

ktxed commented

I'm encountering a similar NPE when using Readability4J in a Java 11 project.
My stacktrace:

java.lang.IllegalArgumentException: Object must not be null
	at org.jsoup.helper.Validate.notNull(Validate.java:16)
	at org.jsoup.nodes.Attribute.<init>(Attribute.java:31)
	at org.jsoup.nodes.Attributes.put(Attributes.java:48)
	at org.jsoup.nodes.Node.attr(Node.java:100)
	at org.jsoup.nodes.Element.attr(Element.java:116)
	at org.jsoup.nodes.Element.val(Element.java:1132)
	at net.dankito.readability4j.processor.Preprocessor$removeScripts$1.invoke(Preprocessor.kt:43)
	at net.dankito.readability4j.processor.Preprocessor$removeScripts$1.invoke(Preprocessor.kt:12)
	at net.dankito.readability4j.processor.ProcessorBase.removeNodes(ProcessorBase.kt:24)
	at net.dankito.readability4j.processor.Preprocessor.removeScripts(Preprocessor.kt:42)
	at net.dankito.readability4j.processor.Preprocessor.prepareDocument(Preprocessor.kt:26)
	at net.dankito.readability4j.Readability4J.parse(Readability4J.kt:97)

(using 1.0.6)

NOTE: The same html is successfully parsed by readability.js
my url: https://www.handelsblatt.com/dpa/wirtschaft-handel-und-finanzen-stickoxid-ausstoss-opel-ruestet-software-aelterer-diesel-fahrzeuge-nach/27170884.html?ticket=ST-7148859-73eitcVGrOAsthWEXb4p-ap1

@ktxed The workaround solution I mentioned works on your link too.

Hey Folks,

first of all: thank you all, and especially @dankito, very much for readability4j; it's awesome!

I stumbled on the same issue and could pin it down for my test candidates:
https://www.stateright.rs/seeking-consensus.html

The issue at hand is that net.dankito.readability4j.processor.ArticleGrabber does some pruning based on heuristics. What makes my site (and apparently also @Pranoy1c's and @ktxed's sites) crash readability4j is that the following block

// Remove unlikely candidates
if(options.stripUnlikelyCandidates) {
    if(regEx.isUnlikelyCandidate(matchString) &&
            regEx.okMaybeItsACandidate(matchString) == false &&
            node.tagName() != "body" &&
            node.tagName() != "a") {
        node = this.removeAndGetNext(node, "Removing unlikely candidate")
        continue
    }
}

removes the <html> node due to the attributes added on these sites. I can prevent the crash by pruning the attributes on the doc before handing it over to readability4j, so there is no need to change any readability4j source.

Document doc = Jsoup.connect(warticle.url).get();
removeAttributes(doc.getElementsByTag("html").first());
Readability4J readability4J = new Readability4JExtended(warticle.url, doc);

private static void removeAttributes(Element e) {
    final List<String> attToRemove = new ArrayList<>();
    final Attributes at = e.attributes();
    for (final Attribute a : at) {
        attToRemove.add(a.getKey());
    }

    for (final String att : attToRemove) {
        e.removeAttr(att);
    }
}

I don't know if there is a nicer solution, but it would help a lot if ArticleGrabber didn't remove the html node :)
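The attribute pruning above can be expressed compactly in Kotlin. The sketch below uses a hypothetical mutable-map stand-in for Jsoup's attribute collection; the key point is snapshotting the attribute names before removing them, since mutating a collection while iterating it can fail. (Jsoup's Node also offers a clearAttributes() method, which may be a shorter route to the same effect.)

```kotlin
// Stand-in element with a mutable attribute map, mirroring the Java helper
// above in spirit; the names here are illustrative, not Jsoup's API.
class Element(val attributes: MutableMap<String, String>) {
    fun removeAttr(key: String) { attributes.remove(key) }
}

fun removeAttributes(e: Element) {
    // Snapshot the keys first: removing entries while iterating the live
    // key set directly would throw ConcurrentModificationException.
    val keys = e.attributes.keys.toList()
    for (key in keys) e.removeAttr(key)
}
```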

@dankito Thanks for your effort and this amazing library. I'm getting the same exception for a few URLs. e.g.
https://www.theinvestorspodcast.com/billionaire-book-club-executive-summary/money-master-the-game/

Do you have any estimate when we could have the fix?

ktxed commented

@zjamshidi can you push to Maven Central? Ah, never mind, I realized you pushed the fix to master in your fork :)

I fixed it with version 1.0.7 by filtering out null values (but best use version 1.0.8 now).
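A sketch of what "filtering out null values" could look like when the ancestor set is built (illustrative only; the stand-in class below is not Jsoup's Element, and this is not the actual 1.0.7 code):

```kotlin
class Element(val tag: String)

// Build the ancestor set while dropping nulls, so a missing body or
// <html> element (e.g. after aggressive pruning) can never reach attr().
fun collectAncestors(vararg candidates: Element?): Set<Element> =
    candidates.filterNotNull().toSet()
```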