Substack link gives java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.attr(java.lang.String)' on a null object reference
Pranoy1c opened this issue · 7 comments
Testing with this URL:
https://leratofoods.substack.com
Raw HTML available at:
https://pastebin.com/raw/ejQXETcV
When using:
Readability4JExtended(baseURL, rawHTML).parse().apply {
    //.....
}
Throws error:
java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.attr(java.lang.String)' on a null object reference
at .......ArticleGrabber.getTextDirection(ArticleGrabber.kt:1123)
at .......ArticleGrabber.grabArticle(ArticleGrabber.kt:167)
at .......ArticleGrabber.grabArticle$default(ArticleGrabber.kt:57)
at .......Readability4J.parse(Readability4J.kt:95)
at .......CustomWebView$tryParsingForReader$1$1.invokeSuspend(WebViewActivity.kt:191)
Error is coming from here:
protected open fun getTextDirection(topCandidate: Element, doc: Document) {
    val ancestors = Arrays.asList<Element>(topCandidate.parent(), topCandidate).toMutableSet()
    ancestors.addAll(getNodeAncestors(topCandidate.parent()))
    ancestors.add(doc.body())
    ancestors.add(doc.selectFirst("html")) // needed as dir is often set on html tag

    ancestors.forEach { ancestor ->
        val articleDir = ancestor.attr("dir")
        if(articleDir.isNotBlank()) {
            this.articleDir = articleDir
            return
        }
    }
}
For some reason the ancestor at index 2 of the forEach is null, which happens because the doc parameter ends up empty. The doc seems to become empty in the caller function grabArticle at this line:
val elementsToScore = prepareNodes(doc, options)
prepareNodes does a lot of work, so why the document becomes empty is a bit beyond my understanding. For now I have added a workaround that makes parsing succeed:
val articleDir = ancestor?.attr("dir") ?: ""
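The safe-call workaround above simply skips null ancestors instead of crashing. A minimal, self-contained sketch of that null handling, using a hypothetical FakeElement stand-in for org.jsoup.nodes.Element so it runs without a jsoup dependency:

```kotlin
// FakeElement is a hypothetical stand-in for org.jsoup.nodes.Element,
// only here to demonstrate the null handling without jsoup on the classpath.
class FakeElement(private val attrs: Map<String, String> = emptyMap()) {
    fun attr(name: String): String = attrs[name] ?: ""
}

// Mirrors the idea of the workaround: skip null ancestors, then take the
// first non-blank "dir" attribute, instead of dereferencing a null Element.
fun firstTextDirection(ancestors: List<FakeElement?>): String? =
    ancestors.filterNotNull()
        .map { it.attr("dir") }
        .firstOrNull { it.isNotBlank() }

fun main() {
    // doc.selectFirst("html") returned null, so the ancestor set contains a null
    val ancestors = listOf(FakeElement(), null, FakeElement(mapOf("dir" to "rtl")))
    println(firstTextDirection(ancestors)) // prints "rtl" instead of throwing an NPE
}
```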
I'm encountering a similar NPE when using Readability4J in a Java 11 project.
My stacktrace:
java.lang.IllegalArgumentException: Object must not be null
at org.jsoup.helper.Validate.notNull(Validate.java:16)
at org.jsoup.nodes.Attribute.<init>(Attribute.java:31)
at org.jsoup.nodes.Attributes.put(Attributes.java:48)
at org.jsoup.nodes.Node.attr(Node.java:100)
at org.jsoup.nodes.Element.attr(Element.java:116)
at org.jsoup.nodes.Element.val(Element.java:1132)
at net.dankito.readability4j.processor.Preprocessor$removeScripts$1.invoke(Preprocessor.kt:43)
at net.dankito.readability4j.processor.Preprocessor$removeScripts$1.invoke(Preprocessor.kt:12)
at net.dankito.readability4j.processor.ProcessorBase.removeNodes(ProcessorBase.kt:24)
at net.dankito.readability4j.processor.Preprocessor.removeScripts(Preprocessor.kt:42)
at net.dankito.readability4j.processor.Preprocessor.prepareDocument(Preprocessor.kt:26)
at net.dankito.readability4j.Readability4J.parse(Readability4J.kt:97)
(using 1.0.6)
NOTE: The same html is successfully parsed by readability.js
my url: https://www.handelsblatt.com/dpa/wirtschaft-handel-und-finanzen-stickoxid-ausstoss-opel-ruestet-software-aelterer-diesel-fahrzeuge-nach/27170884.html?ticket=ST-7148859-73eitcVGrOAsthWEXb4p-ap1
Hey Folks,
first of all: thank you all, and especially @dankito, very much for readability4j, it's awesome!
I stumbled on the same issue and could pin it down for my test candidates:
https://www.stateright.rs/seeking-consensus.html
The issue at hand is that net.dankito.readability4j.processor.ArticleGrabber does some pruning based on heuristics. What causes my site to crash readability4j, and apparently also @Pranoy1c's and @ktxed's sites, is that
// Remove unlikely candidates
if(options.stripUnlikelyCandidates) {
    if(regEx.isUnlikelyCandidate(matchString) &&
            regEx.okMaybeItsACandidate(matchString) == false &&
            node.tagName() != "body" &&
            node.tagName() != "a") {
        node = this.removeAndGetNext(node, "Removing unlikely candidate")
        continue
    }
}
removes the <html> node because of the attributes set on these sites. I can prevent the crash by pruning the attributes on the doc before handing it over to readability4j, so there is no need to change any readability4j source:
Document doc = Jsoup.connect(warticle.url).get();
removeAttributes(doc.getElementsByTag("html").first());
Readability4J readability4J = new Readability4JExtended(warticle.url, doc);

private static void removeAttributes(Element e) {
    final List<String> attToRemove = new ArrayList<>();
    final Attributes at = e.attributes();
    for (final Attribute a : at) {
        attToRemove.add(a.getKey());
    }
    for (final String att : attToRemove) {
        e.removeAttr(att);
    }
}
I don't know if there is a nicer solution, but it would help a lot if ArticleGrabber wouldn't remove the html node :)
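One way to sketch that suggestion: treat "html" as a protected tag in the unlikely-candidate check, alongside the existing "body" and "a" exemptions. This is a hypothetical change, not the library's actual code; shouldRemove and protectedTags are illustrative names standing in for the real condition shown above:

```kotlin
// Hypothetical, simplified form of the unlikely-candidate condition:
// tags that must never be pruned, extended with "html".
val protectedTags = setOf("body", "a", "html")

fun shouldRemove(tagName: String, looksUnlikely: Boolean, maybeCandidate: Boolean): Boolean =
    looksUnlikely && !maybeCandidate && tagName !in protectedTags

fun main() {
    println(shouldRemove("html", looksUnlikely = true, maybeCandidate = false)) // false: <html> survives
    println(shouldRemove("div", looksUnlikely = true, maybeCandidate = false))  // true: still pruned
}
```

With the <html> element guaranteed to survive pruning, doc.selectFirst("html") in getTextDirection can no longer return null, so the NPE would not be reachable in the first place.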
Sorry, I could only verify https://leratofoods.substack.com, which works with my proposed fix.
Not https://www.handelsblatt.com/dpa/wirtschaft-handel-und-finanzen-stickoxid-ausstoss-opel-ruestet-software-aelterer-diesel-fahrzeuge-nach/27170884.html, as it's gone (status 410).
Looks like Opel wanted to avoid bad publicity ;)
But this one also worked with the fix:
https://www.handelsblatt.com/politik/international/steuern-angst-vor-der-vermoegensteuer-der-neue-run-auf-stiftungen-in-liechtenstein/27377122.html
@dankito Thanks for your effort and this amazing library. I'm getting the same exception for a few URLs. e.g.
https://www.theinvestorspodcast.com/billionaire-book-club-executive-summary/money-master-the-game/
Do you have any estimate when we could have the fix?
@zjamshidi can you push to Maven Central? Ah, never mind, I realized you pushed the fix to master in your fork :)
I fixed it with version 1.0.7 by filtering out null values (but best use version 1.0.8 now).