Wiki.getPageInfo() chokes on HTML entities
PeterBowman opened this issue · 2 comments
This method (and perhaps others, too) builds an internal map of API results in which the keys are page titles as normalized (and encoded) by the MW server. It relies on Wiki.normalize(String) to map each result back to the corresponding requested title, filling the output string array with null wherever that mapping fails.
(See wiki-java/src/org/wikipedia/Wiki.java, lines 1704 to 1714 in 48268ff.)
I found out that titles with HTML entities, when passed to Wiki.getPageInfo(), are correctly encoded in the POST request, then processed by MW to produce an info object, which is finally read by Wiki.makeApiCall(). However, I am getting a null value on line 1708 because Wiki.normalize() does not unescape such entities.
Example:
Wiki wiki = Wiki.createInstance("pl.wiktionary.org");
wiki.version(); // just in case, this is a wgCapitalLinks=false wiki
wiki.getPageInfo(new String[] { "1 000 000 000", "1&nbsp;000&nbsp;000&nbsp;000" });
Output:
{size=27, lastpurged=2017-12-16T20:13:11Z, exists=true, protection={cascade=false}, pageid=129007, displaytitle=1 000 000 000, lastrevid=655176, inputpagename=1 000 000 000, pagename=1 000 000 000, timestamp=2018-09-24T19:02:55.981+02:00}
null
The same happens for &copy; (©).
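
Reduced to a toy model, the failure looks like this. The normalize() below is a hypothetical stand-in, not the real Wiki.normalize(String), but like the real method it leaves HTML entities alone, while the server keys its response by the decoded title:

import java.util.HashMap;
import java.util.Map;

public class NormalizeMismatch
{
    // Hypothetical stand-in for Wiki.normalize(): trims and maps underscores
    // to spaces but, like the real method, does not decode HTML entities.
    static String normalize(String title)
    {
        return title.replace('_', ' ').trim();
    }

    public static void main(String[] args)
    {
        // The MW server decodes "&nbsp;" while normalizing the title, so its
        // response is keyed by the decoded form (plain spaces used here) ...
        Map<String, String> apiResults = new HashMap<>();
        apiResults.put("1 000 000 000", "{exists=true, pageid=129007, ...}");

        // ... while the client-side lookup keeps the raw entity, so the map
        // lookup misses and the corresponding output slot stays null.
        String requested = "1&nbsp;000&nbsp;000&nbsp;000";
        System.out.println(apiResults.get(normalize(requested))); // null
    }
}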
This issue propagates to Wiki.exists(). The following will cause a NullPointerException, which I managed to avoid via PeterBowman@8ccae5a:
wiki.exists(new String[] { "1&nbsp;000&nbsp;000&nbsp;000" });
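
I have not reproduced the exact change from 8ccae5a here, but the crash can be avoided with a null guard along these lines (a sketch of the method only; it assumes getPageInfo() returns one map per requested title, with null for titles it could not match, as described above):

public boolean[] exists(String[] titles) throws IOException
{
    Map<String, Object>[] info = getPageInfo(titles);
    boolean[] result = new boolean[titles.length];
    for (int i = 0; i < titles.length; i++)
        // Treat an unmatched title as non-existent instead of blindly
        // dereferencing the null slot.
        result[i] = info[i] != null && Boolean.TRUE.equals(info[i].get("exists"));
    return result;
}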
Acknowledged. Not sure of the best solution yet without requiring external dependencies.
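
For what it's worth, one dependency-free direction is to decode character references with plain JDK tools before normalizing. A minimal sketch follows; the named-entity table is deliberately tiny (MW decodes the full HTML set, so a complete fix would need a full table or some other approach):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder
{
    // Deliberately incomplete named-entity table; MW accepts the full HTML set.
    private static final Map<String, String> NAMED = new HashMap<>();
    static
    {
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("copy", "\u00A9");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches decimal (&#160;), hexadecimal (&#xA0;) and named (&nbsp;)
    // character references.
    private static final Pattern ENTITY =
        Pattern.compile("&(?:#(\\d+)|#[xX]([0-9A-Fa-f]+)|(\\w+));");

    public static String decodeEntities(String input)
    {
        Matcher m = ENTITY.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find())
        {
            String repl;
            if (m.group(1) != null)
                repl = new String(Character.toChars(Integer.parseInt(m.group(1))));
            else if (m.group(2) != null)
                repl = new String(Character.toChars(Integer.parseInt(m.group(2), 16)));
            else
                repl = NAMED.getOrDefault(m.group(3), m.group(0)); // unknown: keep verbatim
            m.appendReplacement(sb, Matcher.quoteReplacement(repl));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(decodeEntities("1&nbsp;000&nbsp;000&nbsp;000")); // U+00A0 separators
        System.out.println(decodeEntities("&copy;")); // ©
    }
}

Calling something like decodeEntities() inside Wiki.normalize(), before the existing normalization steps, would make the client-side map key agree with the server's for the titles above.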