MER-C/wiki-java

Wiki.getPageInfo() chokes on HTML entities

PeterBowman opened this issue · 2 comments

This method (and perhaps others, too) builds an internal map of API results whose keys are page titles as normalized by the MW server. It relies on Wiki.normalize(String) to match each result to a requested title, and fills the corresponding slot of the output array with null whenever that lookup fails.

Map<String, Object>[] info = new HashMap[pages.length];
// Reorder. Make a new HashMap so that inputpagename remains unique.
for (int i = 0; i < pages2.length; i++)
{
    Map<String, Object> tempmap = metamap.get(normalize(pages2[i]));
    if (tempmap != null)
    {
        info[i] = new HashMap<>(tempmap);
        info[i].put("inputpagename", pages[i]);
    }
}
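
To make the failure mode concrete, here is a simplified, self-contained sketch; the class and the literal keys below are illustrative assumptions, not the library's internals:

import java.util.HashMap;
import java.util.Map;

public class KeyMismatchDemo
{
    public static void main(String[] args)
    {
        // Key as normalized by the MediaWiki server: the &nbsp; references
        // are resolved and, per MediaWiki title normalization, mapped to
        // ordinary spaces.
        Map<String, Map<String, Object>> metamap = new HashMap<>();
        metamap.put("1 000 000 000", new HashMap<>());

        // Key as produced by a local normalize() that leaves entities alone.
        String localKey = "1&nbsp;000&nbsp;000&nbsp;000";

        // The lookup misses, so the corresponding output slot stays null.
        System.out.println(metamap.get(localKey)); // prints "null"
    }
}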

I found out that titles containing HTML entities, when passed to Wiki.getPageInfo(), are correctly encoded in the POST request and processed by MW into an info object, which is then read by Wiki.makeApiCall(). However, I get a null value on line 1708 because Wiki.normalize() does not resolve such entities, so the key it produces never matches the title as normalized by the server.

Example:

Wiki wiki = Wiki.createInstance("pl.wiktionary.org");
wiki.version(); // just in case, this is a wgCapitalLinks=false wiki
wiki.getPageInfo(new String[] { "1 000 000 000", "1&nbsp;000&nbsp;000&nbsp;000" });

Output:

{size=27, lastpurged=2017-12-16T20:13:11Z, exists=true, protection={cascade=false}, pageid=129007, displaytitle=1000000000, lastrevid=655176, inputpagename=1000000000, pagename=1000000000, timestamp=2018-09-24T19:02:55.981+02:00}
null

The same happens for &copy; (©).

This issue propagates to Wiki.exists(). The following call causes a NullPointerException, which I managed to avoid via PeterBowman@8ccae5a:

wiki.exists(new String[] { "1&nbsp;000&nbsp;000&nbsp;000" });
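
For reference, a null-safe guard of the following shape avoids the crash; existsDefensive is a hypothetical helper sketched against the "exists" key seen in the output above, not necessarily the change made in PeterBowman@8ccae5a:

import java.util.Map;

public class ExistsGuard
{
    // Hypothetical guard: treat a missing (null) info entry as "page does
    // not exist" instead of dereferencing it and throwing.
    public static boolean[] existsDefensive(Map<String, Object>[] info)
    {
        boolean[] result = new boolean[info.length];
        for (int i = 0; i < info.length; i++)
            result[i] = info[i] != null && Boolean.TRUE.equals(info[i].get("exists"));
        return result;
    }
}
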
MER-C commented

Acknowledged. Not sure yet what the best solution is that does not require external dependencies.
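
One dependency-free direction, sketched here only as a possibility and using nothing outside java.util and java.util.regex, would be to resolve a small set of character references before the existing normalization (the class name and the entity table are assumptions):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder
{
    // Hand-rolled table of a few named references; extend as needed.
    private static final Map<String, String> NAMED = new HashMap<>();
    static
    {
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("copy", "\u00A9");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    private static final Pattern ENTITY =
        Pattern.compile("&(#\\d+|#[xX][0-9a-fA-F]+|[a-zA-Z]+);");

    // Resolves numeric character references and the named ones listed above;
    // anything unrecognized is left untouched.
    public static String decode(String title)
    {
        Matcher m = ENTITY.matcher(title);
        StringBuffer sb = new StringBuffer();
        while (m.find())
        {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X"))
                replacement = new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            else if (body.startsWith("#"))
                replacement = new String(Character.toChars(Integer.parseInt(body.substring(1))));
            else
                replacement = NAMED.getOrDefault(body, m.group());
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}

Note that MediaWiki's own title normalization additionally maps some whitespace (including the U+00A0 produced by &nbsp;) to plain spaces, so a real fix would have to mirror that step as well rather than stop at entity decoding.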