URL fragments and title normalization
PeterBowman opened this issue · 1 comments
Currently, Wiki.normalize looks like this:
wiki-java/src/org/wikipedia/Wiki.java, lines 8546 to 8604 in bd1b122
Since MediaWiki API requests strip URL fragments upon title normalization (example), I believe Wiki.java should automatically delete everything from the # character onwards, too. The current implementation has some weird side effects:
Wiki wiki = Wiki.createInstance("pl.wiktionary.org");
Map<String, Object>[] map = wiki.getPageInfo(new String[] {"rescate", "rescate#es"});
Stream.of(map).forEach(System.out::println);
Output:
{size=3452, lastpurged=2018-08-28T11:28:17Z, exists=true, protection={cascade=false}, pageid=353684, displaytitle=rescate, lastrevid=5974053, inputpagename=rescate, pagename=rescate, timestamp=2018-09-09T12:20:42.395+02:00}
null
Recently, Wiki.getPageInfo has gained the ability to record the page title as originally passed in. Therefore, I'd expect both lines to display the same information except for the inputpagename property.
Other Wiki.java methods might be affected by this shortcoming, too.
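To illustrate the expected behavior, here is a minimal sketch of fragment stripping that keeps the title as originally passed in alongside the title that would actually be queried. The class FragmentSketch and the helpers stripFragment and resolveTitles are hypothetical names made up for this example; they are not part of Wiki.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FragmentSketch {
    // Hypothetical helper, not part of Wiki.java:
    // drops the first '#' and everything after it.
    static String stripFragment(String title) {
        int hash = title.indexOf('#');
        return hash < 0 ? title : title.substring(0, hash);
    }

    // Maps each title as passed in (cf. inputpagename) to the
    // fragment-free title that would be sent to the API (cf. pagename).
    static Map<String, String> resolveTitles(String... inputs) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (String input : inputs) {
            resolved.put(input, stripFragment(input));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Prints:
        // rescate -> rescate
        // rescate#es -> rescate
        resolveTitles("rescate", "rescate#es")
            .forEach((input, page) -> System.out.println(input + " -> " + page));
    }
}
```

With something along these lines, both inputs would resolve to the same page and differ only in the recorded input title.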
By the way: although it is rare and does not cope well with wikilinks, it's perfectly fine to have a section named {whatever}, that is, a section title containing characters that Wiki.normalize considers illegal. Currently, queries performed through this framework may throw exceptions (e.g. when passing the title somepage#{whatever} to Wiki.getPageInfo). Those could be avoided if Wiki.normalize were able to strip the URL fragment.