MER-C/wiki-java

URL fragments and title normalization

PeterBowman opened this issue · 1 comment

Currently, Wiki.normalize looks like this:

/**
 *  Convenience method for normalizing MediaWiki titles. (Converts all
 *  underscores to spaces, localizes namespace names, fixes case of first
 *  char and does some other unicode fixes).
 *  @param s the string to normalize
 *  @return the normalized string
 *  @throws IllegalArgumentException if the title is invalid
 *  @throws UncheckedIOException if the namespace cache has not been
 *  populated, and a network error occurs when populating it
 *  @since 0.27
 */
public String normalize(String s)
{
    s = s.replace('_', ' ').trim();
    // remove leading colon
    if (s.startsWith(":"))
        s = s.substring(1);
    if (s.isEmpty())
        throw new IllegalArgumentException("Empty or whitespace only title.");
    int ns = namespace(s);
    // localize namespace names
    if (ns != MAIN_NAMESPACE)
    {
        int colon = s.indexOf(':');
        s = namespaceIdentifier(ns) + s.substring(colon);
    }
    char[] temp = s.toCharArray();
    if (wgCapitalLinks)
    {
        // convert first character in the actual title to upper case
        if (ns == MAIN_NAMESPACE)
            temp[0] = Character.toUpperCase(temp[0]);
        else
        {
            int index = namespaceIdentifier(ns).length() + 1; // + 1 for colon
            temp[index] = Character.toUpperCase(temp[index]);
        }
    }
    for (int i = 0; i < temp.length; i++)
    {
        switch (temp[i])
        {
            // illegal characters
            case '{':
            case '}':
            case '<':
            case '>':
            case '[':
            case ']':
            case '|':
                throw new IllegalArgumentException(s + " is an illegal title");
        }
    }
    // https://mediawiki.org/wiki/Unicode_normalization_considerations
    String temp2 = new String(temp).replaceAll("\\s+", " ");
    return Normalizer.normalize(temp2, Normalizer.Form.NFC);
}
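As an aside, the final NFC step matters because the same visible title can arrive as different code-point sequences. A minimal illustration of what java.text.Normalizer does here (standalone demo class, not part of Wiki.java):

```java
import java.text.Normalizer;

public class NfcDemo
{
    public static void main(String[] args)
    {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points
        String decomposed = "e\u0301";
        // NFC composes them into the single code point U+00E9 ("é")
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals("\u00e9")); // true
        System.out.println(composed.length());         // 1
    }
}
```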

Since MW API requests strip URL fragments during title normalization (example), I believe Wiki.java should likewise delete everything from the # character onwards. The current implementation has some odd side effects:

Wiki wiki = Wiki.createInstance("pl.wiktionary.org");
Map<String, Object>[] map = wiki.getPageInfo(new String[] {"rescate", "rescate#es"});
Stream.of(map).forEach(System.out::println);

Output:

{size=3452, lastpurged=2018-08-28T11:28:17Z, exists=true, protection={cascade=false}, pageid=353684, displaytitle=rescate, lastrevid=5974053, inputpagename=rescate, pagename=rescate, timestamp=2018-09-09T12:20:42.395+02:00}
null

Note that Wiki.getPageInfo has recently gained the ability to record the page title originally passed in. Therefore, I'd expect both lines to display the same information except for the inputpagename property.

Other Wiki.java methods might be affected by this shortcoming, too.

By the way: although rare, and although it doesn't cope well with wikilinks, it is perfectly valid for a page to have a section named {whatever}, i.e. a section title containing characters that Wiki.normalize considers illegal. Currently, queries performed through this framework may throw exceptions (e.g. pass the title somepage#{whatever} to Wiki.getPageInfo). These could be avoided if Wiki.normalize were able to strip the URL fragment.
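A possible fix, sketched outside the library (the helper name is mine, not an existing Wiki.java method): discard the fragment before any other normalization step, mirroring what the API does server-side. A title that is only a fragment would then reduce to the empty string and hit the existing "Empty or whitespace only title" check, which seems like reasonable behavior.

```java
public class FragmentStrip
{
    // Hypothetical helper: drop everything from the first '#' onwards,
    // as the MediaWiki API does when normalizing titles.
    static String stripFragment(String title)
    {
        int hash = title.indexOf('#');
        return hash == -1 ? title : title.substring(0, hash);
    }

    public static void main(String[] args)
    {
        System.out.println(stripFragment("rescate#es"));          // rescate
        System.out.println(stripFragment("somepage#{whatever}")); // somepage
        System.out.println(stripFragment("rescate"));             // rescate
    }
}
```

With this in place, both rescate and rescate#es would normalize to the same title, so the getPageInfo example above would return identical records apart from inputpagename, and somepage#{whatever} would no longer trip the illegal-character check.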