MER-C/wiki-java

URL fragments and title normalization

PeterBowman opened this issue · 1 comment

Currently, Wiki.normalize looks like this:

/**
 *  Convenience method for normalizing MediaWiki titles. (Converts all
 *  underscores to spaces, localizes namespace names, fixes case of first
 *  char and does some other unicode fixes).
 *  @param s the string to normalize
 *  @return the normalized string
 *  @throws IllegalArgumentException if the title is invalid
 *  @throws UncheckedIOException if the namespace cache has not been
 *  populated, and a network error occurs when populating it
 *  @since 0.27
 */
public String normalize(String s)
{
    s = s.replace('_', ' ').trim();
    // remove leading colon
    if (s.startsWith(":"))
        s = s.substring(1);
    if (s.isEmpty())
        throw new IllegalArgumentException("Empty or whitespace only title.");
    int ns = namespace(s);
    // localize namespace names
    if (ns != MAIN_NAMESPACE)
    {
        int colon = s.indexOf(':');
        s = namespaceIdentifier(ns) + s.substring(colon);
    }
    char[] temp = s.toCharArray();
    if (wgCapitalLinks)
    {
        // convert first character in the actual title to upper case
        if (ns == MAIN_NAMESPACE)
            temp[0] = Character.toUpperCase(temp[0]);
        else
        {
            int index = namespaceIdentifier(ns).length() + 1; // + 1 for colon
            temp[index] = Character.toUpperCase(temp[index]);
        }
    }
    for (int i = 0; i < temp.length; i++)
    {
        switch (temp[i])
        {
            // illegal characters
            case '{':
            case '}':
            case '<':
            case '>':
            case '[':
            case ']':
            case '|':
                throw new IllegalArgumentException(s + " is an illegal title");
        }
    }
    // https://mediawiki.org/wiki/Unicode_normalization_considerations
    String temp2 = new String(temp).replaceAll("\\s+", " ");
    return Normalizer.normalize(temp2, Normalizer.Form.NFC);
}
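As an aside, the final NFC step matters because the same visible title can arrive as different code-point sequences. A minimal illustration of what java.text.Normalizer does here (standalone demo class, not part of Wiki.java):

```java
import java.text.Normalizer;

public class NfcDemo
{
    public static void main(String[] args)
    {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points
        String decomposed = "e\u0301";
        // NFC composes them into the single code point U+00E9 ("é")
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals("\u00e9")); // true
        System.out.println(composed.length());         // 1
    }
}
```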

Since MW API requests strip URL fragments during title normalization (example), I believe Wiki.java should likewise delete everything from the # character onwards. The current implementation has some odd side effects:

Wiki wiki = Wiki.createInstance("pl.wiktionary.org");
Map<String, Object>[] map = wiki.getPageInfo(new String[] {"rescate", "rescate#es"});
Stream.of(map).forEach(System.out::println);

Output:

{size=3452, lastpurged=2018-08-28T11:28:17Z, exists=true, protection={cascade=false}, pageid=353684, displaytitle=rescate, lastrevid=5974053, inputpagename=rescate, pagename=rescate, timestamp=2018-09-09T12:20:42.395+02:00}
null

Note that Wiki.getPageInfo has recently gained the ability to record the page title originally passed in. Therefore, I'd expect both lines to display the same information except for the inputpagename property.

Other Wiki.java methods might be affected by this shortcoming, too.

By the way: although rare, and although it doesn't cope well with wikilinks, it is perfectly valid for a page to have a section named {whatever}, i.e. a section title containing characters that Wiki.normalize considers illegal. Currently, queries performed through this framework may throw exceptions (e.g. pass the title somepage#{whatever} to Wiki.getPageInfo). These could be avoided if Wiki.normalize were able to strip the URL fragment.
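A possible fix, sketched outside the library (the helper name is mine, not an existing Wiki.java method): discard the fragment before any other normalization step, mirroring what the API does server-side. A title that is only a fragment would then reduce to the empty string and hit the existing "Empty or whitespace only title" check, which seems like reasonable behavior.

```java
public class FragmentStrip
{
    // Hypothetical helper: drop everything from the first '#' onwards,
    // as the MediaWiki API does when normalizing titles.
    static String stripFragment(String title)
    {
        int hash = title.indexOf('#');
        return hash == -1 ? title : title.substring(0, hash);
    }

    public static void main(String[] args)
    {
        System.out.println(stripFragment("rescate#es"));          // rescate
        System.out.println(stripFragment("somepage#{whatever}")); // somepage
        System.out.println(stripFragment("rescate"));             // rescate
    }
}
```

With this in place, both rescate and rescate#es would normalize to the same title, so the getPageInfo example above would return identical records apart from inputpagename, and somepage#{whatever} would no longer trip the illegal-character check.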