MER-C/wiki-java

Inaccessible gender-aware namespace aliases

PeterBowman opened this issue · 2 comments

On some non-English language projects, a dedicated user namespace prefix alias is assigned to users who select the female gender in their preferences. For instance, on plwiki, users of male or unspecified gender get the default Wikipedysta prefix, whereas female users are identified with Wikipedystka (cf. Benutzer/Benutzerin on German projects, Usuario/Usuaria on Spanish wikis and so on).

Wiki.java automatically falls back to the default/male language-specific prefix upon normalization. This is no different from other normalization cases, e.g. (for plwiki) User->Wikipedysta, wikipedysta->Wikipedysta, Wikipedystka->Wikipedysta. However, MediaWiki honors the gender setting when a user page is queried.

Let's query w:pl:User:Cancre (on-wiki displayed as Wikipedystka:Cancre, female prefix alias) and also w:pl:User:Przykuta (Wikipedysta:Przykuta, male/default prefix) just for comparison (api.php):

<?xml version="1.0"?>
<api batchcomplete="">
  <query>
    <normalized>
      <n from="User:Cancre" to="Wikipedystka:Cancre" />
      <n from="User:Przykuta" to="Wikipedysta:Przykuta" />
    </normalized>
    <pages>
      <page _idx="320152" pageid="320152" ns="2" title="Wikipedystka:Cancre" />
      <page _idx="93794" pageid="93794" ns="2" title="Wikipedysta:Przykuta" />
    </pages>
  </query>
</api>

Wiki.java expects the normalized page name to also fall back to the male/default prefix (Wikipedysta:Cancre). It can't find it in the pages array, though, because of the special treatment of gender aliases in this specific namespace. Example:

var wiki = Wiki.newSession("pl.wikipedia.org");
wiki.getPageInfo(List.of("User:Cancre", "User:Przykuta")).forEach(System.out::println);

Result (first line refers to User:Cancre):

null
{redirect=false, size=550, lastpurged=2018-09-06T04:27:13Z, exists=true, watchers=159, protection={editexpiry=null, move=autoconfirmed, edit=autoconfirmed, cascade=false, moveexpiry=null}, pageid=93794, displaytitle=Wikipedysta:Przykuta, lastrevid=44294744, inputpagename=User:Przykuta, pagename=Wikipedysta:Przykuta, timestamp=2021-04-02T19:03:26.546280+02:00}

Reason: Wiki.java calls normalize() internally and reorders the query results according to the input titles. This normalize() method does not take into account the gender of the user that a user page belongs to. The following pattern can be found in several places, e.g. getPageInfo():

// Reorder. Make a new HashMap so that inputpagename remains unique.
for (int i = 0; i < pages2.size(); i++)
{
    Map<String, Object> tempmap = metamap.get(normalize(pages2.get(i)));
    if (tempmap != null)
    {
        info[i] = new HashMap<>(tempmap);
        info[i].put("inputpagename", pages.get(i));
    }
}
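
The lookup miss in the loop above can be reproduced in isolation. The following is only a sketch: offlineNormalize() is a simplified stand-in written for this illustration (it is not the real Wiki.normalize() and ignores case folding), and the class name and the map of page ids are assumptions that mimic the API response shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class GenderAliasLookupDemo
{
    // Hypothetical stand-in for offline normalization on plwiki: every known
    // user namespace alias collapses to the default prefix "Wikipedysta".
    static String offlineNormalize(String title)
    {
        int colon = title.indexOf(':');
        if (colon < 0)
            return title;
        String prefix = title.substring(0, colon);
        if (List.of("User", "Wikipedysta", "Wikipedystka").contains(prefix))
            return "Wikipedysta" + title.substring(colon);
        return title;
    }

    // Mirrors the reorder loop: each input title is looked up by its
    // offline-normalized form in a map keyed by the titles the server returned.
    static List<Integer> lookup(Map<String, Integer> metamap, List<String> inputs)
    {
        List<Integer> found = new ArrayList<>();
        for (String input : inputs)
            found.add(metamap.get(offlineNormalize(input)));
        return found;
    }

    public static void main(String[] args)
    {
        // Keys are the titles as returned by the server (gender-aware).
        Map<String, Integer> metamap = Map.of(
            "Wikipedystka:Cancre", 320152,
            "Wikipedysta:Przykuta", 93794);
        // "User:Cancre" normalizes to "Wikipedysta:Cancre", which is not a key.
        System.out.println(lookup(metamap, List.of("User:Cancre", "User:Przykuta")));
        // prints [null, 93794]
    }
}
```

The null in the first slot corresponds to the null returned by getPageInfo() for User:Cancre.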

Since getPageInfo() is always called internally by edit(), this bug makes it impossible to edit user pages prefixed with female aliases on gender-aware language wikis.

Possible solution: parse the <normalized> element if present and use that information instead of normalize() to link query results with input titles. I'd implement some sort of resolveNormalizedParser() helper method (analogous to resolveRedirectParser()) for that purpose. The existing normalize() method would then be explicitly documented as serving limited, offline title normalization purposes, noting that it cannot be aware of server-side quirks such as gender aliasing.
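
A minimal sketch of that idea, assuming the XML response format shown earlier. The class and method names (NormalizedTitles, parseNormalized, resolve) are hypothetical, and the JDK DOM parser is used here for brevity, which is not necessarily how Wiki.java parses responses.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NormalizedTitles
{
    // Hypothetical helper: maps each input title to the title the server
    // normalized it to, as listed in the <normalized> element (if any).
    static Map<String, String> parseNormalized(String xml)
    {
        Map<String, String> result = new LinkedHashMap<>();
        try
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList normalized = doc.getElementsByTagName("normalized");
            for (int i = 0; i < normalized.getLength(); i++)
            {
                NodeList entries = ((Element) normalized.item(i)).getElementsByTagName("n");
                for (int j = 0; j < entries.getLength(); j++)
                {
                    Element n = (Element) entries.item(j);
                    result.put(n.getAttribute("from"), n.getAttribute("to"));
                }
            }
        }
        catch (ParserConfigurationException | SAXException | IOException ex)
        {
            throw new IllegalArgumentException("Malformed API response", ex);
        }
        return result;
    }

    // Titles the server did not need to normalize are absent from the map,
    // so they resolve to themselves.
    static String resolve(Map<String, String> normalized, String inputTitle)
    {
        return normalized.getOrDefault(inputTitle, inputTitle);
    }

    public static void main(String[] args)
    {
        String xml = "<?xml version=\"1.0\"?><api batchcomplete=\"\"><query><normalized>"
            + "<n from=\"User:Cancre\" to=\"Wikipedystka:Cancre\" />"
            + "<n from=\"User:Przykuta\" to=\"Wikipedysta:Przykuta\" />"
            + "</normalized></query></api>";
        Map<String, String> normalized = parseNormalized(xml);
        System.out.println(resolve(normalized, "User:Cancre"));
        System.out.println(resolve(normalized, "Wikipedysta:Przykuta"));
    }
}
```

In the reorder loop, a lookup such as metamap.get(resolve(normalized, input)) would then use the server's own mapping as the key, so the gender-aware title is matched without any offline guessing.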

Bonus: solving this would also solve #162.

@MER-C are you OK with this proposal? I'd be happy to work on a patch if so.

MER-C commented

Sounds good.