KevinPayravi/indie-wiki-buddy

URL-encoding of Main Page names breaks redirection

Opened this issue · 3 comments

Breaking out a pull request discussion into its own issue. Originates from here: #624 (comment)

Issue

Indie Wiki Buddy assumes that Main Page names are unencoded, so it URL-encodes them itself.

This currently causes an issue for The Coffin of Andy and Leyley Wiki on Fandom, as its Main Page is titled "The Coffin of Andy & LeyLey Wiki". IWB currently stores this in sitesEN.json as The_Coffin_of_Andy_%26_LeyLey_Wiki, which at runtime IWB seems to double-encode as The_Coffin_of_Andy_%2526_LeyLey_Wiki.
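To illustrate the double-encoding, here is a minimal sketch. It assumes the runtime encoding behaves like the standard `encodeURIComponent`; the actual call inside IWB may differ.

```typescript
// Value as stored in sitesEN.json (already percent-encoded).
const storedTitle = "The_Coffin_of_Andy_%26_LeyLey_Wiki";

// Encoding an already-encoded value turns "%" into "%25".
const doubleEncoded = encodeURIComponent(storedTitle);
// doubleEncoded is "The_Coffin_of_Andy_%2526_LeyLey_Wiki"

// Encoding the unencoded title gives the intended result.
const unencodedTitle = "The_Coffin_of_Andy_&_LeyLey_Wiki";
const encodedOnce = encodeURIComponent(unencodedTitle);
// encodedOnce is "The_Coffin_of_Andy_%26_LeyLey_Wiki"
```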

The same applies to the Main Pages of the Fear & Hunger wikis, but it's less noticeable because both wikis have the same Main Page title ("Fear and Hunger: the Tormentpedia" Wiki). Fortunately, the double-encoding doesn't seem to happen when passing parameters to the search URL.

Solutions

I think there are a few ways to approach this issue:

  • Ensure all Main Page values are non-URL-encoded
  • Ensure all Main Page values are URL-encoded
  • Let either be valid inputs, and IWB detects whether the title is encoded or not

My preference is actually the first option. I think URL-encoding all Main Page titles in the sites JSON files would make the Main Page titles far less readable for languages that aren't written in the Latin alphabet (e.g. Chinese, Japanese, Russian, etc.).

If the sites JSON files are non-URL-encoded, they could also use the true values of Main Page titles (i.e. spaces instead of underscores for MediaWiki titles) and let IWB encode them at runtime. The handling of spaces would need to be platform-dependent, though: MediaWiki and DokuWiki replace spaces with underscores, and DokuWiki also lowercases the whole name; Fextralife replaces spaces with plus signs.
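A rough sketch of what platform-dependent runtime encoding could look like. `WikiPlatform` and `encodeMainPage` are hypothetical names, not existing IWB code, and the rules only cover the space handling described above (DokuWiki in particular may apply more cleanup than this).

```typescript
// Hypothetical platform identifiers, for illustration only.
type WikiPlatform = "mediawiki" | "dokuwiki" | "fextralife";

// Turn an unencoded Main Page title into its URL form.
function encodeMainPage(title: string, platform: WikiPlatform): string {
  switch (platform) {
    case "mediawiki":
      // Spaces become underscores, then the rest is percent-encoded.
      return encodeURIComponent(title.replace(/ /g, "_"));
    case "dokuwiki":
      // Same as MediaWiki, but the whole name is lowercased first.
      return encodeURIComponent(title.toLowerCase().replace(/ /g, "_"));
    case "fextralife":
      // Spaces become plus signs; each word is percent-encoded on its own
      // so the "+" separators are not themselves encoded.
      return title
        .split(" ")
        .map((part) => encodeURIComponent(part))
        .join("+");
  }
}
```

With this, `encodeMainPage("The Coffin of Andy & LeyLey Wiki", "mediawiki")` yields `The_Coffin_of_Andy_%26_LeyLey_Wiki`, which matches the value currently stored in sitesEN.json.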

For the third option: % is an invalid character in MediaWiki page titles, so any MediaWiki Main Page entry including % can be presumed to be URL-encoded. On DokuWiki, % will not appear in URLs, but may appear in the display names of pages (Page Names on DokuWiki). I have no idea whether % is allowed in Fextralife page titles, and if it is, how they handle it in URLs, but I haven't encountered any pages that use it yet (they do support non-ASCII characters in page titles though, in which case they are percent-encoded in URLs).

I'd also like to add that this seems to be causing a problem for the 100% Orange Juice wiki. If IWB sees a search result for the Fandom wiki's main page, or if you visit said main page yourself, it does not do anything in either circumstance, even if you do have it set to do something. It only takes action if you search for or visit another page on that wiki. Is the percent sign in the page name somehow making IWB not realize it's the wiki to block?

I noticed that as well and committed a fix (36b34ec), will be in the next update. The percent sign throws an error when trying to decode.
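For context, a minimal sketch of why a bare percent sign fails to decode and one way to guard against it. This is not the code from the commit above; `safeDecode` is a hypothetical helper, and the example title is based on the Wikipedia page "100% renewable energy".

```typescript
// decodeURIComponent throws a URIError when "%" is not followed by a
// valid escape sequence, e.g. in "100%_renewable_energy".
function safeDecode(value: string): string {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    // Fall back to the original string if it is not valid percent-encoding.
    if (err instanceof URIError) {
      return value;
    }
    throw err;
  }
}
```

Here `safeDecode("100%_renewable_energy")` returns the input unchanged instead of throwing, while a properly encoded value such as `The_Coffin_of_Andy_%26_LeyLey_Wiki` is still decoded.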

Oh, I apologize. I'm not sure where I got the idea that "% is an invalid character in MediaWiki page titles". That's definitely not the case, as evidenced by the 100% Orange Juice wiki. There are also plenty of pages on Wikipedia with percent signs in their titles (e.g. 100% renewable energy).

The actual restriction is stated on this page:

A title can normally contain the character %. However it cannot contain % followed by two hexadecimal digits (which would cause it to be converted to a single character, by percent-encoding). Similarly a title cannot contain HTML character entities such as &#47; and &ndash;, even if the character they represent is allowed. In the unlikely event of such sequences appearing in a desired title, an alternative title must be constructed (for example by inserting a space after the %, or omitting a semicolon).

So it would still be possible to detect if a page title is percent-encoded by checking if it includes percent signs followed by two hexadecimal digits.
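A minimal sketch of that check, assuming MediaWiki titles only. `looksPercentEncoded` and `toUnencodedTitle` are hypothetical names, not existing IWB functions.

```typescript
// A "%" followed by two hexadecimal digits cannot occur in a MediaWiki
// title, so its presence suggests the value is percent-encoded.
const PERCENT_ESCAPE = /%[0-9A-Fa-f]{2}/;

function looksPercentEncoded(title: string): boolean {
  return PERCENT_ESCAPE.test(title);
}

// Accept either form and return the unencoded title.
function toUnencodedTitle(title: string): string {
  if (!looksPercentEncoded(title)) {
    return title;
  }
  try {
    return decodeURIComponent(title);
  } catch {
    // The escape was syntactically valid but not decodable as UTF-8
    // (e.g. "%FF"), so leave the value as it was.
    return title;
  }
}
```

Under this rule, `100%_renewable_energy` is treated as unencoded and left alone, while `The_Coffin_of_Andy_%26_LeyLey_Wiki` is detected as encoded and decoded to `The_Coffin_of_Andy_&_LeyLey_Wiki`.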