MER-C/wiki-java

Wiki-java doesn't handle long URLs

Nirvanchik opened this issue · 6 comments

As the [HTTP spec](http://www.ietf.org/rfc/rfc2616.txt) says:

> A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer
> than the server can handle

This happens when you call Wiki.exists(pages) with too many pages in the list.
I have observed this twice:
1) In April this year, with Wiki.exists(). That was on my small wiki project on a free hosting, so at first I thought the hoster had limited my account this way, because free accounts are always limited somehow. I had to disable highlimit in Wiki.java, and even that was not enough: when I happened to request a list of templates with very long titles, it failed again, so I decreased slowmax to 40 and finally to 30. I don't know the exact limit, but at the time I measured its approximate value empirically as 2300 characters.
2) Tonight, when I tried to implement getPagesTemplates(String titles[], int namespace...), and goddamn, I got HTTP 414 again, on the real Wikipedia API 👎 :(
I set slowmax to 50 and my getPagesTemplates() works fine.

I think constructTitleString(String[] titles) must take the maximum URL length into account, and that limit should be a configurable class member.
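Something like this, I think (a sketch only: the urlMaxLength field and its setter are my suggestion, not existing Wiki.java members; the 8192-byte default is the value discussed below):

    // Sketch only: a configurable URL length limit as a class member.
    // The name and the default value are suggestions, not existing API.
    private int urlMaxLength = 8192;

    /**
     *  Sets the maximum URL length used when splitting titles into batches
     *  for GET requests.
     *  @param length the new limit in bytes
     */
    public void setUrlMaxLength(int length)
    {
        urlMaxLength = length;
    }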

I don't know Wikipedia's exact limit, so I asked here: https://www.mediawiki.org/wiki/API_talk:Main_page#What_is_maximum_URL_length_I_can_use_with_Wikipedia_API_.3F

They say the limit is 8192 bytes, but this is not guaranteed: a server admin can change it, and it can vary from server to server. So I guess the variable maxUrlLength should somehow adapt to this, probably by catching HTTP 414 and handling it.
Still, 8192 is a good starting value: I checked it, and it looks correct, since requests longer than 82xx bytes fail.
So one HTTP GET can take ~50-60 Cyrillic titles or ~100 Latin titles.
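For instance, the limit could shrink whenever a 414 comes back, roughly like this (a sketch with plain java.net.HttpURLConnection; the urlAccepted helper is hypothetical, not part of Wiki.java):

    // Rough sketch: probe the response code and shrink the assumed limit on
    // HTTP 414, so that the next batching pass produces shorter URLs.
    // (Requires java.net.HttpURLConnection and java.net.URL.)
    protected boolean urlAccepted(String url) throws IOException
    {
        HttpURLConnection connection = (HttpURLConnection)new URL(url).openConnection();
        connection.setRequestMethod("GET");
        if (connection.getResponseCode() == 414)
        {
            // Halve the assumed limit (keeping a sane floor) and signal a retry.
            urlMaxLength = Math.max(1024, urlMaxLength / 2);
            return false;
        }
        return true;
    }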

There is another solution to this: using POST. POST would make it faster, because you can POST 500 titles at once. But POST is not recommended here: https://www.mediawiki.org/wiki/API:Main_page#API_etiquette
So I propose implementing both ways: keep GET as the default and add a possibility to switch to POST for those who suffer from performance issues (like me, currently). We could allow setting POST for only the next operation, or globally for all operations; a sketch of such a switch follows.
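(A sketch only: the usePost flag and makeApiCall are illustrative names, and post() and fetch() stand in for whatever HTTP helpers Wiki.java actually uses.)

    // Sketch of a global GET/POST switch; a per-operation variant could
    // reset the flag after a single request.
    private boolean usePost = false;

    public void setUsePost(boolean usePost)
    {
        this.usePost = usePost;
    }

    protected String makeApiCall(String url, String titleString) throws IOException
    {
        if (usePost)
            // All titles go into the request body, up to the 500-title API limit.
            return post(url, "titles=" + titleString);
        // Titles go into the query string, limited by the maximum URL length.
        return fetch(url + "&titles=" + titleString);
    }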

Two additional methods:

    /**
     *  Cuts up a list of titles into batches for prop=X&amp;titles=Y type queries.
     *
     *  @param url the url to which this list of titles will be appended
     *  @param titles a list of titles
     *  @return the titles ready for insertion into a URL
     *  @throws IOException if a network error occurs
     *  @since 0.31
     */
    protected String[] constructTitleString(StringBuilder url, String[] titles) throws IOException
    {
        return constructTitleString(url, titles, 0);
    }

    /**
     *  Cuts up a list of titles into batches for prop=X&amp;titles=Y type queries.
     *  A version with an additional reserve value (for cases when the url can
     *  still grow).
     *
     *  @param url the url to which this list of titles will be appended
     *  @param titles a list of titles
     *  @param reserve the number of spare bytes by which the url may grow later
     *  @return the titles ready for insertion into a URL
     *  @throws IOException if a network error occurs
     *  @since 0.31
     */
    protected String[] constructTitleString(StringBuilder url, String[] titles, int reserve)
            throws IOException
    {
        if (titles.length == 0)
            return new String[0];
        List<String> result = new ArrayList<>();
        StringBuilder buffer = new StringBuilder(urlMaxLength);
        buffer.append(encode(titles[0], true));
        int count = 1;
        String sep = encode("|", false);
        for (int i = 1; i < titles.length; i++)
        {
            String next = encode(titles[i], true);
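            // Start a new batch if adding this title would push the URL over
            // the length limit or past the per-request title limit.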
            if (url.length() + buffer.length() + sep.length() + next.length() + reserve >
                    urlMaxLength || count >= slowmax)
            {
                result.add(buffer.toString());
                buffer.setLength(0);  // This is faster than allocating a new one.
                buffer.append(next);
                count = 1;
            }
            else
            {
                buffer.append(sep);
                buffer.append(next);
                count++;
            }
        }
        result.add(buffer.toString());  // Add the last batch.
        return result.toArray(new String[result.size()]);
    }
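For illustration, a caller would use these roughly like this (getPagesTemplates is the method I was implementing; the query prefix and the fetch() call are simplified stand-ins for the actual internals):

    // Simplified sketch of a caller iterating over the produced batches.
    StringBuilder url = new StringBuilder(query);
    url.append("prop=templates&tllimit=max&titles=");
    String[] batches = constructTitleString(url, titles, 0);
    for (String batch : batches)
    {
        String response = fetch(url.toString() + batch);
        // ... parse the templates of each page out of the response ...
    }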

Header for old method:

    /**
     *  Cuts up a list of titles into batches for prop=X&amp;titles=Y type queries.
     *  Ignores the maximum allowed URL length.
     *  OK for POST requests.
     *  Not recommended for GET requests (may cause an HTTP 414 error if the
     *  titles count is higher than ~100 for Latin titles or ~50 for other
     *  alphabets).
     *
     *  @param titles a list of titles
     *  @return the titles ready for insertion into a URL
     *  @throws IOException if a network error occurs
     *  @since 0.29
     */
    protected String[] constructTitleString(String[] titles) throws IOException

> So I propose implementing both ways: keep GET as the default and add a possibility to switch to POST for those who suffer from performance issues (like me, currently).

Sorry, I take my words back. I tried POST, and it was slower than GET (though the request count was smaller). I don't know why (no caching?). So I don't use POST; GET is fast enough. Also, when there is a lot of data, you hit the 5000-item API limit, and POST gives no help there.

MER-C commented

Fixed in 253111f. The 5000-byte limit is a guess.