Wiki-java doesn't handle long URLs
Nirvanchik opened this issue · 6 comments
As the [HTTP spec](http://www.ietf.org/rfc/rfc2616.txt) says:
> A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle
So this happens when you call Wiki.exists(pages) with too many pages in the list.
I have observed this twice: 1) In April this year, with Wiki.exists(). That was on my small wiki project on a free hosting, so I assumed the hoster had limited my account this way, because free accounts are always limited somehow. I had to disable highlimit in Wiki.java, and even that was not enough: when I randomly requested a list of templates with very long titles it failed again, so I decreased slowmax to 40 and finally to 30. I didn't know the exact limit, but at the time I empirically determined its approximate value as 2300 characters.
2) Tonight. I tried to implement getPagesTemplates(String[] titles, int namespace...) and, goddamn, I see this HTTP 414 again. On the real Wikipedia API 👎 :(
I set slowmax to 50 and my getPagesTemplates() works fine.
I think that constructTitleString(String[] titles) must take the maximum URL length into account, and that limit should be a configurable class member.
I don't know Wikipedia's exact limit and asked about it here: https://www.mediawiki.org/wiki/API_talk:Main_page#What_is_maximum_URL_length_I_can_use_with_Wikipedia_API_.3F
They say the limit is 8192 bytes, but this is not guaranteed, since a server admin can change it; it can also vary from server to server. So I guess the variable urlMaxLength should somehow adapt to this, probably by catching HTTP 414 and handling it (a sketch follows below).
But the initial value is 8192. I checked it, and it looks true: requests with more than 82xx bytes fail.
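For illustration, a minimal sketch of the adaptive handling I mean, assuming a urlMaxLength field; the class name and checkUrlLength() are made up for this sketch, not actual Wiki.java members:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class AdaptiveUrlLimit
{
    // Assumed default per the API talk page answer; server admins can change it.
    private int urlMaxLength = 8192;

    /**
     *  Issues a GET and, on HTTP 414, halves the working limit so that the
     *  next batching pass produces shorter URLs. Sketch only.
     */
    public void checkUrlLength(String url) throws IOException
    {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try
        {
            if (conn.getResponseCode() == 414) // Request-URI Too Long
            {
                urlMaxLength = Math.min(urlMaxLength, url.length()) / 2;
                throw new IOException("HTTP 414: urlMaxLength lowered to " + urlMaxLength);
            }
            // ... otherwise read the response body as usual ...
        }
        finally
        {
            conn.disconnect();
        }
    }
}
```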
So one HTTP GET can take ~50-60 Cyrillic titles or ~100 Latin titles.
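To see why (my own illustration, not a measurement from above): each Cyrillic character is 2 UTF-8 bytes and percent-encodes to 6 characters, while most Latin characters stay as 1:

```java
import java.net.URLEncoder;

public class EncodedTitleLength
{
    public static void main(String[] args) throws Exception
    {
        // A Latin title keeps its length; a Cyrillic one grows roughly sixfold.
        System.out.println(URLEncoder.encode("Main Page", "UTF-8").length());           // 9
        System.out.println(URLEncoder.encode("Заглавная страница", "UTF-8").length());  // 103
    }
}
```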
There is another solution to this: using POST. POST makes it faster, because you can POST 500 titles at once.
But POST is not recommended here: https://www.mediawiki.org/wiki/API:Main_page#API_etiquette
So I propose implementing both ways: set GET as the default and add the possibility to switch to POST for those who suffer from performance issues (like me currently).
We could allow enabling POST either for only the next operation or globally for all operations, as sketched below.
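A sketch of what that switch could look like; setUsePost(), setUsePostOnce() and shouldPost() are names I made up for illustration, not existing Wiki.java API:

```java
public class PostSwitchSketch
{
    private boolean usePost = false;     // global setting, GET by default
    private boolean usePostOnce = false; // applies to the next operation only

    /** Switches all subsequent operations to POST (or back to GET). */
    public void setUsePost(boolean post)
    {
        usePost = post;
    }

    /** Switches only the next operation to POST. */
    public void setUsePostOnce()
    {
        usePostOnce = true;
    }

    /** Called by the request plumbing to pick the HTTP method. */
    protected boolean shouldPost()
    {
        boolean post = usePost || usePostOnce;
        usePostOnce = false; // the one-shot flag resets after use
        return post;
    }
}
```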
Two additional methods:

```java
/**
 *  Cuts up a list of titles into batches for prop=X&titles=Y type queries.
 *
 *  @param url the url to which the titles will be appended
 *  @param titles a list of titles
 *  @return the titles ready for insertion into a URL
 *  @throws IOException if a network error occurs
 *  @since 0.31
 */
protected String[] constructTitleString(StringBuilder url, String[] titles) throws IOException
{
    return constructTitleString(url, titles, 0);
}

/**
 *  Cuts up a list of titles into batches for prop=X&titles=Y type queries.
 *  A version with an additional reserve value, for cases when the url can
 *  grow further.
 *
 *  @param url the url to which the titles will be appended
 *  @param titles a list of titles
 *  @param reserve the number of spare bytes by which the url may still grow
 *  @return the titles ready for insertion into a URL
 *  @throws IOException if a network error occurs
 *  @since 0.31
 */
protected String[] constructTitleString(StringBuilder url, String[] titles, int reserve)
    throws IOException
{
    if (titles.length == 0)
        return new String[0];
    List<String> result = new ArrayList<>();
    StringBuilder buffer = new StringBuilder(urlMaxLength);
    buffer.append(encode(titles[0], true));
    int count = 1;
    String sep = encode("|", false);
    for (int i = 1; i < titles.length; i++)
    {
        String next = encode(titles[i], true);
        // Start a new batch if this title would push the URL over the limit
        // or the batch already holds slowmax titles.
        if (url.length() + buffer.length() + sep.length() + next.length() + reserve >
            urlMaxLength || count >= slowmax)
        {
            result.add(buffer.toString());
            buffer.setLength(0); // This is faster than allocating a new one.
            buffer.append(next);
            count = 1;
        }
        else
        {
            buffer.append(sep);
            buffer.append(next);
            count++;
        }
    }
    result.add(buffer.toString()); // Finish him! (the last batch)
    return result.toArray(new String[result.size()]);
}
```
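A hypothetical call site for the batching overload above (`query` and `fetch()` here stand in for whatever URL prefix and request helper the caller already uses):

```java
// Build the constant part of the URL first, then batch the titles so that
// each request stays under urlMaxLength.
StringBuilder url = new StringBuilder(query);
url.append("prop=templates&titles=");
String[] batches = constructTitleString(url, titles, 0);
for (String batch : batches)
{
    String response = fetch(url.toString() + batch, "getPagesTemplates");
    // ... parse this batch's response and merge the results ...
}
```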
Javadoc header for the old method:

```java
/**
 *  Cuts up a list of titles into batches for prop=X&titles=Y type queries.
 *  Ignores the maximum allowed URL length, so it is OK for POST requests but
 *  not recommended for GET requests (may cause an HTTP 414 error if the
 *  titles count is higher than ~100 for Latin titles or ~50 for other
 *  alphabets).
 *
 *  @param titles a list of titles
 *  @return the titles ready for insertion into a URL
 *  @throws IOException if a network error occurs
 *  @since 0.29
 */
protected String[] constructTitleString(String[] titles) throws IOException
```
> So I propose implementing both ways: set GET as the default and add the possibility to switch to POST for those who suffer from performance issues (like me currently).
Sorry, I take my words back. I tried POST and it was slower than GET (though the request count was smaller). I don't know why (no caching?). So I don't use POST; GET is fast enough. Also, when there is a lot of data you hit the 5000-item API limit, and POST gives no help there.