Make getPageText() get page texts for an array of titles at once.
Nirvanchik opened this issue · 5 comments
This is a ticket similar to #131.
The motivation is the same: merge thousands of fetches into hundreds or tens of them. This is good for the performance of both the client software and the Wikipedia server.
The work is ongoing.
It will look something like this:
/**
 *  Gets the raw wikicode for a list of pages. WARNING: does not support special
 *  pages. Check [[User talk:MER-C/Wiki.java#Special page equivalents]]
 *  for fetching the contents of special pages.
 *
 *  @param titles array of titles of the pages
 *  @return array with the raw wikicode of the pages; pages which do not exist
 *  give <code>null</code> as a result
 *  @throws UnsupportedOperationException if you try to retrieve the text of a
 *  Special: or Media: page
 *  @throws IOException if a network error occurs
 *  @since 0.31
 */
public String[] getPagesTexts(String... titles) throws IOException
{
    // pitfall check
    for (String title : titles)
        if (namespace(title) < 0)
            throw new UnsupportedOperationException(
                "Cannot retrieve Special: or Media: pages!");
    // entries stay null for pages that are missing or cannot be parsed
    String[] result = new String[titles.length];
    StringBuilder getUrl = new StringBuilder(query);
    StringBuilder request = new StringBuilder();
    request.append("prop=revisions");
    request.append("&rvprop=content");
    request.append("&titles=");
    String[] titleBunches;
    if (usePost)
        titleBunches = constructTitleString(titles);
    else
    {
        getUrl.append(request.toString());
        titleBunches = constructTitleString(getUrl, titles, 100);
    }
    for (String temp : titleBunches)
    {
        String line;
        String rvcontinue = null;
        do
        {
            StringBuilder nextRequest;
            if (usePost)
                nextRequest = new StringBuilder(request);
            else
                nextRequest = new StringBuilder(getUrl);
            nextRequest.append(temp);
            if (rvcontinue != null)
                nextRequest.append("&rvcontinue=").append(rvcontinue);
            if (usePost)
                line = post(query, nextRequest.toString(), "getPagesTexts");
            else
                line = fetch(nextRequest.toString(), "getPagesTexts");
            rvcontinue = parseAttribute(line, "rvcontinue", 0);
            // Typically the response looks like:
            // ...
            // <normalized>
            //   <n from="ghghghgghg" to="Ghghghgghg" />
            // </normalized>
            // <pages>
            //   <page _idx="-1" ns="0" title="Ghghghgghg" missing="" />
            //   <page _idx="25458" pageid="25458" ns="0" title="Rome">
            //     <revisions>
            //       <rev contentformat="text/x-wiki" contentmodel="wikitext"
            //        xml:space="preserve">{{about|the city in Italy|the
            //        ...
            //        [[Category:World Heritage Sites in Italy]]</rev>
            //     </revisions>
            //   </page>
            //   <page _idx="26751" pageid="26751" ns="0" title="Sun">
            //     <revisions>
            //       <rev contentformat
            for (int j = line.indexOf("<page _idx="); j > 0;
                j = line.indexOf("<page _idx=", ++j))
            {
                int next = line.indexOf("<page _idx=", j + 1);
                if (next < 0)
                    next = line.length();
                String item = line.substring(j, next);
                String pageId = parseAttribute(item, "_idx", 0);
                // Page not found (or doesn't exist, or was deleted, etc.)
                if (pageId == null || pageId.equals("-1"))
                    continue;
                String parsedtitle = parseAttribute(item, "title", 0);
                if (!item.contains("<revisions>"))
                    continue; // no revision content (possibly an empty <revisions/> tag)
                item = item.substring(item.indexOf("<revisions>"));
                item = item.substring(0, item.lastIndexOf("</revisions>"));
                // extract the text between the opening <rev ...> tag and </rev>
                int begin = item.indexOf("<rev ");
                begin = item.indexOf(">", begin + 1) + 1;
                int end = item.indexOf("</rev>");
                item = item.substring(begin, end);
                for (int i = 0; i < titles.length; i++)
                    if (parsedtitle.equals(normalize(titles[i])))
                        result[i] = decode(item);
            }
        }
        while (rvcontinue != null);
    }
    log(Level.INFO, "getPagesTexts",
        "Successfully retrieved pages texts for " + Arrays.toString(titles));
    return result;
}
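For illustration, a call to the proposed method could look roughly like the sketch below. This is only a sketch: it assumes the 0.31-era Wiki constructor, and the example titles are arbitrary.

// Illustrative usage only; "en.wikipedia.org" and the titles are arbitrary examples.
Wiki wiki = new Wiki("en.wikipedia.org");
String[] texts = wiki.getPagesTexts("Rome", "Sun", "Some page that does not exist");
for (int i = 0; i < texts.length; i++)
{
    if (texts[i] == null)
        System.out.println("title #" + i + " is missing");
    else
        System.out.println("title #" + i + ": " + texts[i].length() + " characters of wikicode");
}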
I did some related work in my forked copy of the Wiki class; see the methods from getContentOfPages to parseContentLine here.
The getContentOf(Pages|PageIds|RevIds) methods fetch max = 500 pages at once (unless the user specifies a different limit), and automatically fall back to 50 whenever an HTTP 414 error is caught. getContentOf(Categorymembers|Transclusions|Backlinks) do exactly what their names say, but these make use of generators.
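For readers who have not looked at that fork, here is a minimal sketch of the "fall back to 50 on HTTP 414" idea. BatchFetcher and fetchAll are invented names for illustration only, not the fork's actual API.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the fallback behaviour described above.
public class BatchFallbackSketch
{
    interface BatchFetcher
    {
        // Performs one API request for a batch of titles. As a stand-in for real
        // status handling, it throws an IOException whose message contains "414"
        // when the request line is too long.
        String fetch(List<String> batch) throws IOException;
    }

    // Fetches titles in batches of `max` (500 by default in the fork, per the
    // comment above); on HTTP 414, retries the same batch split into groups of 50.
    static List<String> fetchAll(List<String> titles, int max, BatchFetcher fetcher)
            throws IOException
    {
        List<String> responses = new ArrayList<>();
        for (int i = 0; i < titles.size(); i += max)
        {
            List<String> batch = titles.subList(i, Math.min(i + max, titles.size()));
            try
            {
                responses.add(fetcher.fetch(batch));
            }
            catch (IOException e)
            {
                boolean tooLong = e.getMessage() != null && e.getMessage().contains("414");
                if (tooLong && max > 50)
                    responses.addAll(fetchAll(batch, 50, fetcher));  // the fallback
                else
                    throw e;
            }
        }
        return responses;
    }
}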
PeterBowman
Why don't you make a PR (pull request) here, to MER-C's code, rather than keeping it in your fork?
Oh, I see, because you needed it in your code long ago.
Actually, my code relies on my version of constructTitleString(), which splits the titles into pieces and considers not only the 50/500 limit implemented by MER-C but also a maximum URL length limit implemented by me, which can be customized.
Wikipedia's limit is 8192 bytes, while the free hosting I use for my small wiki site allows only about 2250 bytes, so I added a setUrlMaxLength method to the Wiki.java in my project.
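As an illustration of that idea (not the actual code of either fork), a title-splitting helper that honours both a count limit and a byte limit might look like this; the name constructTitleStrings and its signature are hypothetical.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split titles into "titles=" chunks, honouring both a
// maximum number of titles per request (50/500) and a maximum length in bytes
// (8192 for Wikipedia; ~2250 on the small hosting mentioned above). The caller
// would subtract the length of the base query from its byte limit.
public class TitleBatchSketch
{
    static List<String> constructTitleStrings(String[] titles, int maxTitles, int maxBytes)
    {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String title : titles)
        {
            // percent-encoded titles are pure ASCII, so length() equals the byte count
            String encoded = URLEncoder.encode(title, StandardCharsets.UTF_8);
            int separator = count == 0 ? 0 : 3;  // "%7C", the encoded pipe
            boolean full = count == maxTitles
                || (count > 0 && current.length() + separator + encoded.length() > maxBytes);
            if (full)
            {
                chunks.add(current.toString());
                current = new StringBuilder();
                count = 0;
            }
            if (count > 0)
                current.append("%7C");
            current.append(encoded);
            count++;
        }
        if (count > 0)
            chunks.add(current.toString());
        return chunks;
    }
}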
So, everything works fine. I'll push all my patches to this Wiki-java later (in September). I hope MER-C approves them. If he doesn't, I'll keep them in my code only, or maybe fork Wiki-java.