MER-C/wiki-java

Make getPageText() get page texts for an array of titles at once.

Nirvanchik opened this issue · 5 comments

This is a ticket similar to #131.
The motivation is the same: merge thousands of fetches into hundreds or tens of them. That is good for the performance of both the client software and the Wikipedia server.

The work is ongoing.

The new method will look something like this:

    /**
     *  Gets the raw wikicode for a list of pages. WARNING: does not support
     *  special pages. Check [[User talk:MER-C/Wiki.java#Special page equivalents]]
     *  for fetching the contents of special pages.
     *
     *  @param titles the titles of the pages to fetch
     *  @return the raw wikicode of each page, in the same order as the input;
     *  entries for pages that do not exist are <code>null</code>
     *  @throws UnsupportedOperationException if you try to retrieve the text of
     *  a Special: or Media: page
     *  @throws IOException if a network error occurs
     *  @since 0.31
     */
    public String[] getPagesTexts(String... titles) throws IOException
    {
        // pitfall check
        for (String title: titles)
            if (namespace(title) < 0)
                throw new UnsupportedOperationException(
                        "Cannot retrieve Special: or Media: pages!");

        // Array elements default to null; pages that do not exist stay null.
        String[] result = new String[titles.length];
        StringBuilder getUrl = new StringBuilder(query);
        StringBuilder request = new StringBuilder();
        request.append("prop=revisions");
        request.append("&rvprop=content");
        request.append("&titles=");

        String[] titleBunches;
        if (usePost)
            // POST: the title list goes in the request body, so split by count only
            titleBunches = constructTitleString(titles);
        else
        {
            // GET: the title list is part of the URL, so batches must also
            // respect URL length limits
            getUrl.append(request.toString());
            titleBunches = constructTitleString(getUrl, titles, 100);
        }
        for (String temp : titleBunches)
        {
            String line;
            String rvcontinue = null;
            do
            {
                StringBuilder nextRequest;
                if (usePost)
                    nextRequest = new StringBuilder(request);
                else
                    nextRequest = new StringBuilder(getUrl);
                nextRequest.append(temp);
                // rvcontinue is set when the previous response was incomplete
                if (rvcontinue != null)
                    nextRequest.append("&rvcontinue=").append(rvcontinue);
                if (usePost)
                    line = post(query, nextRequest.toString(), "getPagesTexts");
                else
                    line = fetch(nextRequest.toString(), "getPagesTexts");
                rvcontinue = parseAttribute(line, "rvcontinue", 0);
                // Typically this looks like:
                // ...
                // <normalized>
                // <n from="ghghghgghg" to="Ghghghgghg" />
                // </normalized>
                // <pages>
                //  <page _idx="-1" ns="0" title="Ghghghgghg" missing="" />
                //  <page _idx="25458" pageid="25458" ns="0" title="Rome">
                //    <revisions>
                //      <rev contentformat="text/x-wiki" contentmodel="wikitext"
                //             xml:space="preserve">{{about|the city in Italy|the
                // ...
                // [[Category:World Heritage Sites in Italy]]</rev>
                // </revisions>
                // </page>
                // <page _idx="26751" pageid="26751" ns="0" title="Sun">
                //  <revisions>
                //    <rev contentformat
                for (int j = line.indexOf("<page _idx="); j >= 0;
                        j = line.indexOf("<page _idx=", ++j))
                {
                    int next = line.indexOf("<page _idx=", j + 1);
                    if (next < 0)
                        next = line.length();
                    String item = line.substring(j, next);
                    String pageId = parseAttribute(item, "_idx", 0);
                    // Page not found (or doesn't exist, or deleted, etc).
                    if (pageId == null || pageId.equals("-1"))
                        continue;
                    String parsedtitle = parseAttribute(item, "title", 0);
                    if (!item.contains("<revisions>"))
                        continue;  // WTF? May be <revisions/> ?
                    item = item.substring(item.indexOf("<revisions>"));
                    item = item.substring(0, item.lastIndexOf("</revisions>"));
                    int begin = item.indexOf("<rev ");
                    // skip past the '>' that closes the <rev ...> opening tag
                    begin = item.indexOf(">", begin + 1) + 1;
                    int end = item.indexOf("</rev>");
                    item = item.substring(begin, end);
                    for (int i = 0; i < titles.length; i++)
                        if (parsedtitle.equals(normalize(titles[i])))
                            result[i] = decode(item);
                }
            }
            while (rvcontinue != null);
        }
        log(Level.INFO, "getPagesTexts",
                "Successfully retrieved page texts for " + Arrays.toString(titles));
        return result;
    }
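
For illustration, here is a minimal usage sketch. The titles are arbitrary examples, and the constructor call assumes the plain new Wiki(domain) construction this version of the library uses:

    Wiki wiki = new Wiki("en.wikipedia.org");
    String[] texts = wiki.getPagesTexts("Rome", "Sun", "Some missing page");
    for (int i = 0; i < texts.length; i++)
    {
        if (texts[i] == null)
            System.out.println("Page " + i + " does not exist.");
        else
            System.out.println("Page " + i + ": " + texts[i].length() + " characters.");
    }

One network round trip serves all three titles instead of three separate getPageText() calls.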

PeterBowman commented

I did some related work in my forked copy of the Wiki class; see the methods from getContentOfPages to parseContentLine here.

  • The getContentOf(Pages|PageIds|RevIds) methods fetch at most 500 pages at once (unless the user specifies a lower limit) and automatically fall back to batches of 50 whenever an HTTP 414 (Request-URI Too Long) error is caught; see the sketch after this list.
  • The getContentOf(Categorymembers|Transclusions|Backlinks) methods do exactly what their names say, but make use of generators.
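
That 414 fallback is roughly the following pattern (a minimal sketch, not the fork's actual code; fetchBatch is a hypothetical helper, and detecting the 414 from the exception message is an assumption that depends on the HTTP layer):

    // Sketch: fetch page texts optimistically in one large batch and retry with
    // a smaller batch size when the server rejects the request URI as too long.
    // fetchBatch(titles, limit) is a hypothetical helper that splits the titles
    // into batches of at most <limit> and returns the collected results.
    public String[] fetchWithFallback(List<String> titles) throws IOException
    {
        try
        {
            return fetchBatch(titles, 500);    // the normal API maximum
        }
        catch (IOException ex)
        {
            String message = ex.getMessage();
            if (message != null && message.contains("414"))
                return fetchBatch(titles, 50); // HTTP 414: retry in batches of 50
            throw ex;                          // any other error is a real failure
        }
    }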

Nirvanchik commented

Why don't you make a pull request here, to MER-C's code, instead of keeping it in your fork?

Oh, I see: because you needed it in your own code long ago.

Actually, my code relies on my version of constructTitleString(), which splits the titles into batches and takes into account not only the 50/500 limit implemented by MER-C but also a customizable maximum URL length limit that I implemented.

Wikipedia's limit is 8192 bytes, while the free hosting that my small wiki site runs on allows only ~2250 bytes, so I added a setUrlMaxLength() method to the Wiki class in my project.
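
The idea is roughly this (a minimal self-contained sketch, not the actual fork code; constructTitleBatches and its parameters are illustrative names, with the byte cap standing in for the setUrlMaxLength() setting minus the rest of the query string):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: split titles into pipe-separated batches, starting a new batch
    // when either the title count cap or the URL byte cap would be exceeded.
    // The encoded string is pure ASCII, so its length equals its size in bytes.
    public static List<String> constructTitleBatches(String[] titles,
            int maxTitles, int maxBytes) throws UnsupportedEncodingException
    {
        List<String> batches = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String title : titles)
        {
            String encoded = URLEncoder.encode(title, "UTF-8");
            int extra = encoded.length() + (count == 0 ? 0 : 3); // "%7C" = 3 bytes
            if (count > 0 && (count == maxTitles
                    || current.length() + extra > maxBytes))
            {
                batches.add(current.toString());
                current.setLength(0);
                count = 0;
            }
            if (count > 0)
                current.append("%7C"); // URL-encoded '|'
            current.append(encoded);
            count++;
        }
        if (count > 0)
            batches.add(current.toString());
        return batches;
    }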

So, everything works fine. I'll push all my patches to this wiki-java (later in September). I hope MER-C approves them; if he doesn't, I'll keep the changes in my own code, or maybe fork wiki-java.

MER-C commented

Done in e7e983b.

if (!item.contains("<revisions>"))
    continue;  // WTF? May be <revisions/> ?

Please provide a test case for this.