MER-C/wiki-java

Make getTemplates() get templates for an array of titles at once.

Nirvanchik opened this issue · 7 comments

I need a faster way to extract templates for 7300 pages than fetching them one by one.
Also, here they recommend:

Also try to combine things into one request. For example: specify multiple '|'-separated titles in a titles parameter instead of making a new request for each title;

So I'm requesting this feature to speed up my program and to decrease the load on Wikipedia.
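
To illustrate the batching idea in isolation, here is a minimal sketch that joins titles into '|'-separated groups, one group per request. It is not part of Wiki.java: `TitleBatcher`, `batch` and the `batchSize` parameter are hypothetical names chosen for this example, and the sketch does not URL-encode titles or check the URL length, which real code would need to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Wiki.java).
class TitleBatcher
{
    /**
     *  Joins titles into '|'-separated groups of at most batchSize titles.
     *  Each returned string could serve as the value of one titles= parameter.
     */
    static List<String> batch(String[] titles, int batchSize)
    {
        if (batchSize < 1)
            throw new IllegalArgumentException("batchSize must be positive");
        List<String> batches = new ArrayList<>();
        for (int start = 0; start < titles.length; start += batchSize)
        {
            int end = Math.min(start + batchSize, titles.length);
            // join this slice of titles with the separator the API expects
            batches.add(String.join("|", Arrays.copyOfRange(titles, start, end)));
        }
        return batches;
    }
}
```

With a batch size of 2, the titles Rome, Sun and Hercules become two requests instead of three: one for `Rome|Sun` and one for `Hercules`.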

And here is my code:

    /**
     *  Gets the list of templates used on a particular page that are in a
     *  particular namespace(s).
     *
     *  @param title a page
     *  @param ns a list of namespaces to filter by, empty = all namespaces.
     *  @return the list of templates used on that page in that namespace
     *  @throws IOException if a network error occurs
     *  @since 0.16
     */
    public String[] getTemplates(String title, int... ns) throws IOException
    {
        return getPagesTemplates(new String[]{title}, ns)[0];
    }

    /**
     *  Gets the lists of templates used on given pages that are in a
     *  particular namespace(s).
     *
     *  @param titles list of pages
     *  @param ns a list of namespaces to filter by, empty = all namespaces.
     *  @return one array of templates per entry in titles, in the same order
     *  (an empty array if a page uses no templates in those namespaces)
     *  @throws IOException if a network error occurs
     *  @since 0.31
     */
    public String[][] getPagesTemplates(String[] titles, int... ns) throws IOException
    {
        String[][] result = new String[titles.length][];
        Map<String, List<String>> pagesTemplates = new HashMap<>();
        StringBuilder getUrl = new StringBuilder(query);
        StringBuilder request = new StringBuilder();
        request.append("prop=templates&tllimit=max");
        constructNamespaceString(request, "tl", ns);
        request.append("&titles=");
        String[] titleBunches;
        if (usePost)
            titleBunches = constructTitleString(titles);
        else
        {
            getUrl.append(request.toString());
            titleBunches = constructTitleString(getUrl, titles, 100);
        }
        for (String temp : titleBunches)
        {
            String line;
            String tlcontinue = null;
            do
            {
                StringBuilder nextRequest;
                if (usePost)
                    nextRequest = new StringBuilder(request);
                else
                    nextRequest = new StringBuilder(getUrl);
                // the title bunch is always appended; tlcontinue only on follow-up requests
                nextRequest.append(temp);
                if (tlcontinue != null)
                    nextRequest.append("&tlcontinue=").append(tlcontinue);
                if (usePost)
                    line = post(query, nextRequest.toString(), "getPagesTemplates");
                else
                    line = fetch(nextRequest.toString(), "getPagesTemplates");
                tlcontinue = parseAttribute(line, "tlcontinue", 0);
                // XML form of each page in the response:
                // <page _idx="25458" pageid="25458" ns="0" title="Rome">
                // <templates>
                // <tl ns="10" title="Template:About" />
                // ...
                // </templates>
                // </page>
                //
                // Sometimes a page element is self-closing, with no separate "</page>":
                // <page _idx="24007" pageid="24007" ns="0" title="Hercules"/>
                // <page _idx="633" pageid="633" ns="0" title="Sun">
                // <templates>
                // <tl ns="10" title="Template:***"/>
                // indexOf returns -1 when nothing is found, so test with >= 0
                for (int j = line.indexOf("<page "); j >= 0; j = line.indexOf("<page ", ++j))
                {
                    // each item runs from this <page> tag to the next one (or to the end)
                    int nextPage = line.indexOf("<page ", j + 1);
                    String item = (nextPage >= 0) ? line.substring(j, nextPage) : line.substring(j);
                    String parsedtitle = parseAttribute(item, "title", 0);
                    // xml form: <tl ns="10" title="Template:POTD" />
                    List<String> templates = new ArrayList<>(200);
                    if (item.contains("<templates>"))
                        for (int a = item.indexOf("<tl "); a > 0; a = item.indexOf("<tl ", ++a))
                            templates.add(parseAttribute(item, "title", a));
                    if (!templates.isEmpty()) 
                    {
                        List<String> thisPageTemplates = pagesTemplates.get(parsedtitle);
                        if (thisPageTemplates == null)
                            pagesTemplates.put(parsedtitle, templates);
                        else
                            thisPageTemplates.addAll(templates);
                    }
                }
            }
            while (tlcontinue != null);
        }
        for (int i = 0; i < titles.length; i++)
        {
            List<String> templates = pagesTemplates.get(normalize(titles[i]));
            if (templates == null)
                result[i] = new String[0];
            else
                result[i] = templates.toArray(new String[templates.size()]);
        }
        log(Level.INFO, "getPagesTemplates",
                "Successfully retrieved pages templates for " + Arrays.toString(titles));
        return result;
    }

If no one picks up the task, I'll do it later when I'm free of my other heavy duties.

Also, this removes the cap of 50/500 templates, which MER-C justified in these words:

Capped at max number of templates, there's no reason why there should be more than that.

IMPORTANT: The code uses constructTitleString(getUrl, titles, 100), which was introduced in #128, and the new usePost boolean flag (defaults to false).

In my current heavy task this turned 7300 HTTP GETs and 7-20 minutes of execution time into 91 HTTP GETs and 52 seconds.

MER-C commented

Added in aab2aca. I'm not 100% satisfied with the solution -- I want to have a Map that takes entries in titles to the results, but this will do. Sorry for taking so long.
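
For callers who want the Map-shaped result mentioned above right away, one possible shape is sketched below. It only wraps the parallel-array result; `TemplateResults` and `toMap` are hypothetical names for this example and do not exist in Wiki.java, and the template titles in the usage note are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of Wiki.java).
class TemplateResults
{
    /**
     *  Pairs each requested title with its templates. Assumes results[i]
     *  holds the templates for titles[i], as in the array-based method.
     */
    static Map<String, String[]> toMap(String[] titles, String[][] results)
    {
        if (titles.length != results.length)
            throw new IllegalArgumentException("titles and results differ in length");
        // LinkedHashMap keeps the order in which titles were requested
        Map<String, String[]> map = new LinkedHashMap<>();
        for (int i = 0; i < titles.length; i++)
            map.put(titles[i], results[i]);
        return map;
    }
}
```

Given titles `{"Rome", "Hercules"}` and results `{{"Template:About"}, {}}`, looking up "Rome" yields a one-element array and "Hercules" an empty one. Duplicate titles in the input collapse to a single entry, which may or may not be what a Map-based API should do.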

Merci MER-C! Thanks a lot for the solution. I'm happy! Now I'm not afraid to switch to the updated Wiki.java (instead of selectively merging certain patches into my copy).
I'm not sure yet, though. There are two areas where my Wiki.java jumped ahead this year: 1) fast getTemplates, getPageTexts, and maybe something else; 2) accurate handling of the URL length limit (this matters mostly for my wiki site, where the limit is very strict). Let's see.

MER-C, thank you for the "Breaking change" tags in your commit messages! They are very useful.