Improve page crawling to get all the available data

Question


Opened this issue 11 years ago · 2 comments

To get accurate compatibility tables, we need to fetch all applicable content.

At the moment we cannot fetch all the page content because of an MDN limitation, described below:

Pages are crawled by tag (currently CSS, HTML, HTML5, API and WebAPI). Tag feeds are limited to 500 results and don't seem to support pagination, so we need a better way to discover pages. Current stats are:

1956 pages for those 5 tags (370 in HTML, 86 in HTML5)

1576 pages left after removing duplicates

846 pages have a compat section
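
The duplicate removal and the 500-result cap can be sketched as follows. This is a minimal illustration, not the importer's actual code: `merge_tag_results`, `possibly_truncated_tags` and the sample data are hypothetical names introduced here. Note that the three tags not broken out above account for 1956 - 370 - 86 = 1500 pages, which would be consistent with each of them hitting the cap.

```python
from typing import Dict, List, Set

FEED_LIMIT = 500  # per-tag result cap reported in this issue


def merge_tag_results(pages_by_tag: Dict[str, List[str]]) -> Set[str]:
    """Union the page URLs collected for each tag, dropping duplicates."""
    unique: Set[str] = set()
    for urls in pages_by_tag.values():
        unique.update(urls)
    return unique


def possibly_truncated_tags(
    pages_by_tag: Dict[str, List[str]], limit: int = FEED_LIMIT
) -> List[str]:
    """Return tags whose result count reached the cap.

    Reaching the cap does not prove pages are missing, but it is the
    only signal available when the feed offers no pagination.
    """
    return [tag for tag, urls in pages_by_tag.items() if len(urls) >= limit]


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = {
        "HTML": ["/docs/a", "/docs/b"],
        "HTML5": ["/docs/b", "/docs/c"],
        "CSS": [f"/docs/css/{i}" for i in range(FEED_LIMIT)],
    }
    print(len(merge_tag_results(sample)), possibly_truncated_tags(sample))
```

Any tag reported by `possibly_truncated_tags` is a candidate for the alternative discovery methods discussed in the comments.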

We need to get all the available content.


Answer 1 · 2014-02-25T12:05:51.000Z

Summarizing from a chat w/ Renoir in #webplatform:
We have a set of pages with specific tags, imported and cached, but there have been changes and added pages since those were collected.
The issues are:

How do we keep the data up to date?
How do we make sure we've found all the pages with browser compatibility info?
...without over-crawling the MDN site.

One option is to read from the MDN tag feed described here:
https://developer.mozilla.org/en-US/docs/Project:MDN/Tools/Feeds

Recently changed articles, in order by modification date. Only articles that have the specified tag are included in the feed.

However, the set of tags that was used for the import may not cover all pages that have browser compatibility tables. So a feed based on tags may not find everything. Also, it's not certain how frequently we'd need to read from the feed to avoid missing updates.

Another option is a web search for the heading used for the compatibility tables, "Browser Compatibility", restricted by some means to recently updated pages. This avoids the tag completeness issue. We could then fetch the pages in the search results.
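
Once a page has been fetched, checking whether it really has a compatibility section could look like the sketch below. It assumes the section is marked by an HTML heading element (h1 to h6) whose text is "Browser compatibility", compared case-insensitively; the actual MDN markup was not checked here, and `CompatHeadingFinder` and `has_compat_section` are hypothetical names.

```python
from html.parser import HTMLParser


class CompatHeadingFinder(HTMLParser):
    """Sets `found` when a heading reads 'Browser compatibility'."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    TARGET = "browser compatibility"  # assumed heading text

    def __init__(self) -> None:
        super().__init__()
        self._in_heading = False
        self._buffer = []
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._in_heading:
            # Collapse whitespace so line breaks inside the heading don't matter.
            text = " ".join("".join(self._buffer).split()).lower()
            if text == self.TARGET:
                self.found = True
            self._in_heading = False


def has_compat_section(html: str) -> bool:
    finder = CompatHeadingFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Matching only on headings avoids counting pages that merely mention the phrase in body text.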

For the web search option, one sub-option is to use Google Custom Search. In normal browser-based Google search, "advanced search" allows these relevant query terms:

exact word or phrase (as_epq)
last updated (as_qdr), with options like during the past week
site (as_sitesearch)
terms appearing... (as_occt), what field in the page the terms appear in, e.g. body

I tried a query with these parameters:

as_epq=Browser+Compatibility
as_qdr=w (past week)
as_sitesearch=developer.mozilla.org
as_occt=body

That yielded about 40 results. Doing such a search (say) twice a week should pick up all changes, and fetching 40 pages per session should not be too much of a hit on MDN.
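
For reference, the query above can be assembled like this. The parameter names and values are the ones listed in this comment; the base URL `https://www.google.com/search` is an assumption (the comment does not give it), and `build_search_url` is a hypothetical helper. This only builds the URL for pasting into a browser; it does not fetch anything.

```python
from urllib.parse import urlencode

# Assumed base URL for the browser-based search form.
SEARCH_BASE = "https://www.google.com/search"

ADVANCED_PARAMS = {
    "as_epq": "Browser Compatibility",         # exact phrase
    "as_qdr": "w",                             # updated during the past week
    "as_sitesearch": "developer.mozilla.org",  # restrict to MDN
    "as_occt": "body",                         # phrase must appear in the page body
}


def build_search_url(params: dict) -> str:
    """Encode the advanced-search parameters into a query string."""
    return SEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url(ADVANCED_PARAMS))
```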

We might also want to limit by tags. A look at MDN page source shows that tags are included as links of the form:
/en-US/docs/tag/
which translates to:
https://developer.mozilla.org/en-US/docs/tag/
There is a Google search operator "link:" that is part of basic search (but does not appear in the advanced search form) that matches pages containing specific links.
https://support.google.com/websearch/answer/136861
It is not clear that the link: operator works. I tried these search terms:
site:developer.mozilla.org link:developer.mozilla.org/en-US/docs/tag/HTML
The snippets in the search results had the word "link" highlighted, which implies it was treated as an ordinary search term rather than as an operator. The resulting search URL did not have the standard &key=value query format used by advanced search, so it's unclear whether the two can be mixed.

The above searches work from a browser. Based on hints in the documentation on how to do queries programmatically, an HTTP GET of that URL may be rejected if it comes from anything other than a browser. The Google Search API (for JavaScript) is deprecated; the old API page suggests using Google Custom Search instead.
https://developers.google.com/custom-search/
https://developers.google.com/custom-search/json-api/v1/overview
To use this, one must set up a Custom Search Engine (CSE) and get an API key to use when making queries. There is a limited free level of service, but we would not come anywhere near the limits. I set up CSEs a while back, but the process has likely changed since then. I haven't yet checked which query options are supported. Site is definitely supported, but the critical one is the last update time limit.
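
A rough sketch of what a Custom Search JSON API request URL might look like. The endpoint and the `key`, `cx`, `q`, `siteSearch` and `dateRestrict` parameter names are taken from the JSON API reference as I understand it and should be checked against the current docs. Whether `dateRestrict` gives the same "last updated" behaviour as `as_qdr` is exactly the open question above and is NOT verified here. The key and engine ID values are placeholders, and `build_cse_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Endpoint per the Custom Search JSON API overview linked above.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(api_key: str, engine_id: str, weeks: int = 1) -> str:
    """Build (but do not send) a query for recently dated MDN compat pages."""
    params = {
        "key": api_key,                         # API key (placeholder value below)
        "cx": engine_id,                        # Custom Search Engine ID (placeholder)
        "q": '"Browser Compatibility"',         # quoted for exact-phrase matching
        "siteSearch": "developer.mozilla.org",  # restrict results to MDN
        "dateRestrict": f"w{weeks}",            # assumed analogue of as_qdr=w
    }
    return CSE_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_cse_url("YOUR_API_KEY", "YOUR_ENGINE_ID"))
```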

Answer 2 · 2014-03-10T04:49:05.000Z

From conversation w/ renoirb, wilmoore, shepazu on IRC:
The main issue is that we may not have gotten all the pages with the requested tags due to the limitation on the tag feed, and that we cannot ask for a "next" batch (i.e. no paging in the tag feed).
shepazu suggests asking the MDN folks if we can just get a dump of the pages. Maybe ask for a dump of the database from which the pages are generated: easier for them, and maybe easier for us. We can look at the schema for the server, https://github.com/mozilla/kuma. So, we'll hold off on this until we get a response. (Omitting chat re. adding crawling to mdn-compat-importer, as it's moot if we get a local copy of the MDN database or pages.)