commonsmachinery/commonshasher

Exclude redirects

Closed this issue · 3 comments

Turns out that the current hasher includes "Redirect" pages as individual files. For instance, this one:

https://commons.wikimedia.org/wiki/File:Tour_through_white_sands_national_monument_New_Mexico_.jpeg

is a page which redirects to:

https://commons.wikimedia.org/wiki/File:Tour_through_white_sands_national_monument_New_Mexico.jpeg

But in the hasherdata database, it's included twice, resulting in a duplicate work.

If we call the API with prop=imageinfo|info instead of just prop=imageinfo, we get some additional details. URLs which redirect somewhere else return this property:

{ "query": { "pages": { "<pageid>": {
"redirect": ""
}}}}

URLs that do not redirect do not include this property at all. It seems that it would be easy to add this to the code so that it skips over redirected URLs. But we should see if we can find a quicker way to do this for the 22M works we already have!
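A minimal sketch of that check, assuming the API response has already been fetched and decoded into a dict. The function name and the sample page IDs are made up for illustration; only the presence of the "redirect" key comes from the response shown above.

```python
def redirect_titles(response):
    """Return the titles of pages that carry a "redirect" property.

    `response` is the decoded JSON from a query made with
    prop=imageinfo|info. Non-redirect pages lack the key entirely,
    so we test for presence rather than for a truthy value
    (the value itself is an empty string).
    """
    pages = response.get("query", {}).get("pages", {})
    return [page.get("title") for page in pages.values() if "redirect" in page]


# Hypothetical sample: the page IDs "101" and "102" are invented.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "101": {
                "title": "File:Tour_through_white_sands_national_monument_New_Mexico_.jpeg",
                "redirect": "",
            },
            "102": {
                "title": "File:Tour_through_white_sands_national_monument_New_Mexico.jpeg",
            },
        }
    }
}

if __name__ == "__main__":
    for title in redirect_titles(SAMPLE_RESPONSE):
        print("skip redirect:", title)
```

The hasher could call something like this before storing a work and skip any title it returns.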

It seems we could do this from the database dump fairly easily. Here's an example of a page record for a redirect page:

<page>
<title>File:Noia 64 apps mouse.png</title>
<ns>6</ns>
<id>189642</id>
<redirect title="File:Noia 64 devices mouse.png" />
<revision>
...
</revision>
</page>

So essentially any <title> whose <page> has a <redirect> could be removed from the database.
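Roughly, the dump scan could look like the sketch below. It is not the actual validate_commons.py; the function names are invented, and the sample wraps the record above in a bare <mediawiki> root together with one made-up non-redirect page. Real dumps declare an XML namespace on the root element, so the tag names are compared with the namespace stripped.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def redirect_titles_from_dump(stream):
    """Yield the <title> of every <page> that has a <redirect> child.

    `stream` is a binary file-like object holding the XML dump.
    iterparse lets us walk the file incrementally instead of
    loading the whole dump into memory at once.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        is_redirect = False
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "redirect":
                is_redirect = True
        if is_redirect and title is not None:
            yield title
        # Drop the page's children once inspected; the emptied element
        # itself stays attached to the root.
        elem.clear()


# Sample input: first page is the record from this issue, the second
# page is a hypothetical non-redirect added for contrast.
SAMPLE_DUMP = b"""<mediawiki>
<page>
<title>File:Noia 64 apps mouse.png</title>
<ns>6</ns>
<id>189642</id>
<redirect title="File:Noia 64 devices mouse.png" />
<revision></revision>
</page>
<page>
<title>File:Hypothetical example.png</title>
<ns>6</ns>
<id>1</id>
<revision></revision>
</page>
</mediawiki>"""

if __name__ == "__main__":
    for title in redirect_titles_from_dump(io.BytesIO(SAMPLE_DUMP)):
        print(title)
```

The yielded titles could then be matched against the hasherdata database to delete the duplicate entries.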

Working on this in db01:~hasher/commonshasher/validate_commons.py, will let it run through the entire database dump to get a feel for how many works we're talking about here. Doesn't seem to be too many, around 220 if the initial 1M is representative, so fewer than I feared.