I wrote this for the Archive Team. Archive Team is a loose collective of rogue archivists whose speciality is rescuing user data on web services before they are shut down. In the run-up to a shutdown, Archive Team needs to find as many websites hosted on a given domain name as possible. Major MediaWiki wikis are one great source of such sites, as most major MediaWiki installs have the LinkSearch extension installed.
With the --output-format=csv and --use-wikis=WIKI options, this may also be useful for Wikipedia editors who need machine-readable lists of links to dead or soon-to-be-dead sites. Have fun.
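For example, something along these lines (the domain is made up) should give you a CSV of links from the English Wikipedia only:

python mwlinkscrape.py --use-wikis=http://en.wikipedia.org/w/index.php --output-format=csv "*.dyingsite.com" > sitelist.csv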
You need Beautiful Soup 3 installed. Pick your favourite of:

sudo apt-get install python-beautifulsoup

or

pip install BeautifulSoup

or

easy_install BeautifulSoup
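If you are not sure whether you already have it, this should exit quietly when Beautiful Soup 3 is installed:

python -c "from BeautifulSoup import BeautifulSoup"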
Basic usage goes like this:
python mwlinkscrape.py www.dyingsite.com > sitelist.txt
You may also find stuff on subdomains:
python mwlinkscrape.py "*.dyingsite.com" > sitelist.txt
By default, mwlinkscrape will work as follows:
- Grab a page on the Archive Team wiki, maintained by Archive Team volunteers, which contains a list of major MediaWiki installations with the LinkSearch extension installed.
- Scrape each wiki on that list for external links to the given site and dump the URLs on stdout, one per line. (A rough sketch of this step follows below.)
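Here is a rough sketch, in Python 2 (to match Beautiful Soup 3), of what that second step might look like: fetching one wiki's Special:LinkSearch results and pulling the matched URLs out with Beautiful Soup. The query parameters and the class="external" markup are my assumptions about MediaWiki's output, not a description of how mwlinkscrape.py itself is written:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

def linksearch(index_php, target, limit=500):
    # Build a Special:LinkSearch query against the wiki's index.php.
    # (Parameter names assumed; a real run would also need to follow
    # the paginated results, not just the first page.)
    query = urllib.urlencode({
        'title': 'Special:LinkSearch',
        'target': target,
        'limit': limit,
    })
    html = urllib2.urlopen(index_php + '?' + query).read()
    soup = BeautifulSoup(html)
    # Matched links are assumed to be marked with class="external".
    return [a['href'] for a in soup.findAll('a', {'class': 'external'})]

for url in linksearch('http://en.wikipedia.org/w/index.php', '*.dyingsite.com'):
    print url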
Command-line options:
$ python mwlinkscrape.py -h
usage: mwlinkscrape.py [-h] [--default-wikis] [--use-wikis WIKIS]
                       [--output-format FMT] [--verbose]
                       SITE [SITE ...]

Scrape MediaWiki wikis to find links to a given site or domain.

positional arguments:
  SITE                 site or domain to find, may contain wildcards such as
                       *.wikipedia.org

optional arguments:
  -h, --help           show this help message and exit
  --default-wikis      Use a built-in list of 12 major Wikimedia wikis, rather
                       than connecting to the Archive Team wiki to get a list.
  --use-wikis WIKIS    Specify a comma-separated list of wikis to use, rather
                       than grabbing a list from the Archive Team wiki. This
                       must be the path to the MediaWiki index.php script (for
                       example, http://en.wikipedia.org/w/index.php). The
                       LinkSearch extension must be installed on the site.
  --output-format FMT  Use output format FMT, where FMT is either "txt" (plain
                       list of URLs, one per line, the default) or "csv"
                       (comma-separated value files, with the first column
                       being the linked URL, the second being the URL of the
                       wiki article from which it is linked, and the third
                       being the title of the article). Default is "txt".
  --verbose            Print uninteresting chatter to stderr for debugging
                       purposes.
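If you go the --output-format=csv route, the result is easy to read back with Python's csv module. A small sketch, assuming the three columns described above and a sitelist.csv produced by redirecting the csv output to a file:

import csv

# sitelist.csv is assumed to have been produced by something like:
#   python mwlinkscrape.py --output-format=csv www.dyingsite.com > sitelist.csv
with open('sitelist.csv', 'rb') as f:
    for row in csv.reader(f):
        if not row:
            continue
        linked_url, article_url, article_title = row
        print linked_url, 'is linked from', article_title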
Known bugs: none that I know of. That said, as it works by scraping HTML, mwlinkscrape is very prone to breakage if the MediaWiki HTML changes, and it is also likely to break on custom MediaWiki templates.
The list of wikis on the Archive Team wiki could do with expanding.
This was written by Lewis Collard.
The program and this README are in the public domain, to be used, modified, and/or redistributed with no restrictions.
There are no warranties of any kind; use at your own risk.