Scrape any online MediaWiki-powered wiki (like Wikipedia) to your local filesystem


MWoffliner

MWoffliner is a tool for making a local offline HTML snapshot of any online MediaWiki instance. It goes through all articles (or a selection, if specified) and writes the HTML/images to a local directory. It has mainly been tested against Wikimedia projects like Wikipedia and Wiktionary, but it should also work with any recent MediaWiki.

It can write the raw HTML/JS/CSS/PNG... files to the filesystem or pack them all in a highly compressed ZIM file.

Read CONTRIBUTING.md to know more about MWoffliner development.


Prerequisites

  • *NIX Operating System (GNU/Linux, macOS, ...)
  • NodeJS
  • Redis
  • Libzim (on GNU/Linux, binaries are downloaded automatically)
  • Various build tools that are probably already installed on your machine (libjpeg, gcc)

See Environment setup hints to know more about how to install them.

Usage

To install MWoffliner globally:

npm i -g mwoffliner

You might need to run this command with sudo, depending on how your npm is configured.

Then to run it:

mwoffliner --help
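For example, a minimal invocation looks something like the following (the full list of flags is in the --help output; the exact URL and email here are placeholders):

```shell
# Dump the Spanish Wikipedia to the current directory.
# Assumes a Redis server is already running locally.
mwoffliner --mwUrl=https://es.wikipedia.org/ --adminEmail=foo@bar.net
```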

API

MWoffliner also provides an API and can therefore be used as a NodeJS library. Here is a short example:

const mwoffliner = require('mwoffliner');
const parameters = {
    mwUrl: "https://es.wikipedia.org",
    adminEmail: "foo@bar.net",
    verbose: true,
    format: "nozim",
    articleList: "./articleList"
};
mwoffliner.execute(parameters) // returns a Promise
    .then(() => console.log('Dump finished'))
    .catch((err) => console.error(err));
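The articleList parameter points to a plain text file containing one article title per line. For a dump of es.wikipedia.org it might contain, for instance:

```
Madrid
Barcelona
Sevilla
```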

Background

Complementary information about MWoffliner:

  • MediaWiki software is used by tens of thousands of wikis, the most famous being the Wikimedia ones, including Wikipedia.
  • MediaWiki is a PHP wiki runtime engine.
  • Wikitext is the name of the markup language that MediaWiki uses.
  • MediaWiki includes a parser that converts Wikitext into HTML; this parser creates the HTML pages displayed in your browser.
  • There is another Wikitext parser, called Parsoid, implemented in JavaScript/NodeJS. MWoffliner uses Parsoid.
  • Parsoid is planned to eventually become the main parser for MediaWiki.
  • MWoffliner calls Parsoid and then post-processes the result for offline use.

Environment setup hints

macOS

Install NodeJS:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash && \
source ~/.bashrc && \
nvm install stable && \
node --version

Install Redis:

brew install redis

Install libzim: Read these instructions

GNU/Linux - Debian based distributions

Install NodeJS:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash && \
source ~/.bashrc && \
nvm install stable && \
node --version

Install Redis:

sudo apt-get install redis-server

Releasing

  1. Update the version number in package.json
  2. Commit with the message :package: Release version vX.X.X
  3. Run git tag vX.X.X
  4. Run git push origin master --tags
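The steps above translate to the following commands (vX.X.X is a placeholder for the actual version number; run them from the repository root after editing package.json):

```shell
# Release flow sketched as commands; vX.X.X stands for the real version
git add package.json
git commit -m ':package: Release version vX.X.X'
git tag vX.X.X
git push origin master --tags
```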

License

GPLv3 or later, see LICENSE for more details.