MWoffliner is a tool for making a local offline HTML snapshot of any online MediaWiki instance. It goes through all online articles (or a selection, if specified) and creates the corresponding ZIM file. It has mainly been tested against Wikimedia projects such as Wikipedia and Wiktionary, but it should also work with any recent MediaWiki.
Read CONTRIBUTING.md to learn more about MWoffliner development.
- Scrape with or without image thumbnails
- Scrape with or without audio/video multimedia content
- S3 cache (optional)
- Image size optimiser / Webp converter
- Scrape all articles in the selected namespaces, or only those in a title list
- Specify additional/non-main namespaces to scrape
Run mwoffliner --help to get all the possible options.
- *NIX Operating System (GNU/Linux, macOS, ...)
- Redis
- NodeJS version 12 or greater
- Libzim (On GNU/Linux & macOS we automatically download it)
- Various build tools which are probably already installed on your machine (packages libjpeg-dev, autoconf, automake, gcc on Debian/Ubuntu)
... and an online MediaWiki with its API available.
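The NodeJS requirement above can be checked from a script before installing. The following is only a minimal sketch: the helper names nodeMajorVersion and meetsMinimum are hypothetical and are not part of MWoffliner; the minimum of 12 is taken from the list above.

```javascript
// Hypothetical helpers (not part of MWoffliner): check the running NodeJS
// version against the minimum major version listed in the prerequisites.

// process.versions.node looks like "18.12.0" (no leading "v").
function nodeMajorVersion(versionString = process.versions.node) {
  return Number.parseInt(versionString.split('.')[0], 10);
}

function meetsMinimum(versionString = process.versions.node, minimumMajor = 12) {
  return nodeMajorVersion(versionString) >= minimumMajor;
}

if (!meetsMinimum()) {
  console.warn(`NodeJS ${process.versions.node} is older than version 12.`);
}
```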
To install MWoffliner globally:
npm i -g mwoffliner
You might need to run this command with the sudo command, depending on how your npm is configured.
npm permission checking can be a bit annoying for a newcomer. Please read the documentation carefully if you hit problems: https://docs.npmjs.com/cli/v7/using-npm/scripts#user
Then to run it:
mwoffliner --help
To use MWoffliner with an S3 cache, you should provide an S3 URL like this:
--optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
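If the key or secret contains characters such as "/" or "+", assembling this URL by hand is error-prone. Here is a minimal sketch that builds it with the standard URL class. The function name buildS3CacheUrl is hypothetical; the query parameter names (bucketName, keyId, secretAccessKey) are copied from the example above, and it is an assumption that MWoffliner decodes percent-encoded values as a standard URL parser would.

```javascript
// Hypothetical helper (not part of MWoffliner): build the value passed to
// --optimisationCacheUrl from its parts. Parameter names come from the
// example URL above.
function buildS3CacheUrl(endpoint, bucketName, keyId, secretAccessKey) {
  const url = new URL(endpoint);
  // searchParams.set() percent-encodes reserved characters in each value.
  url.searchParams.set('bucketName', bucketName);
  url.searchParams.set('keyId', keyId);
  url.searchParams.set('secretAccessKey', secretAccessKey);
  return url.toString();
}

console.log(buildS3CacheUrl('https://wasabisys.com', 'my-bucket', 'my-key-id', 'my-sac'));
// prints "https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
```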
MWoffliner also provides an API and can therefore be used as a NodeJS library. Here is a stub example:
const mwoffliner = require('mwoffliner');
const parameters = {
  mwUrl: "https://es.wikipedia.org",
  adminEmail: "foo@bar.net",
  verbose: true,
  format: "nopic",
  articleList: "./articleList"
};
mwoffliner.execute(parameters); // returns a Promise
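Since execute returns a Promise, a caller will usually want to wait for it and handle failures. Here is a minimal sketch of one way to do that. The helper names buildParameters and runScrape are hypothetical, the default values are the ones from the stub above, and the execute function is passed in as an argument so the sketch makes no assumption about how the module is loaded.

```javascript
// Hypothetical helpers (not part of MWoffliner).

// Defaults are the values from the stub example; callers override what they need.
function buildParameters(overrides = {}) {
  return {
    mwUrl: 'https://es.wikipedia.org',
    adminEmail: 'foo@bar.net',
    verbose: true,
    format: 'nopic',
    articleList: './articleList',
    ...overrides,
  };
}

// `execute` is any function that takes the parameters and returns a Promise,
// for example: (p) => require('mwoffliner').execute(p)
async function runScrape(execute, overrides = {}) {
  const parameters = buildParameters(overrides);
  try {
    return await execute(parameters);
  } catch (err) {
    // Log some context, then rethrow so the caller can still react.
    console.error(`Scrape of ${parameters.mwUrl} failed: ${err.message}`);
    throw err;
  }
}
```

Passing execute in, rather than requiring the module inside the helper, also makes it easy to substitute a fake function when testing the surrounding code.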
Complementary information about MWoffliner:
- MediaWiki software is used by thousands of wikis, the most famous ones being the Wikimedia ones, including Wikipedia.
- MediaWiki is a PHP wiki runtime engine.
- Wikitext is the name of the markup language that MediaWiki uses.
- MediaWiki includes a parser that converts Wikitext into HTML, and this parser creates the HTML pages displayed in your browser.
- There is another Wikitext parser, called Parsoid, implemented in JavaScript/NodeJS. MWoffliner uses Parsoid.
- Parsoid is planned to eventually become the main parser for MediaWiki.
- MWoffliner calls Parsoid and then post-processes the results for offline use.
Install NodeJS:
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash && \
source ~/.bashrc && \
nvm install stable && \
node --version
Install Redis:
sudo apt-get install redis-server
Older GNU/Linux distributions and/or versions of Node.js might be shipped with a deprecated version of npm. Older versions of npm have incompatibilities with certain versions of Node.js and might simply fail to install the mwoffliner package.
We recommend using a recent version of npm. Recent versions work fine with older Node.js releases such as Node.js 10. First install the packaged version of npm, then use it to install a newer version like this:
sudo npm install --unsafe-perm -g npm
Don't forget to remove the packaged version of npm afterward.