Wikipedia.org XML Dump Importer is a script to import the standard Wikipedia XML dump into a simple Elasticsearch data structure, useful as a local cache for searching and manipulating Wikipedia articles. The datastore structure is designed for ease of use and is not MediaWiki-compatible.
URL: http://dumps.wikimedia.org/
Updates: monthly
- GNU/Linux
- PHP 5.4+ (with the mbstring and simplexml extensions)
- Elasticsearch 2.2+
- php5-curl
- This script is designed to run on the command line, not in a web browser.
- The enwiki download is approximately 9.5GB compressed and will require roughly another 10 times that for the uncompressed data and 2 replicas.
- This script reads the compressed file (see the sketch after these notes).
- The import process took approximately 4 hours on a well-configured quad-core machine with 4GB of memory.
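
As a rough illustration of the last two notes, here is a minimal sketch (not the importer's actual code) of streaming the compressed dump with PHP's bzip2 stream wrapper and XMLReader; the file name is just an example:

    <?php
    // Hypothetical sketch: read the bz2 dump directly, so the large
    // uncompressed XML never has to sit on disk. Requires the bz2 and
    // xmlreader extensions.
    $reader = new XMLReader();
    $reader->open('compress.bzip2://enwiki-20130708-pages-articles.xml.bz2');

    $pages = 0;
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name === 'page') {
            $pages++;                              // one <page> element per article
        } elseif ($reader->name === 'title') {
            echo $reader->readString(), PHP_EOL;   // quick smoke test: print titles
        }
    }
    $reader->close();
    echo "Parsed $pages pages", PHP_EOL;
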
- Install the Elasticsearch-PHP client via Composer (see the notes below).
- Download the appropriate pages-articles XML file, for example enwiki-20130708-pages-articles.xml.bz2.
- bunzip2 the wiki file if you need the raw XML; as noted above, the script itself reads the compressed file.
- Create the wikipedia index
curl -XPUT http://localhost:9200/wikipedia -d '{ "settings" : { "number_of_shards" : 12, "number_of_replicas" : 2 } }'
- Download the script.
- Run the script with 2 arguments: script.php wikifile.bz2 http://localhost:9200 -- this may take several hours.
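
For orientation, here is a hedged sketch of what the indexing step could look like with the Elasticsearch-PHP client and its Bulk API. The field names (title, text, timestamp), the page type, the sample documents, and the batch size are illustrative assumptions rather than the script's actual mapping; only the wikipedia index name comes from the step above.

    <?php
    // Hedged sketch of bulk indexing, not the script's actual code.
    require 'vendor/autoload.php';

    $client = Elasticsearch\ClientBuilder::create()
        ->setHosts(['http://localhost:9200'])
        ->build();

    // A couple of sample documents standing in for parsed <page> elements.
    $pages = [
        1 => ['title' => 'Example A', 'text' => 'Body of article A', 'timestamp' => '2013-07-08T00:00:00Z'],
        2 => ['title' => 'Example B', 'text' => 'Body of article B', 'timestamp' => '2013-07-08T00:00:00Z'],
    ];

    $params = ['body' => []];
    foreach ($pages as $id => $page) {
        // Bulk API: one action line followed by one document line.
        $params['body'][] = ['index' => ['_index' => 'wikipedia', '_type' => 'page', '_id' => $id]];
        $params['body'][] = $page;

        if (count($params['body']) >= 1000) {      // flush every 500 documents
            $client->bulk($params);
            $params = ['body' => []];
        }
    }
    if (!empty($params['body'])) {
        $client->bulk($params);                    // flush any remainder
    }
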
The recommended method to install Elasticsearch-PHP is through Composer.

- Add elasticsearch/elasticsearch as a dependency in your project's composer.json file (change the version to suit your version of Elasticsearch):

    nano composer.json
    { "require": { "elasticsearch/elasticsearch": "~2.0" } }

- Download and install Composer:

    curl -s http://getcomposer.org/installer | php

- Install your dependencies:

    php composer.phar install --no-dev
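
Once the dependencies are installed, the client can be used to search the local cache built by the importer. A minimal sketch, assuming a title field in the simple document structure (the real field names may differ):

    <?php
    // Hedged sketch: query the local cache once the import has finished.
    require 'vendor/autoload.php';

    $client = Elasticsearch\ClientBuilder::create()
        ->setHosts(['http://localhost:9200'])
        ->build();

    $results = $client->search([
        'index' => 'wikipedia',
        'body'  => [
            'query' => ['match' => ['title' => 'anarchism']],
            'size'  => 5,
        ],
    ]);

    foreach ($results['hits']['hits'] as $hit) {
        echo $hit['_source']['title'], PHP_EOL;   // print the matching titles
    }
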
This project is BSD (2-clause) licensed.