/wp2git

Import wikipedia to git

Primary LanguageCOtherNOASSERTION

Small description of wp2git

This project is currently more or less a c++-clone of import.py from Levitation.
It imports pages from wikimedia-export-files into a git repository.

Have a look at their page at http://levit.at/ion/ for any description of
what this thing is for.

I've written it as an inaccurate speed comparison c++ vs. python.
Nothing to blame, I just have had some spare time and was curious.
Another reason is that I'm believing script languages are not the right tool
for everything (and I like C++) and it seemed this could be a small proof
for that.

Credits are going to the Levitation developers. They have developed the way
how to import the pages into git, I just have rewritten their python script
in C++.

As a note, I'm not really interested in the topic Wikipedia, so I don't know
if I will maintain this program as much as others would like it. This means the
levitation project will most likely a much better contact if you are interested
in the topic of importing wikimedia stuff into git.


To compile it you need cmake.

Build the release-version:
cmake -DCMAKE_BUILD_TYPE=release
make

Afterwards you can keep your box busy with

user@box $ mkdir -p /SeveralGBfree/dewiki.git/.git
user@box $ GIT_DIR=/SeveralGBfree/dewiki.git/.git git init
user@box $ 7z e -bd -so dewiki-20091223-pages-meta-history.xml.7z | ./wp2git -r 62133749 -t /SeveralGBfree/mytempfile | GIT_DIR=/SeveralGBfree/dewiki.git/.git
user@box $ rm /SeveralGBfree/mytempfile
user@box $ GIT_DIR=/SeveralGBfree/dewiki.git/.git git gc --aggressive
user@box $ GIT_DIR=/SeveralGBfree/dewiki.git/.git git reset --hard HEAD

Warning: Running wp2git on large files like dewiki will take very long,
will need a lot of memory (4 GB aren't enough) and diskspace somewhat
around 50 GB (I guess). I haven't tried it by myself upto now.


A help is displayed with ./wp2git -h



Build the debug-version:
cmake -DCMAKE_BUILD_TYPE=debug
make

To get a verbose output during compilation, use
VERBOSE=1 make

Run make edit_cache to modify some varibles.

To compile it you will need some boost-dev-packages.

For Debian 5.0 (Lenny) the runtime dependencies are

libboost-date-time1.35.0
libboost-iostreams1.35.0
libboost-program-options1.35.0

and maybe some more or less boost libraries. I don't keep
this list recent.

To build a binary without any dependencies on boost libraries,
change the line

SET(Boost_USE_STATIC_LIBS OFF)

in CMakeLists.txt to

SET(Boost_USE_STATIC_LIBS ON)

The binary will be somewhat larger, but should be easier to distribute.

You will need at least gcc 3.x, if you don't use gcc 4.x, you have
to remove -std=gnu++0x from the compiler-flags in CMakeLists.txt.

If you really want to compile it with a gcc below 3.x you have to replace
the unordered_map with something like

size_t operator()(std::string const &str) const {
    return __gnu_cxx::hash<char const *>()(str.c_str());
}
typedef __gnu_cxx::hash_map<std::string, Element> MapKeyString;

That will not work, but should show you the direction what to use as a
replacement for the unordered_map. And there might be some other stuff
you have to change to compile it using a gcc < 3.x.