PomeGranate is just an experiment in order to provide a fault tolerant MapReduce framework built on top of MPI and the HTTP protocol, through the use of Python programming language.
In the source tarball we also provide a simple application of the framework used to create a TF-IDF index of Wikipedia.
In order to compile and succesfully run this application you need:
- A Python interpreter targeting 2.x series
- A C compiler preferably GNU GCC
- An MPI implementation.
The application was tested under:
- An Intel Core 2 (T7600) Linux machine with 3.1.0-x86_64 kernel
- Python 2.7.2
- gcc (GCC) 4.6.2
- mpich2 1.4-1
In order to install the proper requirements for the Python interpreter I really suggest you to use virtualenv and pip utilities.
pip install mpi4py
(1.2.2 - used for coordinating workers)pip install jinja2
(2.6 - used for the Web interface)
If you want you can optionally install the fsdfs module (fucking simple
distributed filesystem) in order to get some performance boost in case
you are using a slow distributed filesystem like NFS. To do it simply
point your shell into lib/fsdfs
and type python setup.py install
.
Of course you have to enable the specific toggle option in the json
configuration file.
In order to compile the C implementation of the indexer you need libarchive and glib2. Take as reference your package manager manual.
- libarchive (2.8.5-2 - http://libarchive.googlecode.com/)
- glib2 (2.30.1-1 - http://www.gtk.org/)
In order to fetch the software with all the dependencies just type in your console:
$ git clone git://github.com/nopper/PomeGranate.git
$ cd PomeGranate
I strongly suggest you to use virtualenv in order to test PomeGranate without impacting your python installation globally:
$ virtualenv env
$ . env/bin/activate
New python executable in env/bin/python
Installing setuptools............done.
Installing pip...............done.
$ python setup.py install
[...]
In case you don't have a distributed FS like NFS or AFS you can simply use FSDFS which is supported by PomeGranate:
$ git submodule init
$ git submodule update
$ cd lib/fsdfs
$ python setup.py install
In the source distribution you will find a simple MapReduce application for
Reverse Index creation (here comes the name). A part of the application is
developed in C language. In order to build the mapper and the reducer
components you need make
, gcc
, glib-2.0
, libarchive
:
$ cd apps/ri/indexer/
$ make
The Makefile is hardcoded so if you are not able to get a clean build try to play a little bit with the parameters. After the build you should get two executables namely:
apps/ri/indexer/map/mapper
apps/ri/indexer/reduce/reducer
The absolute path to these executables should match the map-executable
and reduce-executable
parameters in the JSON configuration file. Another
parameter you would like to change is limit-size
an integer value which
limits the maximum number of KB that the mapper will use to keep in RAM
all the information before flushing all the data to an output file.
The configuration file is a simple JSON file that express various parameters that will be used by various parts of the application. The parameters are:
num-mapper
: integer indicating the number of mapper to usenum-reducer
: integer indicating the number of reducer to usemachine-file
: string indicating the machine file to use for deploying workers through MPI process abstractionmain-module
: string indicating the Python main module to useinput-module
: string indicating the Python Input module to usemap-module
: string indicating the Python Map module to usereduce-module
: string indicating the Python Reduce module to usethreshold-nfile
: integer indicating the maximum number of files that can be reduced in a row.ping-interval
: integer indicating the interval in seconds between two consecutive ping probes.sleep-interval
: numeric indicating seconds between two consecuting work requests from the generic workermaster-host
: string indicating the IP on which the server will bind tomaster-port
: integer indicating the port on which the server will listen tomaster-url
: string indicating a URI in the formhttp://<master-host>:<master-port
datadir
: a string indicating the directory where all files (both inputs and outputs) are be storedinput-prefix
: a string indicating a suffix for the datadir parameter. All the input files should be located in<datadir>/<input-prefix>
.output-prefix
: a string indicating a suffix for the datadir parameter. All the output files will be stored inside<datadir>/<output-prefix>
.
The following parameters are related to the DFS module:
dfs-enabled
: boolean indicating if DFS support will be useddfs-host
: string indicating the IP on which the node will listen ondfs-startport
: integer indicating a TCP port. Every worker will listen ontostartport
+unique_worker_id
.dfs-conf
: a dictionary which include valid parameters for configuring FSDFS nodes.