This repository contains a sample application built on top of Arabesque. It contains:
- A pre-configured pom.xml file for easy building with Maven.
- Execution scripts and configuration files for easily running your applications.
- A sample data file in the format Arabesque expects.
You may use it as a starting point for developing your own algorithms in Arabesque.
Arabesque and this skeleton project are open source under the Apache 2.0 license.
To compile this project, you need:
To run the compiled application, you need:
- A 64-bit JVM running under Linux or Mac.
- A functioning installation of Hadoop2 with MapReduce (local or in a cluster)
Fork this project on GitHub (don't forget to change the repository name!) or clone it manually by executing the following:
git clone https://github.com/Qatar-Computing-Research-Institute/Arabesque-Skeleton.git $PROJECT_PATH
cd $PROJECT_PATH
git remote rename origin upstream
git remote add origin $YOUR_REPO_URL
You should then edit the pom.xml file, paying particular attention to the following lines:
<groupId>org.example</groupId>
<artifactId>arabesque-skeleton</artifactId>
<version>1.0</version>
<name>Arabesque Skeleton</name>
<description>Skeleton for a new project using the Arabesque system</description>
Give it a descriptive name and description and make sure to change the group and artifact ids.
You should also change the following line in scripts/run_arabesque.sh to match your new artifactId:
PROJECT_NAME="arabesque-skeleton"
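If you prefer to script this rename, the following is a minimal sketch. It assumes the run script contains a single line starting with PROJECT_NAME=, as shown above. The helper name set_project_name is hypothetical and not part of the skeleton.

```shell
# Hypothetical helper (not part of the skeleton): rewrite the PROJECT_NAME
# line of a run script so that it matches a new artifactId.
# Assumes the artifactId contains no "/" or "&" characters, which would
# need escaping in the sed expression.
# Usage: set_project_name <script> <artifactId>
set_project_name() {
    script="$1"
    name="$2"
    tmp="${script}.tmp.$$"
    # Write to a temporary file instead of using `sed -i`, whose syntax
    # differs between GNU (Linux) and BSD (Mac) sed.
    sed "s/^PROJECT_NAME=.*/PROJECT_NAME=\"${name}\"/" "$script" > "$tmp" || return 1
    # Copy the contents back so the script keeps its original permissions.
    cat "$tmp" > "$script" && rm -f "$tmp"
}
```

For example, `set_project_name scripts/run_arabesque.sh my-app` would set the line to PROJECT_NAME="my-app", where my-app stands in for your own artifactId.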
Your application code should go under src/main/java. Included in this skeleton is a sample implementation of Clique Finding, which you might find a useful starting point for your own implementations. Make sure to rename the package and class according to your purposes.
You may compile this project like any other Maven-based project.
If you execute the following command at the root of the project (where the pom.xml file is located)
mvn package
Maven will compile and package your application. The resulting jar will be located under the target directory:
target/<artifactId>-<version>-jar-with-dependencies.jar
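To check that the build produced the jar you expect, a small sketch such as the one below can help. The helper name expected_jar is hypothetical; it only applies the naming pattern shown above to the artifactId and version you pass in.

```shell
# Hypothetical helper: print the jar path Maven is expected to produce for
# the given artifactId and version, following the naming pattern above.
# Usage: expected_jar <artifactId> <version>
expected_jar() {
    printf 'target/%s-%s-jar-with-dependencies.jar\n' "$1" "$2"
}

# Example (run from the project root after `mvn package`):
#   jar="$(expected_jar arabesque-skeleton 1.0)"
#   if [ -f "$jar" ]; then echo "Found $jar"; else echo "Missing $jar" >&2; fi
```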
- On a machine with access to a Hadoop cluster, create a directory where you'll put everything necessary to execute your computation:
mkdir example
cd example
- Put all the following files in that directory (using SCP/FTP/...):
  - <artifactId>-<version>-jar-with-dependencies.jar
  - scripts/run_arabesque.sh
  - scripts/cluster.yaml
  - scripts/application.yaml
  - An input graph with the correct input format as expected by Arabesque:
# <num vertices> <num edges>
<vertex0Id> <vertex0Label> [<neighbour00Id> <neighbour01Id> ...]
<vertex1Id> <vertex1Label> [<neighbour10Id> <neighbour11Id> ...]
...
Vertex ids should be in the range between 0 and (number of vertices - 1). A sample graph is under the data directory.
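Before uploading a graph, it can be worth checking it against the format described above. The following is a rough sketch, not an official Arabesque tool. It assumes a header line of the form "# <num vertices> <num edges>" followed by one line per vertex, and it only checks the vertex count and the id range. It does not verify the edge count. The helper name validate_graph is hypothetical.

```shell
# Hypothetical helper: sanity-check a graph file before uploading it.
# Assumes a header line "# <num vertices> <num edges>" followed by one
# line per vertex: "<id> <label> [<neighbour ids> ...]".
# Problems are reported on stderr; the exit status is 0 if none were found.
# Usage: validate_graph <file>
validate_graph() {
    awk '
        BEGIN { n = 0; vertices = 0; bad = 0 }
        NR == 1 {
            if (NF != 3 || $1 != "#" || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/) {
                print "line 1: expected \"# <num vertices> <num edges>\""
                bad = 1
                exit
            }
            n = $2 + 0
            next
        }
        NF == 0 { next }    # ignore blank lines
        {
            vertices++
            if (NF < 2) { printf "line %d: missing vertex label\n", NR; bad = 1 }
            # Field 2 is the label; every other field must be a vertex id.
            for (i = 1; i <= NF; i++) {
                if (i == 2) continue
                if ($i !~ /^[0-9]+$/ || $i + 0 >= n) {
                    printf "line %d: id \"%s\" is not in 0..%d\n", NR, $i, n - 1
                    bad = 1
                }
            }
        }
        END {
            if (NR == 0) { print "empty file"; bad = 1 }
            else if (!bad && vertices != n) {
                printf "expected %d vertex lines, found %d\n", n, vertices
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    ' "$1" >&2
}
```

A call such as `validate_graph my.graph && echo "looks fine"` prints any problems it finds to standard error, where my.graph is a placeholder for your own file.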
- Upload your input graph to HDFS. Make sure you have initialized HDFS first.
hdfs dfs -put <input graph file> <destination graph file in HDFS>
- Change the settings in cluster.yaml and application.yaml to match your cluster, application and data settings (input_graph_path should point to the final path of the graph in HDFS according to the previous step).
- To start your computation, execute the following (you should probably clean the output directory first):
./run_arabesque.sh cluster.yaml application.yaml
- You can check the logs of the Hadoop containers for progress information.
- When finished, you can consult the results in the output_path HDFS directory, as specified in the application.yaml configuration file.
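The clean-up and launch step above can be combined in a small wrapper. This is only a sketch under stated assumptions: the function names run_cmd and clean_and_run and the DRY_RUN variable are hypothetical, and the output path is passed in by hand rather than read from application.yaml, so it must match the output_path value you configured there.

```shell
# Hypothetical wrapper (not part of the skeleton): remove a previous output
# directory from HDFS, then launch the computation.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_cmd() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: clean_and_run <output_path in HDFS>
clean_and_run() {
    output_path="$1"
    # -f keeps `rm` from failing when the directory does not exist yet.
    run_cmd hdfs dfs -rm -r -f "$output_path" &&
        run_cmd ./run_arabesque.sh cluster.yaml application.yaml
}
```

Running it once with DRY_RUN=1 shows exactly which commands would be executed before anything is deleted from HDFS.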