- Panagiotis Papaemmanouil, 56
- Dimitris Tourgaidis, 66
In this project, you will work with graph data. Given a potentially large probabilistic network and an integer
k, you must design and implement a solution that discovers the top-k most probable triangles. A probabilistic
network is a network whose edges are annotated with existential probabilities: an edge e is
present with probability p(e). Therefore, if a triangle is formed by the edges a, b, and c, then the
existential probability of the triangle is p(a) * p(b) * p(c).
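As a minimal illustration of this scoring (a sketch only: the toy edges, their probabilities, and the heap-based top-k over explicitly listed triangles are made-up examples, not the required solution):

```python
import heapq

# Hypothetical edge probabilities p(e) for a toy network.
p = {"ab": 0.9, "bc": 0.8, "ca": 0.7, "bd": 0.5, "dc": 0.4}

# Triangles listed as triples of their edges (also hypothetical).
triangles = [("ab", "bc", "ca"), ("bd", "dc", "bc")]

def triangle_prob(t):
    """Existential probability of a triangle: the product of its edge probabilities."""
    a, b, c = t
    return p[a] * p[b] * p[c]

# Top-k most probable triangles, here with k = 1.
k = 1
topk = heapq.nlargest(k, triangles, key=triangle_prob)
print(topk)  # [('ab', 'bc', 'ca')], with probability 0.9 * 0.8 * 0.7 = 0.504
```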
- Python: https://www.python.org/
- Spark: https://spark.apache.org/docs/latest/index.html
- PySpark: https://spark.apache.org/docs/latest/api/python/index.html
- Graphframes spark package: https://graphframes.github.io/graphframes/docs/_site/quick-start.html
Create a Docker container based on the Ubuntu base image and install Java (JRE and JDK), Spark, and Python.
# Create docker container based on ubuntu base image and get a console inside it
sudo docker run -it --name ApacheSpark -p 1234:1234 ubuntu /bin/bash
Download and unpack Spark (version: spark-3.1.1-bin-hadoop2.7.tgz) on the host machine from https://spark.apache.org/downloads.html
# Copy spark from host machine to container
sudo docker cp ./path_to_spark/spark-3.1.1-bin-hadoop2.7 ApacheSpark:/
Set up the container. Run the following commands from inside the container.
# Update container packages
apt update
# Install java
apt install default-jre
# Install jdk compiler
apt install default-jdk
# Set up environment variables
apt-get install nano
nano /etc/environment
# Add these lines to `/etc/environment` file
# SPARK_HOME assumes spark was copied to the container root (see the docker cp step above)
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export SPARK_HOME=/spark-3.1.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
export PYTHONPATH=$SPARK_HOME/python
# Save and Close file
# Source changes
source /etc/environment
# Install python
apt-get install python3.8
apt-get install python3.8-venv
apt-get install python3-pip
# Install numpy
python3.8 -m pip install numpy
Check installations
# Check versions
java --version
javac --version
# Check python
python3.8 --version
# Check pyspark and spark-shell interactive terminals
pyspark
spark-shell
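Optionally, a short PySpark job can confirm the whole stack works end to end. This is a minimal sketch; the file name test_spark.py is made up:

```python
# test_spark.py -- hypothetical file name; run with: spark-submit test_spark.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()

# Build a tiny DataFrame and count it to exercise the executors.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
print(df.count())  # expect 3

spark.stop()
```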
Helpful Docker commands
# List running containers
docker ps
# List all containers, including stopped ones
docker ps -a
# Start, stop, or remove the container
docker start ApacheSpark
docker stop ApacheSpark
docker rm ApacheSpark
How to run the src/ scripts using spark-submit
spark-submit --master local[3] --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 src/graphframe_bs.py 2 cache
spark-submit --master local[3] --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 src/rdd_bs.py 2
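For orientation, the GraphFrames-based approach boils down to enumerating triangles with a motif query and ranking them by probability. The sketch below is illustrative only; the toy graph, the probability column name p, and the rotation-deduplication filter are assumptions, not the contents of src/graphframe_bs.py:

```python
from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.appName("topk-triangles-sketch").getOrCreate()

# Toy probabilistic graph; each edge carries an existential probability p.
vertices = spark.createDataFrame([(i,) for i in range(4)], ["id"])
edges = spark.createDataFrame(
    [(0, 1, 0.9), (1, 2, 0.8), (2, 0, 0.7), (1, 3, 0.5), (3, 2, 0.4)],
    ["src", "dst", "p"],
)
g = GraphFrame(vertices, edges)

# Enumerate directed triangles with a motif query, keep one rotation per
# cycle, and score each triangle by p(a)*p(b)*p(c).
k = 2
topk = (
    g.find("(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)")
    .filter("a.id < b.id AND a.id < c.id")
    .withColumn("prob", F.col("e1.p") * F.col("e2.p") * F.col("e3.p"))
    .orderBy(F.desc("prob"))
    .limit(k)
)
topk.select(
    F.col("a.id").alias("a"),
    F.col("b.id").alias("b"),
    F.col("c.id").alias("c"),
    "prob",
).show()
```

Run it with the same --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 flag as the commands above, so the graphframes dependency is resolved.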
spark-submit accepts the --packages flag and searches for the requested packages.
These packages can be found locally or downloaded over the internet. For the graphframes
Spark package specifically, the download process isn't straightforward, because only the source code was available online, not the .jar file.
- Check here for the source code: https://spark-packages.org/package/graphframes/graphframes
- Check the jars/ directory of this repository for the prebuilt graphframes .jar file: jars/graphframes_graphframes-0.8.1-spark3.0-s_2.12.jar
The solution was to manually build the .jar file and copy it to one of the paths where spark-submit resolves packages. For example, copy the .jar file to one of the following paths.
# Copy the graphframes.jar file to one of the following paths
/home/user7/.ivy2/local/graphframes/graphframes/0.8.1-spark3.0-s_2.12/jars/graphframes.jar
/home/user7/.m2/repository/graphframes/graphframes/0.8.1-spark3.0-s_2.12/graphframes-0.8.1-spark3.0-s_2.12.jar
To suppress warnings during Apache Spark execution, you need to modify the conf/log4j.properties
file.
cd $SPARK_HOME/conf
cp log4j.properties.template log4j.properties
nano log4j.properties
# and change the following line
# from: log4j.rootCategory=INFO, console
# to:   log4j.rootCategory=ERROR, console
SNAP Artists dataset: https://snap.stanford.edu/data/gemsec-Facebook.html
We evaluate our implementations on a real-world dataset provided by the Stanford Network Analysis Project repository (SNAP, https://snap.stanford.edu/index.html): the artists dataset (https://snap.stanford.edu/data/gemsec-Facebook.html). The dataset refers to blue-verified Facebook pages and was collected in November 2017. The network's nodes represent the artists' pages and its edges represent the mutual likes among them.
- Nodes: 50,515
- Edges: 819,306
- Total triangles: 2,273,700
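The SNAP graph itself carries no probabilities, so edge probabilities have to be attached before the triangle search. One simple option is shown below as a sketch; the file name artist_edges.csv, its column layout, and the use of uniform random probabilities are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("load-artists").getOrCreate()

# File and column names are assumptions about the unpacked SNAP archive.
edges = (
    spark.read.csv("artist_edges.csv", header=True, inferSchema=True)
    .toDF("src", "dst")
    .withColumn("p", F.rand(seed=42))  # synthetic existential probabilities
)
print(edges.count())  # expect 819,306
```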