/m2-spark-hadoop-pig

Learning to use Spark, Hadoop, and Pig on GCP with Python, with some benchmarks of the PageRank algorithm.


PageRank on Google Cloud Platform using Pig and Spark

Malo GRALL

Alex MAINGUY

Mathis ROCHER

Method

Small dataset

Figure: Pig vs Spark PageRank algorithm, small dataset (in ms)

Table: Pig vs Spark PageRank algorithm, small dataset

Big dataset

The differences between Pig and Spark should be more visible on the big dataset, but we did not set up optimal partitioning, so they do not show clearly in our results.
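To illustrate why partitioning matters here, the toy model below counts how many records would have to move between partitions when the link data and the rank data are joined on the page id, which PageRank does at every iteration. It is a pure-Python sketch, not the code used for these benchmarks: the `partition_of` and `records_to_move` helpers, the `salt` trick used to mimic two different partitioners, and the `page0` ... `page999` ids are all assumptions made for the example. In Spark, a comparable effect is usually sought by partitioning both pair RDDs with the same partitioner (for example with `partitionBy`).

```python
import hashlib


def partition_of(key, num_partitions, salt=""):
    """Deterministic hash partitioner (a stand-in for a real engine's partitioner)."""
    digest = hashlib.md5((salt + key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def records_to_move(keys, num_partitions, links_salt, ranks_salt):
    """Count keys whose link record and rank record sit in different partitions.

    Each such key needs one record sent elsewhere before the join can run.
    """
    return sum(
        1
        for key in keys
        if partition_of(key, num_partitions, links_salt)
        != partition_of(key, num_partitions, ranks_salt)
    )


# Hypothetical page ids, only for illustration.
pages = [f"page{i}" for i in range(1000)]

# Same partitioner on both sides: every key is already co-located.
co_partitioned = records_to_move(pages, 4, "", "")

# Different partitioners (simulated with a salt): many keys must move.
mismatched = records_to_move(pages, 4, "", "other-")

print("records to move, same partitioner:", co_partitioned)
print("records to move, different partitioners:", mismatched)
```

With the same partitioner on both sides the count is zero by construction, so the join can run without moving data. With mismatched partitioners, most records have to be exchanged at every iteration, which is the kind of overhead that can hide the difference between engines.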

Figure: Pig vs Spark PageRank algorithm, big dataset (in ms)

Table: Pig vs Spark PageRank algorithm, big dataset

Problems

To get clearer results, we used GCP's Cloud Logging for Python to gather the Spark results and save them in a separate file instead of printing them in the terminal.
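The idea of keeping timings in a separate file rather than in terminal output can be sketched as follows. This is a minimal example that writes to a local CSV file with the standard library only. It does not use Cloud Logging, and it is not the code from this project: the `timed` and `append_result` helpers, the `placeholder_job` function, and the `spark_pagerank_timings.csv` file name are all hypothetical.

```python
import csv
import time
from pathlib import Path


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def append_result(path, engine, dataset, elapsed_ms):
    """Append one timing row to a CSV file, writing the header on first use."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["engine", "dataset", "elapsed_ms"])
        writer.writerow([engine, dataset, f"{elapsed_ms:.1f}"])


def placeholder_job():
    """Stand-in for a real PageRank run (assumption: any callable works here)."""
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    _, elapsed = timed(placeholder_job)
    append_result("spark_pagerank_timings.csv", "spark", "small", elapsed)
```

Keeping one row per run makes it straightforward to compare engines and datasets afterwards without searching through job output.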

With Pig, we had trouble debugging because the logs were not easily accessible in GCP. The Logging menu showed some logs, but they only contained the terminal output.