Scraping and analysis of the arXiv.org database, comparing the performance of PostgreSQL, Hive and Spark

Analysis of the arXiv.org database of scientific papers

MSAN 694 - Distributed Computing
Team: D. Wen, A. Romriell, J. Pastor, J. Pollard

Data Description:

Source: arXiv Electronic Archive of Scientific Papers
We analyzed the entire arXiv.org database (1.6 GB):

  • 1.26 million papers
  • 600,000 authors
  • 86,262,827 words

Goal:

Exploratory data analysis and community detection, implemented with three different technologies (PostgreSQL, Hive and Spark) to compare their performance.
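How the running times were measured is not described here. As a minimal sketch of one way to time the same task on several platforms, assume each platform's query is wrapped in a Python callable. The names `time_task` and `run_query` are hypothetical and not taken from this repository.

```python
import time
from statistics import median


def time_task(task, repeats=3):
    """Run `task` `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return median(durations)


# Hypothetical stand-in for a real query; in practice this callable would
# submit the query to PostgreSQL, Hive or Spark and wait for the result.
def run_query():
    return sum(i * i for i in range(100_000))


elapsed = time_task(run_query)
print(f"median running time: {elapsed:.4f} s")
```

Taking the median over a few repeats is one way to dampen noise from caching and cluster warm-up; a single run per task would also work for coarse comparisons.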

Experimental environment:

Local:

MacBook Pro, 2.7 GHz Intel Core i5, 16 GB 1867 MHz DDR3

Distributed:

4-node cluster of r3.xlarge instances (160GB), release emr-4.6.0
Hadoop distribution: Amazon 2.7.2
Applications: Hive 1.0.0, Pig 0.14.0, Spark 1.6.1

Summary results:

The following table summarizes the running time (in seconds) of each task on each platform (PostgreSQL, Hive and SparkSQL).

As the queries grew in complexity (tasks 4 and 5), Hive and SparkSQL performed drastically better than PostgreSQL. In particular, we observed a reduction in running times of between 80% and 97% using SparkSQL.
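For reference, a reduction in running time is the difference between the baseline time and the new time, divided by the baseline time. The sketch below shows that arithmetic. The function name `percent_reduction` and the timings passed to it are hypothetical illustrations, not measurements from this project.

```python
def percent_reduction(baseline_seconds, new_seconds):
    """Return the percentage by which `new_seconds` is lower than `baseline_seconds`."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Hypothetical timings chosen only to illustrate the formula.
print(percent_reduction(200.0, 40.0))  # 80.0
```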

We concluded that, in the context of this problem, SparkSQL was the best tool given its speed, ease of use and flexibility.