Analysis of the arXiv.org database of scientific papers
MSAN 694 - Distributed Computing
Team: D. Wen, A. Romriell, J. Pastor, J. Pollard
Data Description:
Source: ArXiv Electronic Archive of Scientific Papers
We analyzed the entire database of arXiv.org (1.6GB):
- 1.26 million papers
- 600,000 authors
- 86,262,827 words
Goal:
Exploratory data analysis and community detection, implemented with three different technologies (PostgreSQL, Hive, and Spark) to compare their performance.
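As a rough illustration of what community detection on this kind of data can look like, the sketch below builds a co-authorship graph from per-paper author lists and groups authors into connected components. This is an assumption made for illustration only: the summary does not say which algorithm the team used, and connected components is just the simplest possible grouping. The function names and the toy `papers` list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_edges(papers):
    """Yield one undirected edge per pair of authors who share a paper."""
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            yield a, b


def connected_components(papers):
    """Group authors linked by any chain of co-authorship."""
    adjacency = defaultdict(set)
    for authors in papers:
        for author in authors:
            adjacency[author]  # register every author, including solo authors
    for a, b in coauthor_edges(papers):
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal from an author not yet assigned to a group.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(sorted(group))
    return components


# Hypothetical toy input: each inner list is the author list of one paper.
papers = [["A", "B"], ["B", "C"], ["D", "E"], ["F"]]
communities = connected_components(papers)
```

On a graph the size of the full archive, the same idea would normally be expressed as joins over an author-paper table in the SQL engines or as a graph job in Spark, rather than as an in-memory traversal.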
Experimental environment:
Local:
MacBook Pro, 2.7 GHz Intel Core i5, 16 GB 1867 MHz DDR3
Distributed:
4-node cluster of r3.xlarge instances (160GB), release emr-4.6.0
Hadoop distribution: Amazon 2.7.2
Applications: Hive 1.0.0, Pig 0.14.0, Spark 1.6.1
Summary results:
The following table summarizes the running time (in seconds) of each task on each platform (PostgreSQL, Hive, and SparkSQL):
As the queries grew in complexity (4 and 5), Hive and SparkSQL performed drastically better than PostgreSQL. In particular, we observed reductions in running time of between 80% and 97% using SparkSQL.
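For reference, a reduction in running time is normally computed relative to a baseline measurement. The sketch below shows one way to time a task and express the improvement as a percentage. The helper names are hypothetical, and the numbers in the usage lines are placeholders chosen to show the arithmetic, not measurements from this project.

```python
import time


def time_task(task):
    """Run a zero-argument callable once and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def percent_reduction(baseline_seconds, new_seconds):
    """Percentage by which new_seconds is lower than baseline_seconds."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Placeholder timings (NOT project results): 100 s baseline vs. 20 s and 3 s.
low_end = percent_reduction(100.0, 20.0)
high_end = percent_reduction(100.0, 3.0)

# Timing an arbitrary callable; a real comparison would wrap a query call here.
elapsed = time_task(lambda: sum(range(1000)))
```

A single run is sensitive to caching and cluster warm-up, so repeating each task several times and reporting the median is a common precaution.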
We concluded that, in the context of this problem, SparkSQL was the best tool given its speed, ease of use, and flexibility.