Introduction

We built a graph database of indexed science journal articles by scraping http://arxiv.org/

We have around 2000 articles, all in the realm of high energy physics and quantum cosomology.

Our scraping functions ensure a graph as complete as the information arxiv gives us.

We implemented search algorithms to allow the user to search using different ranking criteria.

Front-end: a search box which allows the user to choose ranking functions

Simple boolean match, ranking by article id
Term frequency (TF) ranking
TFIDF ranking
PageRank
PageRank and TFIDF (most accurate and relevant)

Users can "explore" a topic by traversing our graph.

They can view related articles (generated by graph properties and ranked by Levenshtein distance).
They can also traverse by author in the following ways:
View all articles given by a given author.
View a list of authors related to a given author. (Determined by how related their publications are to each other)

Setting Up

Make sure you have the following dependencies:

Ruby version 1.9.3
Rails version 3.2.x

You can check your operating version of Ruby and Rails with the commands

ruby -v
rails -v

If you have the Ruby Version Manager (RVM), you can switch to the right version with

rvm use 1.9.3

Next, run these commands in order

git clone git@github.com:fireblade99/journalgraph.git
cd journalgraph/db
wget https://dl.dropboxusercontent.com/u/45072018/development.sqlite3
cd ..
bundle install
rake db migrate
rails s

To view the running app, open up a browser to the url http://localhost:3000

Database representation

Nodes are articles and authors. Edges are articles-article, article-author, and author-author relations.

Article nodes hold metadata such as publication date, and subject category. They also hold information to aid the search engine:

abstract keywords via automatic summarization
ranking factors, such as PageRank

PageRank

PageRank is a measure of article importance. Highly ranked articles tend to either have many incoming citations from other articles, or are cited by highly cited articles.

To compute PageRank, we first generate the transition probability matrix for our graph. Here is an example transition probability matrix, where each directed edge coming out of a node is weighted by 1/n, and n is that node's total number of outgoing edges.

     a   b   c   d
 a | 0   0  .33  0  |
 b | 1   0  .33  0  |
 c | 0   0   0   0  |
 d | 0   1  .33  0  |

This matrix represents the following graph:
 a ----> b
 ^     /^|
 |   /   |
 | /     |/
 c ----> d

Each row in the matrix corresponds to an article which is being cited. As you can see, article a cites article b, so its edge has weight 1. However, article c cites articles a, b, and c, so each edge has weight 1/3.

TFIDF

TFIDF is a query dependent method of ranking articles by relevance.

TFIDF = TF * IDF

, where TF is proportional to the number of times a term occurs in an article , and IDF is a measure of how special a term is to a given article relative to the other articles in the query.

Ranking