PageRank

Use Scrapy to collect page-relationship information and build a PageRank dataset.

Use Hadoop and the dataset collected by Scrapy to implement the PageRank algorithm.

Collect the PageRank Dataset

We use Scrapy to collect the PageRank dataset. The related code is located in the scrapy/ directory.

Usage

  1. Install Scrapy first:
pip install scrapy
  2. Run Scrapy inside scrapy/:
cd scrapy
scrapy crawl pagerank
  3. Change start_urls and allowed_domains (optional)

The default is start_urls=['https://github.com/']; we can change it with:

scrapy crawl pagerank -a start='https://stackoverflow.com/'

We can also restrict the crawl to a single domain:

scrapy crawl pagerank -a start='https://stackoverflow.com/' -a domain='stackoverflow.com'

Be careful: domain must match start; if you don't set start, it must match the default start_urls.
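For reference, Scrapy passes each -a key=value pair to the spider's constructor as a keyword argument. A minimal sketch of a spider wired up this way (the actual spider in scrapy/ may differ; the parse logic here is an assumption):

import scrapy
from scrapy.linkextractors import LinkExtractor

class PageRankSpider(scrapy.Spider):
    name = 'pagerank'
    start_urls = ['https://github.com/']

    def __init__(self, start=None, domain=None, *args, **kwargs):
        # -a start=... and -a domain=... arrive here as kwargs
        super().__init__(*args, **kwargs)
        if start:
            self.start_urls = [start]
        if domain:
            self.allowed_domains = [domain]

    def parse(self, response):
        # follow every outgoing link; the collected edges feed the dataset
        for link in LinkExtractor().extract_links(response):
            yield response.follow(link, callback=self.parse)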

Dataset

The data collected by the spider is stored in three files: keyvalue, transition, and pr0.

keyvalue records each URL and its integer key, separated by '\t':

0\thttps://github.com/\n
1\thttps://github.com/features\n
2\thttps://github.com/business\n
3\thttps://github.com/explore\n
...

transition records the relationships between pages:

0\t0,1,2,3,4,5,6,7,8,9\n
...

The page with id 0 points to the pages with ids 0,1,2,3,4,5,6,7,8,9; the source id and its target list are separated by '\t', and the target ids are separated by ','.

pr0 records the initial PageRank value for each page:

0\t1\n
1\t1\n
2\t1\n
...
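For reference, a minimal sketch of an item pipeline that could emit these three files (the item fields 'url' and 'outlinks', and all names here, are assumptions, not the repository's actual code):

class PageRankPipeline:
    def open_spider(self, spider):
        self.ids = {}    # url -> integer key
        self.links = {}  # source id -> set of target ids

    def key_for(self, url):
        # assign integer keys in first-seen order
        if url not in self.ids:
            self.ids[url] = len(self.ids)
        return self.ids[url]

    def process_item(self, item, spider):
        src = self.key_for(item['url'])
        for out_url in item['outlinks']:
            self.links.setdefault(src, set()).add(self.key_for(out_url))
        return item

    def close_spider(self, spider):
        # keyvalue: "id\turl"
        with open('keyvalue', 'w') as f:
            for url, key in self.ids.items():
                f.write(f'{key}\t{url}\n')
        # transition: "src\tdst1,dst2,..."
        with open('transition', 'w') as f:
            for src, targets in self.links.items():
                f.write(f"{src}\t{','.join(map(str, sorted(targets)))}\n")
        # pr0: every page starts with rank 1
        with open('pr0', 'w') as f:
            for key in self.ids.values():
                f.write(f'{key}\t1\n')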

Later, we will put transition and pr0 into HDFS and use Hadoop to calculate PageRank.

Implement the PageRank Algorithm

We use Hadoop (MapReduce) to implement the PageRank algorithm.
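Conceptually, each iteration redistributes every page's rank evenly across its outgoing links and sums the incoming shares. A minimal local Python sketch of one such step over the transition and pr file formats above (plain PageRank without a damping factor, matching pr0's uniform initial value; whether the Hadoop job applies damping is not shown here):

def pagerank_step(transition_path, pr_path):
    # load current ranks: lines of "id\tvalue"
    pr = {}
    with open(pr_path) as f:
        for line in f:
            page, value = line.split('\t')
            pr[page] = float(value)

    # each page sends pr[src] / len(targets) to every target
    new_pr = dict.fromkeys(pr, 0.0)
    with open(transition_path) as f:
        for line in f:
            src, targets = line.rstrip('\n').split('\t')
            target_ids = targets.split(',')
            share = pr.get(src, 0.0) / len(target_ids)
            for dst in target_ids:
                new_pr[dst] = new_pr.get(dst, 0.0) + share
    return new_pr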

  1. Compile and package the Hadoop project with Maven:
cd hadoop
mvn package -DskipTests
  2. Put the transition, pr0, and keyvalue files produced by the spider into HDFS:
hdfs dfs -mkdir -p pagerank/transition/
hdfs dfs -put ../scrapy/transition pagerank/transition/
hdfs dfs -mkdir -p pagerank/pagerank/pr0
hdfs dfs -put ../scrapy/pr0 pagerank/pagerank/pr0/
hdfs dfs -mkdir -p pagerank/keyvalue/
hdfs dfs -put ../scrapy/keyvalue pagerank/keyvalue/
hdfs dfs -mkdir -p pagerank/cache
  3. Set up MySQL:
  • Create the MySQL database and output table:
mysql -uroot -p
create database PageRank;
use PageRank;
create table output(page_id VARCHAR(250), page_url VARCHAR(250), page_rank DOUBLE);
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
FLUSH PRIVILEGES;
  • Download the MySQL connector JAR and put it into HDFS:
hdfs dfs -mkdir /mysql 
hdfs dfs -put mysql-connector-java-*.jar /mysql/
  4. Run the Hadoop project:
hadoop jar page-rank.jar Driver pagerank/transition pagerank/pagerank/pr pagerank/keyvalue pagerank/cache/unit 10
  • Driver is the entry class
  • pagerank/transition is the directory of transition in HDFS
  • pagerank/pagerank/pr is the path prefix of the PageRank value directories (pr0, pr1, ...) in HDFS
  • pagerank/keyvalue is the directory of keyvalue in HDFS
  • pagerank/cache/unit is the path prefix for intermediate results
  • 10 is the number of iterations
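After the final iteration, the job fills the output table created above with (page_id, page_url, page_rank) rows. Conceptually, that is a join of keyvalue with the last pr file; a minimal local sketch of the same join (the function name is ours, and the real join happens inside the Hadoop job before writing to MySQL):

def join_results(keyvalue_path, pr_path):
    # map each page id back to its URL
    urls = {}
    with open(keyvalue_path) as f:
        for line in f:
            page_id, url = line.rstrip('\n').split('\t')
            urls[page_id] = url

    # pair every final rank with its URL, mirroring the output table schema
    rows = []
    with open(pr_path) as f:
        for line in f:
            page_id, rank = line.rstrip('\n').split('\t')
            rows.append((page_id, urls.get(page_id, ''), float(rank)))
    return rows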
