PageRank

Use Scrapy to collect page-relationship information and build a PageRank dataset.

Use Hadoop and the dataset collected by Scrapy to implement the PageRank algorithm.

Collect the PageRank Dataset

We use Scrapy to collect the PageRank dataset. The related code is located in the scrapy/ directory.

Usage

  1. Install Scrapy first:
pip install scrapy
  2. Run Scrapy inside scrapy/:
cd scrapy
scrapy crawl pagerank
  3. Change start_urls and allowed_domains (optional)

The default is start_urls=['https://github.com/']; we can change it with:

scrapy crawl pagerank -a start='https://stackoverflow.com/'

We can also restrict the crawl to a single domain:

scrapy crawl pagerank -a start='https://stackoverflow.com/' -a domain='stackoverflow.com'

Be careful: domain must match start; if you don't set start, it must match the default start_urls.
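For reference, Scrapy passes each -a key=value pair to the spider's constructor as a keyword argument. A minimal sketch of a spider wired up this way (the actual spider in scrapy/ may differ; the parse logic here is an assumption):

import scrapy
from scrapy.linkextractors import LinkExtractor

class PageRankSpider(scrapy.Spider):
    name = 'pagerank'
    start_urls = ['https://github.com/']

    def __init__(self, start=None, domain=None, *args, **kwargs):
        # -a start=... and -a domain=... arrive here as kwargs
        super().__init__(*args, **kwargs)
        if start:
            self.start_urls = [start]
        if domain:
            self.allowed_domains = [domain]

    def parse(self, response):
        # follow every outgoing link; the collected edges feed the dataset
        for link in LinkExtractor().extract_links(response):
            yield response.follow(link, callback=self.parse)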

Dataset

The data collected by the spider is stored in three files: keyvalue, transition, and pr0.

keyvalue records each URL and its integer key, separated by '\t':

0\thttps://github.com/\n
1\thttps://github.com/features\n
2\thttps://github.com/business\n
3\thttps://github.com/explore\n
...

transition records the relationships between pages:

0\t0,1,2,3,4,5,6,7,8,9\n
...

The page with id 0 points to the pages with ids 0,1,2,3,4,5,6,7,8,9; the source id and its target list are separated by '\t', and the target ids are separated by ','.

pr0 records the initial PageRank value for each page:

0\t1\n
1\t1\n
2\t1\n
...
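For reference, a minimal sketch of an item pipeline that could emit these three files (the item fields 'url' and 'outlinks', and all names here, are assumptions, not the repository's actual code):

class PageRankPipeline:
    def open_spider(self, spider):
        self.ids = {}    # url -> integer key
        self.links = {}  # source id -> set of target ids

    def key_for(self, url):
        # assign integer keys in first-seen order
        if url not in self.ids:
            self.ids[url] = len(self.ids)
        return self.ids[url]

    def process_item(self, item, spider):
        src = self.key_for(item['url'])
        for out_url in item['outlinks']:
            self.links.setdefault(src, set()).add(self.key_for(out_url))
        return item

    def close_spider(self, spider):
        # keyvalue: "id\turl"
        with open('keyvalue', 'w') as f:
            for url, key in self.ids.items():
                f.write(f'{key}\t{url}\n')
        # transition: "src\tdst1,dst2,..."
        with open('transition', 'w') as f:
            for src, targets in self.links.items():
                f.write(f"{src}\t{','.join(map(str, sorted(targets)))}\n")
        # pr0: every page starts with rank 1
        with open('pr0', 'w') as f:
            for key in self.ids.values():
                f.write(f'{key}\t1\n')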

Later, we will put transition and pr0 into HDFS and use Hadoop to calculate PageRank.

Implement the PageRank Algorithm

We use Hadoop (MapReduce) to implement the PageRank algorithm.
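Conceptually, each iteration redistributes every page's rank evenly across its outgoing links and sums the incoming shares. A minimal local Python sketch of one such step over the transition and pr file formats above (plain PageRank without a damping factor, matching pr0's uniform initial value; whether the Hadoop job applies damping is not shown here):

def pagerank_step(transition_path, pr_path):
    # load current ranks: lines of "id\tvalue"
    pr = {}
    with open(pr_path) as f:
        for line in f:
            page, value = line.split('\t')
            pr[page] = float(value)

    # each page sends pr[src] / len(targets) to every target
    new_pr = dict.fromkeys(pr, 0.0)
    with open(transition_path) as f:
        for line in f:
            src, targets = line.rstrip('\n').split('\t')
            target_ids = targets.split(',')
            share = pr.get(src, 0.0) / len(target_ids)
            for dst in target_ids:
                new_pr[dst] = new_pr.get(dst, 0.0) + share
    return new_pr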

  1. Compile and package the Hadoop project with Maven:
cd hadoop
mvn package -DskipTests
  2. Put the transition, pr0, and keyvalue files produced by the spider into HDFS:
hdfs dfs -mkdir -p pagerank/transition/
hdfs dfs -put ../scrapy/transition pagerank/transition/
hdfs dfs -mkdir -p pagerank/pagerank/pr0
hdfs dfs -put ../scrapy/pr0 pagerank/pagerank/pr0/
hdfs dfs -mkdir -p pagerank/keyvalue/
hdfs dfs -put ../scrapy/keyvalue pagerank/keyvalue/
hdfs dfs -mkdir -p pagerank/cache
  3. Set up MySQL:
  • Create the MySQL database and output table:
mysql -uroot -p
create database PageRank;
use PageRank;
create table output(page_id VARCHAR(250), page_url VARCHAR(250), page_rank DOUBLE);
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
FLUSH PRIVILEGES;
  • Download the MySQL connector JAR and put it into HDFS:
hdfs dfs -mkdir /mysql 
hdfs dfs -put mysql-connector-java-*.jar /mysql/
  4. Run the Hadoop project:
hadoop jar page-rank.jar Driver pagerank/transition pagerank/pagerank/pr pagerank/keyvalue pagerank/cache/unit 10
  • Driver is the entry class
  • pagerank/transition is the directory of transition in HDFS
  • pagerank/pagerank/pr is the path prefix of the PageRank value directories (pr0, pr1, ...) in HDFS
  • pagerank/keyvalue is the directory of keyvalue in HDFS
  • pagerank/cache/unit is the path prefix for intermediate results
  • 10 is the number of iterations
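After the final iteration, the job fills the output table created above with (page_id, page_url, page_rank) rows. Conceptually, that is a join of keyvalue with the last pr file; a minimal local sketch of the same join (the function name is ours, and the real join happens inside the Hadoop job before writing to MySQL):

def join_results(keyvalue_path, pr_path):
    # map each page id back to its URL
    urls = {}
    with open(keyvalue_path) as f:
        for line in f:
            page_id, url = line.rstrip('\n').split('\t')
            urls[page_id] = url

    # pair every final rank with its URL, mirroring the output table schema
    rows = []
    with open(pr_path) as f:
        for line in f:
            page_id, rank = line.rstrip('\n').split('\t')
            rows.append((page_id, urls.get(page_id, ''), float(rank)))
    return rows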
