wikipedia-web-crawler

Web-Crawler + Focused Crawler

Crawling the documents:

Start with the following seed URL: http://en.wikipedia.org/wiki/Sustainable_energy; a Wikipedia article about green energy.
Crawler respect the politeness policy by using a delay of at least one second between HTTP requests.
Follow the links with the prefix http://en.wikipedia.org/wiki that lead to articles only (avoid administrative links containing : ). Non-English articles and external links must not be followed.
Crawl to depth 5. The seed page is the first URL in the frontier and thus counts for depth 1.
Stop once it’ve crawled 1000 unique URLs. Keep a list of these URLs in a text file. Also, keep the downloaded documents (raw html, in text format) with their respective URL for future tasks (transformation and indexing)

Focused Crawling:

Crawler should be able to consume two arguments: a URL and a keyword to be matched against text, anchor text, or text within a URL. Starting with the same seed in Task 1, crawl to depth 5 at most, using the keyword “solar”. Crawler return at most 1000 URLS for each of the following:

Breadth first crawling
Depth first crawling

Link Analysis and PageRank Implementation

// P is the set of all pages; |P| = N
// S is the set of sink nodes, i.e., pages that have no out links
// M(p) is the set (without duplicates) of pages that link to page p
// L(q) is the number of out-links (without duplicates) from page q
// d is the PageRank damping/teleportation factor; use d = 0.85 as a fairly typical value

foreach page p in P
  PR(p) = 1/N /* initial value */
while PageRank has not converged do
  sinkPR = 0
  foreach page p in S /* calculate total sink PR */
    sinkPR += PR(p)
  foreach page p in P
    newPR(p) = (1-d)/N /* teleportation */
    newPR(p) += d*sinkPR/N /* spread remaining sink PR evenly */
    foreach page q in M(p) /* pages pointing to p */
      newPR(p) += d*PR(q)/L(q) /* add share of PageRank from in-links */
    foreach page p
      PR(p) = newPR(p)
Return and output final PR score.

dangan249/wikipedia-web-crawler

wikipedia-web-crawler

Web-Crawler + Focused Crawler

Crawling the documents:

Focused Crawling:

Link Analysis and PageRank Implementation