peterbencze/serritor

Further develop the Crawl Frontier

peterbencze opened this issue · 0 comments

Two new classes should be implemented:

  • CrawlRequest
    These objects are constructed by the frontier. When the frontier receives a CrawlResponse (see below), it should loop through the extracted URLs in the response and, for each URL that the crawler has not already visited (check its fingerprint), construct a CrawlRequest object containing the crawl depth and the URL as a String and add it to the priority queue (see below).
  • CrawlResponse
    These objects are constructed by the crawler. When the crawler extracts a list of URLs from a page, it should construct a CrawlResponse object with the crawl depth (the request's crawl depth + 1) and the list of extracted URLs (use the URL type), and pass it to the frontier.
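
The two classes above might be sketched as follows. The issue only fixes the class names and their contents (a crawl depth plus a String URL, and a crawl depth plus a list of `java.net.URL` objects); the accessor names and the `Comparable` implementation are illustrative assumptions:

```java
import java.net.URL;
import java.util.List;

// Sketch of the two proposed classes. Accessor names and the Comparable
// implementation are assumptions, not part of the issue text.
final class CrawlRequest implements Comparable<CrawlRequest> {
    private final String url; // the URL to crawl, stored as a String
    private final int depth;  // crawl depth of this request

    CrawlRequest(final String url, final int depth) {
        this.url = url;
        this.depth = depth;
    }

    String getUrl() {
        return url;
    }

    int getDepth() {
        return depth;
    }

    // Order requests by crawl depth so a PriorityQueue can sort them.
    @Override
    public int compareTo(final CrawlRequest other) {
        return Integer.compare(depth, other.depth);
    }
}

final class CrawlResponse {
    private final int depth;               // the request's crawl depth + 1
    private final List<URL> extractedUrls; // URLs extracted from the page

    CrawlResponse(final int depth, final List<URL> extractedUrls) {
        this.depth = depth;
        this.extractedUrls = extractedUrls;
    }

    int getDepth() {
        return depth;
    }

    List<URL> getExtractedUrls() {
        return extractedUrls;
    }
}
```

Making CrawlRequest implement Comparable is one way to let a PriorityQueue order the requests by depth; a separate Comparator (chosen from the configuration) would work just as well.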

The CrawlFrontier class should contain a priority queue (use PriorityQueue) of these requests, sorted by their crawl depth (according to the configuration). When the crawler asks the frontier whether it has a next request, the frontier should check whether the queue is empty. When the crawler asks for the next request, the frontier should remove the head of the queue (a CrawlRequest object) and return it to the crawler (PriorityQueue has a poll method which is perfectly suited for this).
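
A minimal, runnable demonstration of the queue behavior described above; DemoRequest stands in for the proposed CrawlRequest class, and ascending-by-depth is just one of the orderings the configuration might select:

```java
import java.util.PriorityQueue;

// Demonstrates the queue behavior: a PriorityQueue keeps requests ordered
// by crawl depth, and poll() removes the shallowest one first.
public class FrontierQueueDemo {
    // Stand-in for the proposed CrawlRequest class.
    record DemoRequest(String url, int depth) implements Comparable<DemoRequest> {
        @Override
        public int compareTo(final DemoRequest other) {
            return Integer.compare(depth, other.depth);
        }
    }

    public static void main(String[] args) {
        PriorityQueue<DemoRequest> queue = new PriorityQueue<>();
        queue.add(new DemoRequest("http://example.com/deep", 2));
        queue.add(new DemoRequest("http://example.com/", 0));
        queue.add(new DemoRequest("http://example.com/page", 1));

        // "Has next request" reduces to an emptiness check:
        while (!queue.isEmpty()) {
            // "Get next request" reduces to poll():
            System.out.println(queue.poll().depth()); // prints 0, then 1, then 2
        }
    }
}
```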

CrawlFrontier should be initialized with a list of URLs (seeds). For each of these URLs, a new CrawlRequest object is constructed and added to the priority queue. A fingerprint for the URL is also created and added to the list of fingerprints.
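
Putting the pieces together, the seed handling (plus the duplicate check from the CrawlRequest description) might look like the sketch below. The SHA-256 fingerprint, the feedResponse method name, and the use of a Set for fingerprints are assumptions; the issue only says a fingerprint is created and stored:

```java
import java.math.BigInteger;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of the proposed CrawlFrontier. Method names and the fingerprint
// scheme are illustrative assumptions.
final class CrawlFrontier {
    // Minimal stand-in for the CrawlRequest class described above.
    static final class CrawlRequest implements Comparable<CrawlRequest> {
        final String url;
        final int depth;

        CrawlRequest(final String url, final int depth) {
            this.url = url;
            this.depth = depth;
        }

        @Override
        public int compareTo(final CrawlRequest other) {
            return Integer.compare(depth, other.depth);
        }
    }

    private final PriorityQueue<CrawlRequest> requests = new PriorityQueue<>();
    private final Set<String> fingerprints = new HashSet<>();

    // For each seed, record its fingerprint and enqueue a depth-0 request.
    CrawlFrontier(final List<String> seeds) {
        for (String seed : seeds) {
            fingerprints.add(fingerprint(seed));
            requests.add(new CrawlRequest(seed, 0));
        }
    }

    // Called by the crawler with a response's depth and extracted URLs.
    void feedResponse(final int depth, final List<URL> extractedUrls) {
        for (URL url : extractedUrls) {
            // Set.add returns true only if the fingerprint was not seen
            // before, i.e. the URL has not already been visited.
            if (fingerprints.add(fingerprint(url.toString()))) {
                requests.add(new CrawlRequest(url.toString(), depth));
            }
        }
    }

    boolean hasNextRequest() {
        return !requests.isEmpty();
    }

    CrawlRequest getNextRequest() {
        return requests.poll();
    }

    // One possible fingerprint: SHA-256 of the URL string, hex-encoded.
    private static String fingerprint(final String url) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A hash-based fingerprint keeps the visited-URL store compact and makes the duplicate check O(1) on average, which matters once the frontier holds many URLs.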