Distributed Web Crawler

This is a web crawler implemented in C#. The main design goal of the project is to maximize throughput by making the crawler highly horizontally scalable.

This is a work in progress. Below is a list of features currently being implemented.

Current progress

The project currently contains only a console application, DistributedWebCrawler.Console, which is used for testing. It can be configured by modifying appsettings.json.
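
As a rough illustration of how the appsettings.json sections can drive the application, the sketch below binds one section to a strongly typed settings object using the standard Microsoft.Extensions.Configuration APIs. The SchedulerSettings class here is a hypothetical POCO that mirrors the "SchedulerSettings" section shown later in this README; the project's actual settings types may be named and shaped differently.

// Minimal sketch, not the project's actual code.
// Requires the Microsoft.Extensions.Configuration.Json and
// Microsoft.Extensions.Configuration.Binder packages.
using Microsoft.Extensions.Configuration;

// Hypothetical POCO mirroring the "SchedulerSettings" section below
public class SchedulerSettings
{
    public int MaxCrawlDepth { get; set; }
    public int MaxConcurrentRobotsRequests { get; set; }
    public bool RespectsRobotsTxt { get; set; }
    public int SameDomainCrawlDelayMillis { get; set; }
}

public static class SettingsExample
{
    public static void Main()
    {
        IConfiguration configuration = new ConfigurationBuilder()
            .SetBasePath(System.AppContext.BaseDirectory)
            .AddJsonFile("appsettings.json", optional: false)
            .Build();

        // Bind the "SchedulerSettings" section to the POCO above
        var scheduler = configuration.GetSection("SchedulerSettings").Get<SchedulerSettings>();
        System.Console.WriteLine($"MaxCrawlDepth = {scheduler?.MaxCrawlDepth}");
    }
}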

Features

  • Crawl multiple domains in parallel
    • Download content from multiple URIs in parallel
    • Parse links from multiple URIs in parallel
  • Rate limiting, to prevent the same domain from being hammered with requests (a rough sketch of the idea appears after this list)
  • Respect robots.txt
  • Pause/resume crawling
  • View/download crawl results
  • Limit the pages that get crawled based on:
    • Crawl depth
    • A list of included/excluded domains
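
To make the rate-limiting item above more concrete: one possible way to enforce a per-domain delay (such as the SameDomainCrawlDelayMillis setting below) is sketched here. This is an illustration only, assuming a DomainRateLimiter class that does not necessarily exist in the project; the actual implementation may differ.

// Illustrative sketch only; the DomainRateLimiter type is hypothetical.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class DomainRateLimiter
{
    private readonly TimeSpan _sameDomainDelay;
    private readonly ConcurrentDictionary<string, DateTimeOffset> _lastRequestTimes = new();

    public DomainRateLimiter(TimeSpan sameDomainDelay) => _sameDomainDelay = sameDomainDelay;

    // Waits until at least the configured delay has elapsed since the last
    // request to the given domain, then records the new request time.
    // Note: not fully race-free for concurrent callers hitting the same
    // domain; a real implementation would serialize access per domain.
    public async Task WaitForDomainAsync(string domain)
    {
        var now = DateTimeOffset.UtcNow;
        var lastRequest = _lastRequestTimes.GetOrAdd(domain, now - _sameDomainDelay);
        var nextAllowed = lastRequest + _sameDomainDelay;

        if (nextAllowed > now)
        {
            await Task.Delay(nextAllowed - now);
        }

        _lastRequestTimes[domain] = DateTimeOffset.UtcNow;
    }
}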

Config settings

This is the default appsettings.json with added comments.

{
  // The seeder will add a list of domains to crawl when the crawler first starts
  "SeederSettings": {
    // Can be set to either ReadFromConfig or ReadFromFile
    "Source": "ReadFromConfig"
    // These domains will be crawled if Source is set to ReadFromConfig
    "UrisToCrawl": [ "http://google.com/", "http://youtube.com/" ],
    // A list of domains from the file at this location will be used if Source is set to ReadFromFile
    "FilePath": "./path/to/file",
  },
  "IngesterSettings": {
    // Don't download content larger than this number of bytes
    "MaxContentLengthBytes": 1048576,
    // The maximum number of domains that can be downloaded from in parallel
    "MaxDomainsToCrawl": 50,
    // The number of HTTP 3xx redirects that will be followed before failing
    "MaxRedirects": 20
  },

  "SchedulerSettings": {
    // The maximum depth of links to follow from the initially requested page
    "MaxCrawlDepth": 5,
    // The maximum number of concurrent requests that can be made for robots.txt files
    "MaxConcurrentRobotsRequests": 50,
    // Will respect Allow, Disallow and CrawlDelay directives in robots.txt if set to true
    "RespectsRobotsTxt": true,
    // The number of milliseconds to wait between crawling pages from the same domain.
    // This will be overridden by a robots.txt Crawl-delay directive if one exists
    "SameDomainCrawlDelayMillis": 5000
  },

  "ParserSettings": {
    // The maximum number of threads to use to parse links from HTML documents
    // NB: This currently cannot use more than one thread
    "MaxConcurrentThreads": 1
  },

  "RobotsTxtSettings": {
    // The number of seconds to cache robots.txt (also used on failure to download robots.txt)
    // NB: 86400 seconds = 24 hours
    "CacheIntervalSeconds": 86400
  },

  // The settings used by the HttpClient used to download content
  "CrawlerClientSettings": {
    // The User-Agent header will be set to this
    "UserAgentString": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36",
    // The Accept-Language header will be set to this
    "AcceptLanguage": "en-US;q=0.9,en;q=0.8",
    // The HttpClient will time out requests after this many seconds
    "TimeoutSeconds": 30
  }
}
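
As a rough guide to how the CrawlerClientSettings values above map onto a standard HttpClient, here is a minimal sketch. The CrawlerClientFactory class and its parameters are assumptions for illustration, not the project's actual wiring.

// Minimal sketch, assuming values like those in CrawlerClientSettings above.
using System;
using System.Net.Http;

public static class CrawlerClientFactory
{
    public static HttpClient Create(string userAgent, string acceptLanguage, int timeoutSeconds, int maxRedirects)
    {
        var handler = new HttpClientHandler
        {
            // Redirect handling; a cap like MaxRedirects in IngesterSettings could be applied here
            AllowAutoRedirect = true,
            MaxAutomaticRedirections = maxRedirects
        };

        var client = new HttpClient(handler)
        {
            // Corresponds to TimeoutSeconds: requests are abandoned after this long
            Timeout = TimeSpan.FromSeconds(timeoutSeconds)
        };

        // Corresponds to UserAgentString and AcceptLanguage
        client.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent);
        client.DefaultRequestHeaders.AcceptLanguage.ParseAdd(acceptLanguage);

        return client;
    }
}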