
DotnetSpider is a .NET Standard web crawling library: a lightweight, efficient, and fast high-level web crawling and scraping framework.

Primary language: C#. License: MIT.


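Disclaimer: Like the well-known Scrapy framework for Python, this framework exists only to help developers simplify their workflow and improve development efficiency. Do not use it for anything that violates national law. The authors of this framework bear no responsibility for anything users do with it.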

Member project of .NET Core Community.


To get the latest beta packages, add the MyGet feed to your NuGet package sources:

<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
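
For context, a complete NuGet.config carrying this feed might look like the sketch below. Only the myget.org entry comes from this README; the nuget.org entry is an assumption, added so that stable packages still resolve alongside the beta feed.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <!-- Assumed default feed, kept so stable packages still resolve -->
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
    <!-- Beta feed taken from this README -->
    <add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
  </packageSources>
</configuration>
```

Placing the file next to the solution file scopes the extra feed to that solution rather than to the whole machine.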




  1. Visual Studio 2017 (15.3 or later) or JetBrains Rider

  2. .NET Core 2.2 or later

  3. Docker

  4. MySQL

     docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
  5. Redis (optional)

     docker run --name redis -d -p 6379:6379 --restart always redis
  6. SQL Server

     docker run --name sqlserver -d -p 1433:1433 --restart always -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
  7. PostgreSQL (optional)

     docker run --name postgres -d -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
  8. MongoDB (optional)

     docker run --name mongo -d -p 27017:27017 --restart always mongo
  9. Kafka

    docker run -d --restart always --name kafka-dev -p 2181:2181 -p 3030:3030 -p 8081-8083:8081-8083 \
           -p 9581-9585:9581-9585 -p 9092:9092 -e ADV_HOST= \
  10. Docker remote API for Mac

    docker run -d --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
  11. HBase

    docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase




See the DotnetSpider.Sample project in the solution.


Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

public class EntitySpider : Spider
{
	public EntitySpider(SpiderParameters parameters) : base(parameters)
	{
	}

	protected override void Initialize()
	{
		Scheduler = new QueueDistinctBfsScheduler();
		Speed = 1;
		Depth = 3;
		// Parse each page into CnblogsEntry entities, then store them
		AddDataFlow(new DataParser<CnblogsEntry>()).AddDataFlow(GetDefaultStorage());
		// Seed requests; "网站" means "website" and "博客园" is the site name (Cnblogs).
		// The opening line of this call is missing in the source; AddRequests is assumed.
		AddRequests(
			new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string> {{"网站", "博客园"}}),
			new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string> {{"网站", "博客园"}}));
	}

	// "类别" means "category"; it is read back below as an environment value
	[Schema("cnblogs", "news")]
	[EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
	[GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
	[FollowSelector(XPaths = new[] {"//div[@class='pager']"})]
	public class CnblogsEntry : EntityBase<CnblogsEntry>
	{
		protected override void Configure()
		{
			HasIndex(x => x.Title);
			// Unique index on (WebSite, Guid)
			HasIndex(x => new {x.WebSite, x.Guid}, true);
		}

		public int Id { get; set; }

		[ValueSelector(Expression = "类别", Type = SelectorType.Enviroment)]
		public string Category { get; set; }

		[ValueSelector(Expression = "网站", Type = SelectorType.Enviroment)]
		public string WebSite { get; set; }

		// Strip the " - 博客园" site suffix from the page title
		[ValueSelector(Expression = "//title")]
		[ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
		public string Title { get; set; }

		[ValueSelector(Expression = "GUID", Type = SelectorType.Enviroment)]
		public string Guid { get; set; }

		[ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
		public string News { get; set; }

		[ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
		public string Url { get; set; }

		[ValueSelector(Expression = ".//div[@class='entry_summary']", ValueOption = ValueOption.InnerText)]
		public string PlainText { get; set; }

		[ValueSelector(Expression = "DATETIME", Type = SelectorType.Enviroment)]
		public DateTime CreationTime { get; set; }
	}
}

Distributed spider

Read this document

WebDriver Support

To collect a page whose content is loaded by JavaScript, you only need to set the downloader to WebDriverDownloader.

Downloader = new WebDriverDownloader(Browser.Chrome);


  1. Make sure ChromeDriver.exe is in the bin folder when using Chrome; install it into your project from NuGet: Chromium.ChromeDriver
  2. Make sure you have already added a *.webdriver Firefox profile when using Firefox: https://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
  3. Make sure PhantomJS.exe is in the bin folder when using PhantomJS; install it into your project from NuGet: PhantomJS


When you use the Redis scheduler, update your Redis configuration:

timeout 0
tcp-keepalive 60



QQ Group: 477731655 Email: zlzforever@163.com