
[Crawler framework (Golang)] A concurrent crawler (spider) framework written in Go. The crawler is flexible and modular: it can easily be extended into a customized crawler, or used with only the default crawl components.


go_spider


A crawler for vertical communities, implemented in Go.


Latest stable Release: Version 1.0 (Sep 23, 2014).

  • go_spider discussion group, QQ group number: 337344607

Features

  • Concurrent
  • Suited to vertical communities
  • Flexible and modular
  • Native Go implementation
  • Easy to extend into a customized crawler

Requirements

  • Go 1.1 or higher

Documentation

Chinese documentation and FAQ.

Installation

go get github.com/hu17889/go_spider
go get github.com/PuerkitoBio/goquery
go get github.com/bitly/go-simplejson

This project depends on simplejson and goquery.

Usage example

Here is an example that crawls GitHub content. Try it to see the crawl process in action.

  • go install github.com/hu17889/go_spider/example/github_repo_page_processor
  • ./bin/github_repo_page_processor

More examples here: examples.

Make your spider

    // Spider input:
    //   a PageProcesser (the page-parsing module);
    //   a task name, used by the Pipeline to label records.
    spider.NewSpider(NewMyPageProcesser(), "TaskName").
        AddUrl("https://github.com/hu17889?tab=repositories", "html"). // Start URL; "html" is the response type ("html" or "json")
        AddPipeline(pipeline.NewPipelineConsole()).                    // Print results to the console
        SetThreadnum(3).                                               // Crawl with three goroutines
        Run()
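
To make the effect of SetThreadnum(3) more concrete, here is a minimal sketch of a fixed-size worker pool using only the standard library. It is not go_spider's actual implementation: `crawlAll` and `fetch` are hypothetical names invented for this illustration, and `fetch` stands in for a real download.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawlAll is a hypothetical helper, not part of go_spider. It processes
// every URL with a fixed number of worker goroutines and returns the
// collected results.
func crawlAll(urls []string, workers int, fetch func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string)

	// Start the workers; each one pulls URLs until the queue is closed.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- fetch(u)
			}
		}()
	}

	// Feed the queue, then close it so the workers can exit.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // completion order is nondeterministic, so sort for stable output
	return out
}

func main() {
	// Stand-in for an HTTP request (assumption for this sketch).
	fetch := func(u string) string { return "fetched " + u }
	for _, r := range crawlAll([]string{"a", "b", "c", "d"}, 3, fetch) {
		fmt.Println(r)
	}
}
```

With three workers, at most three calls to `fetch` run at the same time, which is the general idea behind limiting a crawl to a fixed number of goroutines.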

  • Use default modules
      • Downloader: HttpDownloader
      • Scheduler: QueueScheduler
      • Pipeline: PipelineConsole, PipelineFile

  • Use your modules

Just copy one of the default modules and modify it!

If you write a Downloader module, plug it in with Spider.SetDownloader(your_downloader).

If you write a Pipeline module, plug it in with Spider.AddPipeline(your_pipeline).

If you write a Scheduler module, plug it in with Spider.SetScheduler(your_scheduler).
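
The plug-in pattern described above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `Pipeline` interface, `Crawler`, `MemoryPipeline`, and `emit` are simplified, hypothetical names, and go_spider's real interfaces have different signatures. Check the default modules in the repository before writing your own.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pipeline is a simplified, hypothetical interface for this sketch only.
type Pipeline interface {
	Process(items map[string]string)
}

// MemoryPipeline is a custom module that keeps each record in memory
// instead of printing it.
type MemoryPipeline struct {
	Lines []string
}

// Process formats the fields in sorted key order and stores the line.
func (m *MemoryPipeline) Process(items map[string]string) {
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+items[k])
	}
	m.Lines = append(m.Lines, strings.Join(parts, ", "))
}

// Crawler is a hypothetical stand-in for a spider that accepts modules.
type Crawler struct {
	pipelines []Pipeline
}

// AddPipeline registers a module and returns the crawler so calls can be chained.
func (c *Crawler) AddPipeline(p Pipeline) *Crawler {
	c.pipelines = append(c.pipelines, p)
	return c
}

// emit hands one parsed record to every registered pipeline.
func (c *Crawler) emit(items map[string]string) {
	for _, p := range c.pipelines {
		p.Process(items)
	}
}

func main() {
	mem := &MemoryPipeline{}
	c := (&Crawler{}).AddPipeline(mem)
	c.emit(map[string]string{"project": "go_spider", "author": "hu17889"})
	fmt.Println(mem.Lines[0])
}
```

Because the crawler depends only on the interface, a custom module can replace or sit alongside a default one without changes to the crawl loop.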

License

go_spider is licensed under the Mozilla Public License Version 2.0.

Mozilla summarizes the license scope as follows:

MPL: The copyleft applies to any files containing MPLed code.

That means:

  • You can use the unchanged source code both privately and commercially
  • You needn't publish the source code of your library as long as the files licensed under the MPL 2.0 are unchanged
  • You must publish the source code of any changed files licensed under the MPL 2.0 under a) the MPL 2.0 itself or b) a compatible license (e.g. GPL 3.0 or Apache License 2.0)

Please read the MPL 2.0 FAQ if you have further questions regarding the license.

You can read the full terms here: LICENSE.