/pider

A elegant , powerful , modulized spider framework.

Primary LanguagePHPMIT LicenseMIT

Pider

996.icu LICENSE

Docmentation: English Chinese

Introduction

   Pider is an elegant,powerful,modulized,templatized spider framework.It aims to fertilize the php community and make easier for PHP deveoper to write a spider. view document in details.

Features

  • Crawler (Support)
  • Crawler with multi-process (Support)
  • Command line interface (Not full support )
  • Template crawling(Not support)
  • Well debug interface(Not support)
  • Data cleaning(Not full support)
  • Data visuliazation(Not support)
  • Distribution (Support)

Requirments

  • PHP >= 7.1
  • pthreads (optional for multi-threads support)
  • pcntl (optional for multi-processes support)
  • xmlreader (optional for XML file processing support)

Installation

Use docker

  There are a out-of-box docker environment for use. You can just pull and use it right away.

docker pull jhbian/pider
git clone https://github.com/duanqiaobb/pider
cd pider
composer install

Install into your laptop (only linux supported currently)

  You can run install.sh under root directory of the project whatever you prefer to install an environment into your laptop.

  git clone https://github.com/duanqiaobb/pider.git
  chmod u+x install.sh
  ./install.sh

Usage

     Hereinafter, I assume that you had set pider up ,not only environment but also the framework itself.

Basic Spider

    This spider crawles categories of product in index page of jd.com

cd pider
mkdir example
cd example && touch JdIndexCategorySpider.php
//In file JdIndexCategorySpider.php
<?php
use Pider\Spider;
use Pider\Http\Response;
    
class JdIndexCategorySpider extends Spider {
    protected $domains = 'www.jd.com';
    protected $urls = [
       'www.jd.com/'];
    
    //Parse data from response of requests
    public function parse(Response $response) {
        $response = $response->outputEncode('utf-8');
        $category_names = $response->xpath("//ul[contains(@class,'cate_menu')]/li/a/text()")->extract();
        var_dump($category_names);
    }
} 
../pider JdIndexCategorySpider.php
array(46) {
  [0] =>
  string(12) "家用电器"
  [1] =>
  string(6) "手机"
  [2] =>
  string(9) "运营商"
  [3] =>
  string(6) "数码"
  [4] =>
  string(6) "电脑"
  [5] =>
  string(6) "办公"
  [6] =>
  string(6) "家居"
  [7] =>
  string(6) "家具"
  ...
}

Spider with proxy

touch JdIndexCategoryWithProxySpider.php
use Pider\Spider;
use Pider\Http\Response;
    
class JdIndexCategoryWithProxySpider extends Spider {
    protected $domains = 'www.jd.com';
    protected $urls = [
       'www.jd.com/'];

    //Generate urls to be crawled
    public function start_requests():array {
        $std_urls = ['www.jd.com'];
        Request::proxy_handler(function(){
            return xxx(); //function return a proxy ip
        });
       return $std_urls; //url or Request object array
    }

    
    //Parse data from response of requests
    public function parse(Response $response) {
        $response = $response->outputEncode('utf-8');
        $category_names = $response->xpath("//ul[contains(@class,'cate_menu')]/li/a/text()")->extract();
        var_dump($category_names);
    }
} 
../pider Examplespider.php
array(46) {
  [0] =>
  string(12) "家用电器"
  [1] =>
  string(6) "手机"
  [2] =>
  string(9) "运营商"
  [3] =>
  string(6) "数码"
  [4] =>
  string(6) "电脑"
  [5] =>
  string(6) "办公"
  [6] =>
  string(6) "家居"
  [7] =>
  string(6) "家具"
  ...
}

Todo

Contribution

     If you have any ideas about this project, please don't hesitate to pull a request.