spider

🕷 A website spider application based on a proxy pool (supports HTTP & WebSocket)

Primary language: Python · License: MIT


Highly available proxy IP pool, highly concurrent request builder, and some practical applications

Keywords

  • Big-data storage
  • High-concurrency requests
  • WebSocket support
  • A method for defeating font obfuscation (font anti-crawling)
  • A method for compiling JavaScript
  • Several applications

Proxy pool

The proxy pool is the heart of this project.

  • Highly available proxy IP pool
    • Gathers proxies from free proxy websites such as Gatherproxy, Goubanjia, and Xici
    • Decodes Goubanjia's obfuscated port data
    • Quickly verifies IP availability
    • Works with Requests to assign proxy IPs automatically, with a retry mechanism and a write-failures-to-DB mechanism
    • Two models for the proxy shell:
      • model 1: load the Gatherproxy list && update the proxy list file (requires access over the GFW; put your http://gatherproxy.com credentials in proxy/data/passage, username on one line, password on the next)
      • model 0: update the proxy pool DB && test availability
    • One common proxy API:
      • from proxy.getproxy import GetFreeProxy
      • get_request_proxy = GetFreeProxy().get_request_proxy
      • get_request_proxy(url: str, types: int, data=None, test_func=None, header=None)
    • Also one common basic request API:
      • from util import basic_req
      • basic_req(url: str, types: int, proxies=None, data=None, header=None)
    • If you want to spider through a proxy:
      • Accessing the proxy website requires going over the GFW, so you may not be able to use model 1 to download the proxy file. Instead:
        1. download the proxy txt from http://gatherproxy.com
        2. cp download_file proxy/data/gatherproxy
        3. python proxy/getproxy.py --model=0
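The retry and fail-to-DB mechanism described above can be sketched roughly as below. This is a minimal self-contained sketch, not the project's real API: `TinyProxyPool` and the injected `fetch` function are stand-ins, and the real code persists failed proxies to a database rather than a set.

```python
import random

class TinyProxyPool:
    """Minimal sketch of the retry idea behind get_request_proxy."""

    def __init__(self, proxies):
        self.available = set(proxies)
        self.failed = set()  # the real project writes these back to the DB

    def request(self, url, fetch, retries=3):
        """Try up to `retries` proxies; quarantine the ones that fail."""
        for _ in range(retries):
            if not self.available:
                break
            proxy = random.choice(sorted(self.available))
            try:
                return fetch(url, proxy)
            except IOError:
                # proxy looks dead: remove it from rotation and record the failure
                self.available.discard(proxy)
                self.failed.add(proxy)
        return None
```

A caller would pass its real request function as `fetch`; a proxy that raises is dropped from rotation so later requests do not retry known-dead IPs.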

Application

Netease

  1. Netease Music song playlist crawl - netease/netease_music_db.py
  • Problem: big-data storage
  • classify -> playlist id -> song_detail
  • V1: write to file; one-shot run; no proxy; no progress-recording mechanism
  • V1.5: a small amount of proxy IPs
  • V2: proxy IP pool, progress recording, writes to MySQL
    • Optimized the DB writes with LOAD DATA / REPLACE INTO
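The `REPLACE INTO` optimization means batching rows in one statement instead of inserting one at a time, with duplicates overwritten rather than rejected. A rough illustration using sqlite3 (which also accepts `REPLACE INTO`); the `songs` table and its columns are made up for the sketch, not the project's real schema:

```python
import sqlite3

def batch_replace(conn, rows):
    """Write (song_id, name) rows in one batch; duplicate ids are overwritten."""
    conn.executemany("REPLACE INTO songs (song_id, name) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id INTEGER PRIMARY KEY, name TEXT)")
batch_replace(conn, [(1, "song a"), (2, "song b")])
batch_replace(conn, [(1, "song a v2"), (3, "song c")])  # id 1 replaced, not duplicated
```

This is why a crawl can be re-run after an interruption without producing duplicate rows.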

Press Test System

  1. Press Test System - press/press.py
  • Problem: high-concurrency requests
  • Uses the highly available proxy IP pool to impersonate real users
  • Puts uneven pressure on a web service
  • To do: uniform pressure
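The core of a press test is firing many requests concurrently. A minimal sketch with a thread pool; `hit` is a stand-in for a real request function that would go through the proxy pool:

```python
from concurrent.futures import ThreadPoolExecutor

def press(url, hit, workers=8, total=40):
    """Fire `total` requests at `url` with `workers` concurrent threads.

    `hit` is the request callable (here a stub); its results are collected
    so success/failure rates can be tallied afterwards.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: hit(url), range(total)))
```

Varying `workers` over time is one way to approach the "uniform pressure" to-do above.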

News

  1. Google & Baidu info crawl - news/news.py
  • Gets news from search engines through the proxy engine
  • One model: careful DOM analysis
  • The other model: rough analysis of the Chinese words
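The "rough" model can be as simple as pulling runs of Chinese characters straight out of the raw HTML instead of parsing the DOM. A minimal sketch (not the project's actual extraction code):

```python
import re

# runs of two or more CJK ideographs; single characters are usually noise
CJK_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")

def rough_words(html):
    """Extract candidate Chinese phrases without parsing the DOM."""
    return CJK_RUN.findall(html)
```

This trades precision for robustness: it keeps working when the search engine changes its DOM structure.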

Youdao Note

  1. Youdao Note documents crawl - buildmd/buildmd.py
  • Loads data from Youdao Cloud Note
  • Applies a series of rules to turn the data into .md

blog

  1. CSDN && Zhihu && Jianshu view-count crawl - blog/titleviews.py

Brush Class

  1. PKU class brush - brushclass/brushclass.py
  • When a class you want has open places, it sends you an email.
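The notification step can be sketched with the standard library. This is a hypothetical illustration: the course name, seat count, and address are placeholders, and actually sending would use `smtplib.SMTP(...).send_message(msg)` with your own server and credentials:

```python
from email.message import EmailMessage

def build_notice(course, seats, to_addr):
    """Build the alert mail sent when a watched course has free places."""
    msg = EmailMessage()
    msg["Subject"] = f"[brushclass] {course} has {seats} open place(s)"
    msg["To"] = to_addr
    msg.set_content(f"Hurry: {course} now shows {seats} open place(s).")
    return msg
```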

zimuzu

  1. ZiMuZu download list crawl - zimuzu/zimuzu.py
  • For when you want to download a show with many seasons, like Season 22 or Season 21.
  • Clicking the links one by one is boring, so zimuzu.py is all you need.
  • The only thing you have to do is wait for the program to finish.
  • Then copy the Thunder URLs it collects to download the episodes.
  • Winter is coming; you may need this to rewatch <Game of Thrones>.

Bilibili

  1. Get av data by HTTP - bilibili/bilibili.py
  • homepage rank -> check tids -> fetch data every 2 min (while on the rank, plus one day after)
  • Monitors every ranked av -> star count & basic data
  2. Get av data by WebSocket - bilibili/bsocket.py
  • Based on WebSocket
  • Byte-level frame analysis
  • Heartbeat
  3. Get comment data by HTTP - bilibili/bilibili.py
  • Loads comments from /x/v2/reply
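The "byte analysis" and "heartbeat" items refer to hand-packing binary WebSocket frames. As a hedged sketch, bilibili's WebSocket frames are commonly described as a 16-byte big-endian header (total length, header length, protocol version, operation, sequence) followed by the payload; the exact opcodes for this endpoint may differ from the widely documented live-danmaku layout assumed here:

```python
import struct

HEADER = ">IHHII"   # total length, header length, version, operation, sequence
HEARTBEAT_OP = 2    # heartbeat opcode in the commonly documented layout

def pack_frame(operation, payload=b"", version=1, sequence=1):
    """Prepend the 16-byte big-endian header to a payload."""
    header = struct.pack(HEADER, 16 + len(payload), 16, version, operation, sequence)
    return header + payload

def unpack_frame(frame):
    """Split a raw frame back into (operation, payload)."""
    total, header_len, version, operation, sequence = struct.unpack(HEADER, frame[:16])
    return operation, frame[header_len:total]
```

A heartbeat is then just `pack_frame(HEARTBEAT_OP, ...)` sent on a timer to keep the connection alive.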

shaoq

  1. Get text data by compiling JavaScript - exam/shaoq.py

More detail in the Design document section below.

eastmoney

  1. Get stock info by font analysis - eastmoney/eastmoney.py
  • Font analysis

More detail in the Design document section below.

----To be continued----

Development

All modules are based on proxy.getproxy, so it is very important.

Docker support is on the way.

$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip3 install -r requirement.txt

# using proxy pool
$ python proxy/getproxy.py --model=1         # model = 1: load the gatherproxy list (currently requires being able to reach Google, i.e. access over the GFW)
$ python proxy/getproxy.py --model=0         # model = 0: test the proxies

# in Python
>>> from proxy.getproxy import GetFreeProxy
>>> free_proxy = GetFreeProxy()
>>> free_proxy.gatherproxy(0)                # load HTTP proxies into the pool
>>> free_proxy.get_request_proxy(url, types) # request through a proxy

# proxy shell
$ python blog/titleviews.py --model=1 >> proxy.log 2>&1   # model = 1: load the gatherproxy list
$ python blog/titleviews.py --model=0 >> proxy.log 2>&1   # model = 0: update & test the proxy pool

Structure

.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py                // data analysis
│   ├── bilibili.py                // bilibili basic
│   └── bsocket.py                 // bilibili websocket
├── blog
│   └── titleviews.py              // Zhihu && CSDN && jianshu
├── brushclass
│   └── brushclass.py              // PKU elective
├── buildmd
│   └── buildmd.py                 // Youdao Note
├── eastmoney
│   └── eastmoney.py               // font analysis
├── exam
│   ├── shaoq.js                   // jsdom
│   └── shaoq.py                   // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py        // Netease Music
│   └── table.sql
├── news
│   └── news.py                    // Google && Baidu
├── press
│   └── press.py                   // Press test
├── proxy
│   ├── getproxy.py                // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py                  // zimuzu

Design document

exam.Shaoq

Idea

  1. Get a cookie
  2. Request the image
  3. Request again after 5.5 s
  4. Compile the JavaScript code -> get the CSS
  5. Analyze the CSS

Requirement

pip3 install PyExecJS
yarn add jsdom   # or: npm install jsdom (PS: not global)

Trouble Shooting

Can't get the real HTML
  • The wait time must be 5.5 s.

  • So use threading or await asyncio.gather to request the image concurrently.

  • See the asyncio docs: Coroutines and Tasks
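The concurrent-image-request idea can be sketched with asyncio.gather. The `fetch` coroutine is a stub, and the 5.5 s wait is shortened to keep the sketch fast:

```python
import asyncio

async def fetch_images(urls, fetch, delay=0.01):
    """Request all image URLs concurrently, then wait before the next step.

    In the real case the delay is 5.5 s; it is shortened here for illustration.
    """
    results = await asyncio.gather(*(fetch(u) for u in urls))
    await asyncio.sleep(delay)
    return results
```

Because the requests run concurrently, the total wall time stays close to the fixed delay rather than growing with the number of images.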

Error: Cannot find module 'jsdom'
  • jsdom must be installed locally, not globally.

Remove a subtree & edit a subtree & re.findall
  • subtree.extract()
  • subtree.string = new_string
  • parent_tree.find_all(re.compile('…'))

eastmoney.eastmoney

Idea

  1. Get data from the HTML -> JSON
  2. Get the font map -> transform the numbers
  3. Or load the font and analyze it (contrast with a base font)

Trouble Shooting

error: unpack requires a buffer of 20 bytes
How to analyze the font
  • Use fonttools
  • Get TTFont(font_path).getBestCmap()
  • Contrast with a base font
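"Contrast with a base font" means matching glyphs in a freshly downloaded font against a reference font whose digit glyphs are already known. Once both cmaps are extracted (e.g. via fonttools' `getBestCmap()`, which maps codepoints to glyph names), the matching itself is plain dict work. The cmap contents below are made-up stand-ins:

```python
def decode_digits(cmap, base_glyph_to_digit, text):
    """Translate obfuscated digit codepoints using a known base mapping.

    cmap:                codepoint -> glyph name, as from getBestCmap()
    base_glyph_to_digit: glyph name -> real digit, built once from the base font
    Characters not in the cmap (dots, minus signs, ...) pass through unchanged.
    """
    out = []
    for ch in text:
        glyph = cmap.get(ord(ch))
        out.append(base_glyph_to_digit.get(glyph, ch))
    return "".join(out)
```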
configure file
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)
  • Read/write in UTF-8:
  • with codecs.open(filename, 'r/w', encoding='utf-8')
Some bilibili URLs return 404, e.g. http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=

basic_req automatically adds Host to the headers, but this URL cannot be requested with a 'Host' header set.