spider

🕷 A website spider application based on a proxy pool (supports HTTP & WebSocket)

Primary language: Python · License: MIT


Highly available proxy IP pool, highly concurrent request builder, and some practical applications

Keywords

  • Big-data storage
  • High-concurrency requests
  • WebSocket support
  • A method for defeating font obfuscation (font anti-crawling)
  • A method for compiling JavaScript
  • Several applications

Proxy pool

The proxy pool is the heart of this project.

  • Highly available proxy IP pool
    • Gathers proxies from free proxy websites such as Gatherproxy, Goubanjia, and Xici
    • Decodes Goubanjia's obfuscated port data
    • Quickly verifies IP availability
    • Works with Requests to assign proxy IPs automatically, with a retry mechanism and a write-failures-to-DB mechanism
    • Two models for the proxy shell:
      • model 1: load the Gatherproxy list && update the proxy list file (requires access over the GFW; put your http://gatherproxy.com credentials in proxy/data/passage, username on one line, password on the next)
      • model 0: update the proxy pool DB && test availability
    • One common proxy API:
      • from proxy.getproxy import GetFreeProxy
      • get_request_proxy = GetFreeProxy().get_request_proxy
      • get_request_proxy(url: str, types: int, data=None, test_func=None, header=None)
    • Also one common basic request API:
      • from util import basic_req
      • basic_req(url: str, types: int, proxies=None, data=None, header=None)
    • If you want to spider through a proxy:
      • Accessing the proxy website requires going over the GFW, so you may not be able to use model 1 to download the proxy file. Instead:
        1. download the proxy txt from http://gatherproxy.com
        2. cp download_file proxy/data/gatherproxy
        3. python proxy/getproxy.py --model=0
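The retry and fail-to-DB mechanism described above can be sketched roughly as below. This is a minimal self-contained sketch, not the project's real API: `TinyProxyPool` and the injected `fetch` function are stand-ins, and the real code persists failed proxies to a database rather than a set.

```python
import random

class TinyProxyPool:
    """Minimal sketch of the retry idea behind get_request_proxy."""

    def __init__(self, proxies):
        self.available = set(proxies)
        self.failed = set()  # the real project writes these back to the DB

    def request(self, url, fetch, retries=3):
        """Try up to `retries` proxies; quarantine the ones that fail."""
        for _ in range(retries):
            if not self.available:
                break
            proxy = random.choice(sorted(self.available))
            try:
                return fetch(url, proxy)
            except IOError:
                # proxy looks dead: remove it from rotation and record the failure
                self.available.discard(proxy)
                self.failed.add(proxy)
        return None
```

A caller would pass its real request function as `fetch`; a proxy that raises is dropped from rotation so later requests do not retry known-dead IPs.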

Application

Netease

  1. Netease Music song playlist crawl - netease/netease_music_db.py
  • Problem: big-data storage
  • classify -> playlist id -> song_detail
  • V1: write to file; one-shot run; no proxy; no progress-recording mechanism
  • V1.5: a small amount of proxy IPs
  • V2: proxy IP pool, progress recording, writes to MySQL
    • Optimized the DB writes with LOAD DATA / REPLACE INTO
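The `REPLACE INTO` optimization means batching rows in one statement instead of inserting one at a time, with duplicates overwritten rather than rejected. A rough illustration using sqlite3 (which also accepts `REPLACE INTO`); the `songs` table and its columns are made up for the sketch, not the project's real schema:

```python
import sqlite3

def batch_replace(conn, rows):
    """Write (song_id, name) rows in one batch; duplicate ids are overwritten."""
    conn.executemany("REPLACE INTO songs (song_id, name) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id INTEGER PRIMARY KEY, name TEXT)")
batch_replace(conn, [(1, "song a"), (2, "song b")])
batch_replace(conn, [(1, "song a v2"), (3, "song c")])  # id 1 replaced, not duplicated
```

This is why a crawl can be re-run after an interruption without producing duplicate rows.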

Press Test System

  1. Press Test System - press/press.py
  • Problem: high-concurrency requests
  • Uses the highly available proxy IP pool to impersonate real users
  • Puts uneven pressure on a web service
  • To do: uniform pressure
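The core of a press test is firing many requests concurrently. A minimal sketch with a thread pool; `hit` is a stand-in for a real request function that would go through the proxy pool:

```python
from concurrent.futures import ThreadPoolExecutor

def press(url, hit, workers=8, total=40):
    """Fire `total` requests at `url` with `workers` concurrent threads.

    `hit` is the request callable (here a stub); its results are collected
    so success/failure rates can be tallied afterwards.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: hit(url), range(total)))
```

Varying `workers` over time is one way to approach the "uniform pressure" to-do above.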

News

  1. Google & Baidu info crawl - news/news.py
  • Gets news from search engines through the proxy engine
  • One model: careful DOM analysis
  • The other model: rough analysis of the Chinese words
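The "rough" model can be as simple as pulling runs of Chinese characters straight out of the raw HTML instead of parsing the DOM. A minimal sketch (not the project's actual extraction code):

```python
import re

# runs of two or more CJK ideographs; single characters are usually noise
CJK_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")

def rough_words(html):
    """Extract candidate Chinese phrases without parsing the DOM."""
    return CJK_RUN.findall(html)
```

This trades precision for robustness: it keeps working when the search engine changes its DOM structure.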

Youdao Note

  1. Youdao Note documents crawl - buildmd/buildmd.py
  • Loads data from Youdao Cloud Note
  • Applies a series of rules to turn the data into .md

blog

  1. CSDN && Zhihu && Jianshu view-count crawl - blog/titleviews.py

Brush Class

  1. PKU class brush - brushclass/brushclass.py
  • When a class you want has open places, it sends you an email.
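The notification step can be sketched with the standard library. This is a hypothetical illustration: the course name, seat count, and address are placeholders, and actually sending would use `smtplib.SMTP(...).send_message(msg)` with your own server and credentials:

```python
from email.message import EmailMessage

def build_notice(course, seats, to_addr):
    """Build the alert mail sent when a watched course has free places."""
    msg = EmailMessage()
    msg["Subject"] = f"[brushclass] {course} has {seats} open place(s)"
    msg["To"] = to_addr
    msg.set_content(f"Hurry: {course} now shows {seats} open place(s).")
    return msg
```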

zimuzu

  1. ZiMuZu download list crawl - zimuzu/zimuzu.py
  • For when you want to download a show with many seasons, like Season 22 or Season 21.
  • Clicking the links one by one is boring, so zimuzu.py is all you need.
  • The only thing you have to do is wait for the program to finish.
  • Then copy the Thunder URLs it collects to download the episodes.
  • Winter is coming; you may need this to rewatch <Game of Thrones>.

Bilibili

  1. Get av data by HTTP - bilibili/bilibili.py
  • homepage rank -> check tids -> fetch data every 2 min (while on the rank, plus one day after)
  • Monitors every ranked av -> star count & basic data
  2. Get av data by WebSocket - bilibili/bsocket.py
  • Based on WebSocket
  • Byte-level frame analysis
  • Heartbeat
  3. Get comment data by HTTP - bilibili/bilibili.py
  • Loads comments from /x/v2/reply
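The "byte analysis" and "heartbeat" items refer to hand-packing binary WebSocket frames. As a hedged sketch, bilibili's WebSocket frames are commonly described as a 16-byte big-endian header (total length, header length, protocol version, operation, sequence) followed by the payload; the exact opcodes for this endpoint may differ from the widely documented live-danmaku layout assumed here:

```python
import struct

HEADER = ">IHHII"   # total length, header length, version, operation, sequence
HEARTBEAT_OP = 2    # heartbeat opcode in the commonly documented layout

def pack_frame(operation, payload=b"", version=1, sequence=1):
    """Prepend the 16-byte big-endian header to a payload."""
    header = struct.pack(HEADER, 16 + len(payload), 16, version, operation, sequence)
    return header + payload

def unpack_frame(frame):
    """Split a raw frame back into (operation, payload)."""
    total, header_len, version, operation, sequence = struct.unpack(HEADER, frame[:16])
    return operation, frame[header_len:total]
```

A heartbeat is then just `pack_frame(HEARTBEAT_OP, ...)` sent on a timer to keep the connection alive.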

shaoq

  1. Get text data by compiling JavaScript - exam/shaoq.py

More detail in the Design document section below.

eastmoney

  1. Get stock info by font analysis - eastmoney/eastmoney.py
  • Font analysis

More detail in the Design document section below.

----To be continued----

Development

All modules are based on proxy.getproxy, so it is very important.

Docker support is on the way.

$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip3 install -r requirement.txt

# using proxy pool
$ python proxy/getproxy.py --model=1         # model = 1: load the gatherproxy list (currently requires being able to reach Google, i.e. access over the GFW)
$ python proxy/getproxy.py --model=0         # model = 0: test the proxies

# in Python
>>> from proxy.getproxy import GetFreeProxy
>>> free_proxy = GetFreeProxy()
>>> free_proxy.gatherproxy(0)                # load HTTP proxies into the pool
>>> free_proxy.get_request_proxy(url, types) # request through a proxy

# proxy shell
$ python blog/titleviews.py --model=1 >> proxy.log 2>&1   # model = 1: load the gatherproxy list
$ python blog/titleviews.py --model=0 >> proxy.log 2>&1   # model = 0: update & test the proxy pool

Structure

.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py                // data analysis
│   ├── bilibili.py                // bilibili basic
│   └── bsocket.py                 // bilibili websocket
├── blog
│   └── titleviews.py              // Zhihu && CSDN && jianshu
├── brushclass
│   └── brushclass.py              // PKU elective
├── buildmd
│   └── buildmd.py                 // Youdao Note
├── eastmoney
│   └── eastmoney.py               // font analysis
├── exam
│   ├── shaoq.js                   // jsdom
│   └── shaoq.py                   // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py        // Netease Music
│   └── table.sql
├── news
│   └── news.py                    // Google && Baidu
├── press
│   └── press.py                   // Press test
├── proxy
│   ├── getproxy.py                // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py                  // zimuzu

Design document

exam.Shaoq

Idea

  1. Get a cookie
  2. Request the image
  3. Request again after 5.5 s
  4. Compile the JavaScript code -> get the CSS
  5. Analyze the CSS

Requirement

pip3 install PyExecJS
yarn add jsdom   # or: npm install jsdom (PS: not global)

Trouble Shooting

Can't get the real HTML
  • The wait time must be 5.5 s.

  • So use threading or await asyncio.gather to request the image concurrently.

  • See the asyncio docs: Coroutines and Tasks
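The concurrent-image-request idea can be sketched with asyncio.gather. The `fetch` coroutine is a stub, and the 5.5 s wait is shortened to keep the sketch fast:

```python
import asyncio

async def fetch_images(urls, fetch, delay=0.01):
    """Request all image URLs concurrently, then wait before the next step.

    In the real case the delay is 5.5 s; it is shortened here for illustration.
    """
    results = await asyncio.gather(*(fetch(u) for u in urls))
    await asyncio.sleep(delay)
    return results
```

Because the requests run concurrently, the total wall time stays close to the fixed delay rather than growing with the number of images.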

Error: Cannot find module 'jsdom'
  • jsdom must be installed locally, not globally.

Remove a subtree & edit a subtree & re.findall
  • subtree.extract()
  • subtree.string = new_string
  • parent_tree.find_all(re.compile('…'))

eastmoney.eastmoney

Idea

  1. Get data from the HTML -> JSON
  2. Get the font map -> transform the numbers
  3. Or load the font and analyze it (contrast with a base font)

Trouble Shooting

error: unpack requires a buffer of 20 bytes
How to analyze the font
  • Use fonttools
  • Get TTFont(font_path).getBestCmap()
  • Contrast with a base font
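"Contrast with a base font" means matching glyphs in a freshly downloaded font against a reference font whose digit glyphs are already known. Once both cmaps are extracted (e.g. via fonttools' `getBestCmap()`, which maps codepoints to glyph names), the matching itself is plain dict work. The cmap contents below are made-up stand-ins:

```python
def decode_digits(cmap, base_glyph_to_digit, text):
    """Translate obfuscated digit codepoints using a known base mapping.

    cmap:                codepoint -> glyph name, as from getBestCmap()
    base_glyph_to_digit: glyph name -> real digit, built once from the base font
    Characters not in the cmap (dots, minus signs, ...) pass through unchanged.
    """
    out = []
    for ch in text:
        glyph = cmap.get(ord(ch))
        out.append(base_glyph_to_digit.get(glyph, ch))
    return "".join(out)
```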
configure file
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)
  • Read/write in UTF-8:
  • with codecs.open(filename, 'r/w', encoding='utf-8')
Some bilibili URLs return 404, e.g. http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=

basic_req automatically adds Host to the headers, but this URL cannot be requested with a 'Host' header set.