Highly Available Proxy IP Pool, High-Concurrency Request Builder, and Some Practical Applications
- Keywords:
- Proxy pool
- Application
- Development
- Structure
- Design document
- Big data store
- High-concurrency requests
- WebSocket support
- Font obfuscation cracking
- JavaScript compilation
- Some applications
The proxy pool is the heart of this project.
Highly Available Proxy IP Pool
- Obtains data from Gatherproxy, Goubanjia, Xici, and other free proxy websites
- Analyzes Goubanjia's obfuscated port data
- Quickly verifies IP availability
- Cooperates with Requests to assign proxy IPs automatically, with a retry mechanism and a mechanism that records failures to the DB
- Two modes for the proxy shell:
  - model 1: load the Gatherproxy list && update the proxy list file (requires getting over the GFW; put your http://gatherproxy.com username and password in proxy/data/passage, username on one line, password on the next)
  - model 0: update the proxy pool DB && test availability
- One common proxy API:
from proxy.getproxy import GetFreeProxy
get_request_proxy = GetFreeProxy().get_request_proxy
get_request_proxy(url: str, types: int, data=None, test_func=None, header=None)
- And one common basic request API:
from util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None)
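A minimal usage sketch of the two call paths. The example URL, the meaning of `types=0` (assumed here to be a plain GET), and the `test_func` contract are assumptions, not documented behavior:

```python
from proxy.getproxy import GetFreeProxy
from util import basic_req

get_request_proxy = GetFreeProxy().get_request_proxy

def looks_ok(text):
    # assumed contract: test_func validates a response before the retry logic accepts it
    return text is not None and len(text) > 0

html = get_request_proxy('http://example.com', 0, test_func=looks_ok)  # through the proxy pool
raw = basic_req('http://example.com', 0)                               # direct request, no proxy
```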
- If you want to spider through the proxy pool:
  - Because accessing the proxy website requires getting over the GFW, you may not be able to use model 1 to download the proxy file. In that case:
    - download the proxy txt from http://gatherproxy.com
    - cp download_file proxy/data/gatherproxy
    - python proxy/getproxy.py --model=0
Netease Music song playlist crawl
- netease/netease_music_db.py
- Problem: big data store
- classify -> playlist id -> song_detail
- V1: write to file; one-run version; no proxy, no progress-recording mechanism
- V1.5: small pool of proxy IPs
- V2: proxy IP pool, progress recording, writes to MySQL
- Optimized DB writes with LOAD DATA / REPLACE INTO (a sketch follows)
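A minimal sketch of the two write paths using pymysql; the connection settings, table, and column names are placeholders, not the real schema from netease/table.sql:

```python
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='', db='netease',
                       local_infile=True, charset='utf8mb4')
with conn.cursor() as cur:
    # LOAD DATA: one bulk round trip instead of N single INSERTs
    cur.execute(
        "LOAD DATA LOCAL INFILE 'songs.csv' INTO TABLE song_detail "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'")
    # REPLACE INTO: overwrites on duplicate key, so interrupted runs can be replayed safely
    cur.execute("REPLACE INTO song_detail (song_id, song_name) VALUES (%s, %s)",
                (186016, 'Example Song'))
conn.commit()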
Press Test System
- press/press.py
- Problem: high-concurrency requests
- Uses the highly available proxy IP pool to impersonate real users
- Applies pressure to a web service (currently uneven)
- To do: uniform pressure (see the sketch below)
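A minimal sketch of the press loop, assuming `get_request_proxy` behaves as described above; the target URL, request count, and worker count are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from proxy.getproxy import GetFreeProxy

get_request_proxy = GetFreeProxy().get_request_proxy

def hit(url):
    # every worker draws a different proxy IP, so the target sees many "users"
    return get_request_proxy(url, 0)

with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(hit, ['http://example.com/ping'] * 500))
```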
Google & Baidu info crawl
- news/news.py
- Gets news from search engines through the proxy engine
- One mode: careful DOM analysis
- The other mode: rough analysis of Chinese words (sketched below)
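The rough mode can be approximated by pulling runs of CJK characters out of the raw HTML; this regex-based sketch is an assumption, not the exact rule set in news.py:

```python
import re

def rough_chinese_words(html: str, min_len: int = 4) -> list:
    # keep only runs of Chinese characters long enough to carry meaning
    return [w for w in re.findall(r'[\u4e00-\u9fa5]+', html) if len(w) >= min_len]

print(rough_chinese_words('<p>今天的新闻头条</p><span>ad</span>'))  # ['今天的新闻头条']
```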
Youdao Note documents crawl
- buildmd/buildmd.py
- Loads data from Youdao Cloud (youdaoyun)
- Applies a series of rules to transform the data into .md
CSDN && Zhihu && Jianshu view info crawl
- blog/titleviews.py
PKU Class brush
- brushclass/brushclass.py
- When your expected class has open places, it sends you an email (a sketch of the mail step follows).
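The notification step can be done with the standard library alone; the SMTP host, addresses, and credentials below are placeholders:

```python
import smtplib
from email.mime.text import MIMEText

def send_mail(subject: str, body: str):
    msg = MIMEText(body, 'plain', 'utf-8')
    msg['Subject'] = subject
    msg['From'] = 'bot@example.com'
    msg['To'] = 'you@example.com'
    with smtplib.SMTP('smtp.example.com', 587) as smtp:
        smtp.starttls()
        smtp.login('bot@example.com', 'app-password')
        smtp.send_message(msg)

send_mail('Class available!', 'Your expected class now has open places.')
```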
ZiMuZu download list crawl
- zimuzu/zimuzu.py
- For when you want to download a lot of a show, like Season 22 or Season 21.
- Clicking episode by episode is very boring, so zimuzu.py is all you need.
- The only thing you need to do is wait for the program to run,
- then copy the Thunder URLs one by one to download the episodes.
- Now that winter is coming, I think you may need it to review <Game of Thrones>.
Get av data by HTTP
- bilibili/bilibili.py
- homepage rank -> check tids -> re-check the data every 2 min (while on the rank list, plus one day)
- monitor every ranked av -> star num & basic data (polling sketch below)
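The 2-minute monitor reduces to a plain polling loop; `fetch_stat` is a hypothetical callable standing in for the project's stat request:

```python
import time

def monitor(av_ids, fetch_stat, interval=120):
    # fetch_stat would wrap get_request_proxy against the stat endpoint
    while av_ids:
        for av_id in list(av_ids):
            print(av_id, fetch_stat(av_id))  # star num & basic data
        time.sleep(interval)                 # re-check every 2 minutes
```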
Get av data by websocket
- bilibili/bsocket.py
- Based on WebSocket
- Byte-level analysis
- Heartbeat (a skeleton of the flow follows)
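A skeleton of the WebSocket flow using the websocket-client package; the endpoint URL and the packet layout are assumptions, not bilibili's real protocol:

```python
import struct
import threading
import time
import websocket  # pip3 install websocket-client

def on_open(ws):
    def beat():
        while ws.keep_running:
            # heartbeat packet: 4-byte big-endian length header (layout assumed)
            ws.send(struct.pack('>I', 4), opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(30)
    threading.Thread(target=beat, daemon=True).start()

def on_message(ws, raw):
    # byte analysis: length prefix, then the payload
    (length,) = struct.unpack('>I', raw[:4])
    print(length, raw[4:])

ws = websocket.WebSocketApp('wss://example.com/sub',
                            on_open=on_open, on_message=on_message)
ws.run_forever()
```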
Get comment data by HTTP
- bilibili/bilibili.py
- Loads comments from /x/v2/reply (example request below)
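Fetching one comment page could look like this; the oid/type/pn parameters follow bilibili's public reply API, and type=1 (video comments) is an assumption:

```python
import json
from proxy.getproxy import GetFreeProxy

get_request_proxy = GetFreeProxy().get_request_proxy

def load_comments(av_id: int, page: int = 1):
    # oid: the av id, pn: page number, type=1: video comments (assumed)
    url = 'https://api.bilibili.com/x/v2/reply?oid={}&type=1&pn={}'.format(av_id, page)
    text = get_request_proxy(url, 0)
    return json.loads(text)['data']['replies'] if text else []
```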
Get text data by compiling JavaScript
- exam/shaoq.py
Get stock info by font analysis
- eastmoney/eastmoney.py
- Font analysis
----To be continued----
All modules are based on proxy.getproxy, so it is very important.
Docker support is on the road.
$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip3 install -r requirement.txt
# using the proxy pool
$ python proxy/getproxy.py --model=1 # model = 1: load gather proxy (currently needs access to google)
$ python proxy/getproxy.py --model=0 # model = 0: test proxies
# in Python
>>> from proxy.getproxy import GetFreeProxy
>>> get_request_proxy = GetFreeProxy().get_request_proxy
>>> GetFreeProxy().gatherproxy(0) # load http proxies into the pool
>>> get_request_proxy(url, types) # send a request through the pool
# proxy shell
$ python blog/titleviews.py --model=1 >> proxy.log 2>&1 # model = 1: load gather model
$ python blog/titleviews.py --model=0 >> proxy.log 2>&1 # model = 0: update gather model
.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py // data analysis
│   ├── bilibili.py // bilibili basic
│   └── bsocket.py // bilibili websocket
├── blog
│   └── titleviews.py // Zhihu && CSDN && Jianshu
├── brushclass
│   └── brushclass.py // PKU elective
├── buildmd
│   └── buildmd.py // Youdao Note
├── eastmoney
│   └── eastmoney.py // font analysis
├── exam
│   ├── shaoq.js // jsdom
│   └── shaoq.py // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py // Netease Music
│   └── table.sql
├── news
│   └── news.py // Google && Baidu
├── press
│   └── press.py // Press test
├── proxy
│   ├── getproxy.py // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py // zimuzu
- get cookie
- request the image
- request again after 5.5 s
- compile the JavaScript code -> get the CSS
- analyze the CSS
pip3 install PyExecJS
yarn add jsdom # or: npm install jsdom (PS: local, not global)
- The wait time must be 5.5 s,
- so you can use threading or await asyncio.gather to request the images concurrently.
- jsdom must be installed locally, not globally (a compile-step sketch follows).
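The compile step with PyExecJS could look like this; exam/shaoq.js is from this repo, while the entry-point function name is hypothetical:

```python
import time
import execjs  # pip3 install PyExecJS; needs Node with a *local* jsdom install

with open('exam/shaoq.js', encoding='utf-8') as f:
    ctx = execjs.compile(f.read())

time.sleep(5.5)           # the server only answers after the 5.5 s wait
css = ctx.call('getCss')  # 'getCss' is an assumed entry-point name
```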
- Useful BeautifulSoup operations for rewriting the obfuscated DOM (example below):
  - subtree.extract() # remove a node from the tree
  - subtree.string = new_string # overwrite a node's text
  - parent_tree.find_all(re.compile(...)) # match tag names by regex (pattern elided)
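A self-contained demonstration of those three operations; the HTML and the regex pattern are illustrative only:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><span class="fake">9</span><p>3.1</p></div>', 'html.parser')
soup.find('span').extract()                  # drop the decoy node
for tag in soup.find_all(re.compile('^p')):  # a regex also matches tag names
    tag.string = '3.14'                      # overwrite the node's text in place
print(soup)                                  # <div><p>3.14</p></div>
```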
- get the data from the HTML -> json
- get the font map -> transform the numbers
- or load the font and analyze it (contrast with a base font)
- response.text -> str, response.content -> bytes
- use fontTools
- get TTFont(path)['cmap'].getBestCmap()
- contrast with the base font (decoding sketch below)
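With fontTools the decoding step could look like this; the font file names and the hand-labeled glyph table are placeholders:

```python
from fontTools.ttLib import TTFont

# glyph-name -> digit labels for the base font, done by hand once (placeholder values)
BASE_GLYPH_TO_DIGIT = {'uniE031': '0', 'uniE032': '1'}

cmap = TTFont('downloaded.woff')['cmap'].getBestCmap()  # codepoint -> glyph name

def decode(obfuscated: str) -> str:
    # codepoint -> glyph name -> real digit, by contrast with the base labels
    return ''.join(BASE_GLYPH_TO_DIGIT.get(cmap.get(ord(ch), ''), '?')
                   for ch in obfuscated)
```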
- Read the configuration file:
  - cfg = ConfigParser()
  - cfg.read(assign_path, 'utf-8') # the second argument is the encoding
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)
- read/write in utf-8:
- with codecs.open(filename, 'r', encoding='utf-8') # use 'w' for writing
bilibili
- Some URLs return 404, e.g. http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=
- basic_req automatically adds a Host field to the headers, but this URL cannot be requested with 'Host' set (workaround sketch below).
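A hedged workaround is simply to strip the Host key before sending; this sketch uses requests directly rather than basic_req's internals:

```python
import requests

def req_without_host(url: str, headers=None):
    headers = dict(headers or {})
    headers.pop('Host', None)  # this endpoint 404s when an explicit Host header is sent
    return requests.get(url, headers=headers)
```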