/crawler

Primary language: Jupyter Notebook

basic knowledge

html

URL explained (Wikipedia)
W3C hyperlink attributes
About robots.txt
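
As a small illustration of the robots.txt link above, here is a minimal sketch of checking whether a URL may be crawled, using the standard library's `urllib.robotparser`. The rules in `ROBOTS_TXT`, the crawler name `my-crawler`, and the `example.com` URLs are made up for the example; a real crawler would load the site's own file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is an assumed user-agent name; it falls under the "*" rules.
allowed = parser.can_fetch("my-crawler", "https://example.com/page/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

Against a live site, `set_url()` followed by `read()` would replace the in-memory `parse()` call.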

selector

quotes_toscrape - practice website
Python Scrapy Tutorial-9-Extracting data w/CSS Selectors
Python Scrapy Tutorial-10-Extracting data w/XPATH
Easy Python Learning: Before Learning Web Scraping
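
The tutorials above use Scrapy's CSS and XPath selectors. To show the idea without installing anything, the sketch below uses the limited XPath subset in the standard library's `xml.etree.ElementTree` as a stand-in; it is not Scrapy's selector API. The markup in `MARKUP` is invented and only loosely modelled on a quotes page, and it has to be well-formed XML for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup; real pages are usually messier HTML.
MARKUP = """
<div>
  <div class="quote">
    <span class="text">Quote one</span>
    <small class="author">Author A</small>
  </div>
  <div class="quote">
    <span class="text">Quote two</span>
    <small class="author">Author B</small>
  </div>
</div>
"""


def extract_quotes(markup):
    """Return one dict per quote block, selected with XPath-style paths."""
    root = ET.fromstring(markup)
    results = []
    # .//div[@class='quote'] selects every descendant div with class="quote".
    for quote in root.findall(".//div[@class='quote']"):
        text = quote.find("./span[@class='text']").text
        author = quote.find("./small[@class='author']").text
        results.append({"text": text, "author": author})
    return results


quotes = extract_quotes(MARKUP)
print(quotes)
```

The same two-step pattern (select each container, then select fields relative to it) carries over to CSS or XPath selectors in a full scraping library.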

re

regexone
pythex
Common Regular Expression examples
regex101 - explains the matching process
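
For a quick feel of what the regex resources above practise, here is a minimal sketch with Python's `re` module that pulls ISO-style dates out of scraped text. The pattern, the helper name `find_dates`, and the sample string are assumptions chosen for illustration.

```python
import re

# Four digits, two digits, two digits, separated by hyphens (e.g. 2021-03-15).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_dates(text):
    """Return every substring of text that looks like YYYY-MM-DD."""
    return DATE_PATTERN.findall(text)


sample = "Posted 2021-03-15, updated 2021-04-02."
dates = find_dates(sample)
print(dates)
```

The pattern only checks the shape of the string, so it would also accept an impossible date such as 2021-99-99; validation would need a further step.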

tools for crawler

Chrome DevTools
Udemy: Chrome web debugging features demystified
google/robotstxt
Do-it-yourself Python web scraping: the Splash HTTP API
selectorgadget - google extension

multi-crawler

python-threading-multithreaded-programming
Python: Multithreading
Python Semaphore does not seem to work in Google Colab
Raymond Hettinger, Keynote on Concurrency, PyBay 2017
Thinking about Concurrency, Raymond Hettinger, Python core developer
What Is Async, How Does It Work, and When Should I Use It? (PyCon APAC 2014)
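
The threading and semaphore links above can be summarised in a small sketch: a thread pool runs the downloads, and a `threading.Semaphore` caps how many run at the same moment. The function `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, and the limits are arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 2  # arbitrary cap on simultaneous "requests"
_semaphore = threading.Semaphore(MAX_CONCURRENT)


def fake_fetch(url):
    """Stand-in for a download: waits briefly, then returns fake HTML."""
    with _semaphore:  # at most MAX_CONCURRENT threads run this block at once
        time.sleep(0.01)
        return f"<html>{url}</html>"


def crawl(urls, workers=4):
    """Fetch all URLs with a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, urls))


pages = crawl(["page-1", "page-2", "page-3"])
print(pages)
```

Threads suit this job because crawling mostly waits on I/O; the semaphore is one simple way to stay polite to the target server.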

proxy

4 great packet-capture tools for web crawlers
Free Proxy List
spys
Latest ** IP address proxy servers
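
Once a proxy list like the ones above is in hand, a crawler typically rotates through it. The sketch below shows one way to do that with the standard library's `urllib.request.ProxyHandler`. The addresses in `PROXIES` are placeholders, not working proxies, and `next_opener` is a hypothetical helper name.

```python
import itertools
import urllib.request

# Placeholder addresses; substitute entries from a real proxy list.
PROXIES = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]
_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return the next proxy in the rotation and an opener that uses it."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)


first_proxy, opener = next_opener()
second_proxy, _ = next_opener()
third_proxy, _ = next_opener()
print(first_proxy, second_proxy, third_proxy)
```

Calling `opener.open(url)` would then send the request through the chosen proxy; free proxies fail often, so real code would want timeouts and retries.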

Address to lat_lng

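Taiwan Electronic Map Service (台灣電子地圖服務網)
National Address Geocoding Service (全國門牌地址定位服務)
Water Resources Geographic Information Platform (水利地理資訊平台)
National Land Planning Geographic Information Map Platform (國土規劃地理資訊圖台)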

Word segmentation tools

CkipTagger
jieba-zh_TW

user-agent

fake-a-browser-visit
user_agent_string
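
To make the idea behind the user-agent links concrete, here is a minimal sketch of attaching a browser-like `User-Agent` header to a request with the standard library. The user-agent string and the `example.com` URL are assumptions for illustration, and nothing is sent over the network.

```python
import urllib.request

# Example browser-style string; copy a current one from your own browser.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Create a Request carrying a custom User-Agent header (not sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual request with that header.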