FinTech Spider

FinTech (i.e., Financial Technology)

"FinTech Spider" is a Scrapy-based spider that crawls large amounts of financial data from the Internet.

The data crawled by "FinTech Spider" has been used by 嗅金牛 and 数知源.

Only important dirs & files are listed here.

Directory/File	Author	Usage
README.md	lxw	The document for this project

Anti_Anti_Spider/	hee

Demo/		Some demonstrations (e.g. PhantomJS, proxies)
Demo/ArticleSpider/	hee
Demo/CNKI_Patent/	lxw	A demo Scrapy spider project that supports Selenium/PhantomJS/User-Agent/IP-Proxy
Demo/geetestcrack.py	hee
Demo/phantomjs_proxy.py	lxw	Add an IP proxy to PhantomJS
Demo/user_agent.txt	hee	A large list of User-Agent strings

Spiders/		Python scripts that crawl the data we need from the Internet
Spiders/CJODocIDSpider/	lxw	(w/ scrapy) Spiders for crawling data (case details) from 裁判文书网 (China Judgements Online)
Spiders/CJOSpider/	lxw	(w/ scrapy) Spiders for crawling data (basic info) from 裁判文书网 (China Judgements Online)
Spiders/CninfoSpider/	hee	Spiders for crawling data from 巨潮资讯 (Cninfo)
Spiders/CNKI_Patent_Spider/	lxw	(w/o scrapy) Spiders for crawling patent data from 知网 (CNKI)
Spiders/NECIPSSpider/	lxw	(w/ scrapy) Spiders for crawling data from 国家企业信用信息公示系统 (National Enterprise Credit Information Publicity System)
Spiders/new_three_board/	lxw	(w/ scrapy) Spiders for crawling data from 全国中小企业股份转让系统 (National Equities Exchange and Quotations)
Spiders/SBJSpider/	hee
Spiders/TYCSpider/	lxw	(w/ scrapy, PhantomJS) Spiders for crawling patent/copyright data from 天眼查 (Tianyancha)
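
As a rough illustration of what the Demo/phantomjs_proxy.py entry above describes, the sketch below builds the PhantomJS command-line flags that route traffic through an IP proxy. It is not taken from that script: the helper name `phantomjs_proxy_args` and the sample address are assumptions made for this example. Only the `--proxy`, `--proxy-type` and `--proxy-auth` flags are PhantomJS's own.

```python
def phantomjs_proxy_args(host, port, proxy_type="http", username=None, password=None):
    """Build PhantomJS CLI flags for a proxy (hypothetical helper, not from the repo)."""
    # PhantomJS accepts these three proxy types on the command line.
    if proxy_type not in ("http", "socks5", "none"):
        raise ValueError("unsupported proxy type: {}".format(proxy_type))
    args = [
        "--proxy={}:{}".format(host, port),
        "--proxy-type={}".format(proxy_type),
    ]
    # Credentials are optional; add them only when both parts are given.
    if username is not None and password is not None:
        args.append("--proxy-auth={}:{}".format(username, password))
    return args


if __name__ == "__main__":
    # 127.0.0.1:8080 is a placeholder address, not a real proxy.
    args = phantomjs_proxy_args("127.0.0.1", 8080)
    print(args)
    # With Selenium 3.x (PhantomJS support was removed in Selenium 4),
    # the flags would be handed to the driver like this:
    #   from selenium import webdriver
    #   driver = webdriver.PhantomJS(service_args=args)
```

Keeping the flag construction separate from driver creation makes it easy to swap proxies between requests without touching the browser setup code.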

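[Probably slightly better than rpush, but leaving this unchanged for now; any change seems likely to cause problems] Change the proxy acquisition strategy to lpop() + insert (at the sixth position) instead of lpop() + rpush()
[NO, in principle re-running CJOSpider.py alone should be enough] Add code to crawl the tasks in the Redis TASKS_HASH that have not finished crawling (there must be fewer than CONCURRENT_REQUESTS of them?)
[NO, in principle re-running CJODocIDSpider.py alone should be enough] Add code to crawl the tasks in the Redis DOC_ID_HASH that have not finished crawling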
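
The two proxy rotation strategies compared in the first note can be sketched as follows. This is only an illustration under stated assumptions: an in-memory `collections.deque` stands in for the Redis list, and the function names and proxy labels are invented for the example. In Redis itself, `LINSERT` positions an element relative to a pivot value rather than an index, so a real version would need an extra lookup step.

```python
from collections import deque


def take_proxy_rpush(pool):
    """lpop() + rpush(): use the head proxy, then send it to the very end."""
    proxy = pool.popleft()
    pool.append(proxy)
    return proxy


def take_proxy_insert(pool, position=5):
    """lpop() + insert: use the head proxy, then put it back at the sixth slot.

    `position` is 0-based, so 5 means the sixth position. If the pool is
    shorter than that, the proxy simply goes to the end.
    """
    proxy = pool.popleft()
    pool.insert(min(position, len(pool)), proxy)
    return proxy


if __name__ == "__main__":
    # Placeholder proxy labels; a real pool would hold "host:port" strings.
    names = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]

    pool_a = deque(names)
    take_proxy_rpush(pool_a)
    print(list(pool_a))  # p1 is now last

    pool_b = deque(names)
    take_proxy_insert(pool_b)
    print(list(pool_b))  # p1 is now in the sixth position
```

With the insert variant a proxy comes back into use after five other proxies have been tried, however long the pool is. With rpush() it waits for the whole pool to cycle.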

iaminblacklist/fintech_spider