FinTech (i.e. Financial Technology)
"FinTech Spider" is a Scrapy-based spider that crawls large volumes of financial data from the Internet.
The data crawled by "FinTech Spider" has been used by 嗅金牛 and 数知源.
Only important directories and files are listed here.
Directory/File | Author | Usage |
---|---|---|
README.md | lxw | The document for this project |
Anti_Anti_Spider/ | hee | |
Demo/ | | Some demonstrations (e.g. PhantomJS, proxies, etc.) |
Demo/ArticleSpider/ | hee | |
Demo/CNKI_Patent/ | lxw | A demo Scrapy spider project that supports Selenium/PhantomJS/User-Agent/IP-Proxy |
Demo/geetestcrack.py | hee | |
Demo/phantomjs_proxy.py | lxw | Adds an IP proxy to PhantomJS |
Demo/user_agent.txt | hee | A large number of User-Agents |
Spiders/ | | Python scripts that crawl the data we need from the Internet |
Spiders/CJODocIDSpider/ | lxw | (w/ scrapy) Spiders for crawling data (case details) from 裁判文书网 (China Judgements Online) |
Spiders/CJOSpider/ | lxw | (w/ scrapy) Spiders for crawling data (basic info) from 裁判文书网 (China Judgements Online) |
Spiders/CninfoSpider/ | hee | Spiders for crawling data from 巨潮资讯 |
Spiders/CNKI_Patent_Spider/ | lxw | (w/o scrapy) Spiders for crawling patent data from 知网 |
Spiders/NECIPSSpider/ | lxw | (w/ scrapy) Spiders for crawling data from 国家企业信用信息公示系统 (National Enterprise Credit Information Publicity System) |
Spiders/new_three_board/ | lxw | (w/ scrapy) Spiders for crawling data from 全国中小企业股份转让系统 |
Spiders/SBJSpider/ | hee | |
Spiders/TYCSpider/ | lxw | (w/ scrapy, PhantomJS) Spiders for crawling patent/copyright data from 天眼查 |
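- In README.md, update the usage of the key directories you commit (if a subdirectory contains key files, list them as well).
- CJOSpider: the CJOSpider architecture has a problem. URL deduplication has been turned off, so pages may be crawled more than once.
- [Possibly slightly better than rpush. Not changing this for now, since every change seems to introduce its own problems] Change the proxy acquisition strategy to lpop() + insert (at the sixth position) instead of lpop() + rpush().
- [NO, in principle re-running CJOSpider.py alone should be enough] Add code to crawl the tasks in the Redis TASKS_HASH that did not finish crawling (there must be fewer than CONCURRENT_REQUESTS of them?).
- [NO, in principle re-running CJODocIDSpider.py alone should be enough] Add code to crawl the tasks in the Redis DOC_ID_HASH that did not finish crawling.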
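
The two proxy rotation strategies mentioned in the list above (lpop() + rpush() versus lpop() + insert at the sixth position) can be compared with a small sketch. This is not the project's code: it uses a plain Python list as a stand-in for the Redis list, and the function names, the `proxyN` values, and the fixed pool of ten proxies are all assumptions made for illustration. Note also that Redis has no insert-at-index command (LINSERT places an element relative to a pivot value), so a real implementation would need its own way to pick the position.

```python
def take_proxy_rpush(pool):
    """lpop() + rpush(): take the head proxy and put it back at the tail."""
    proxy = pool.pop(0)
    pool.append(proxy)
    return proxy


def take_proxy_insert(pool, position=5):
    """lpop() + insert: take the head proxy and put it back at the sixth slot.

    `position` is a zero-based index, so 5 means the sixth position.
    """
    proxy = pool.pop(0)
    pool.insert(min(position, len(pool)), proxy)
    return proxy


# Hypothetical static pool of ten proxies (no proxies added or removed).
initial = [f"proxy{i}" for i in range(10)]

pool_a = list(initial)
used_rpush = [take_proxy_rpush(pool_a) for _ in range(12)]

pool_b = list(initial)
used_insert = [take_proxy_insert(pool_b) for _ in range(12)]

print("rpush :", used_rpush)
print("insert:", used_insert)
```

In this simplified model, lpop() + rpush() cycles through the whole pool before reusing a proxy, while lpop() + insert at the sixth position keeps reusing the first six proxies and never reaches the rest unless the head of the list changes for some other reason (for example, dead proxies being removed or fresh ones being pushed to the front). That trade-off may be part of why the item was left unchanged.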