XYWYCrawler

A crawler that collects the data for all questions on xywy (URLs like http://club.xywy.com/keshi/2017-02-09/1.html).

Primary language: Python. License: MIT.

#XYWYCrawler, crawler in action!

Description: This application collects data from a website (question lists organized by day) that holds more than 100 million records, so several strategies are necessary to ensure that all the data can be crawled in an acceptable time. The strategies are as follows:

Strategies

  • Multithreading
  • Multiprocessing
  • Redis as the task queue
  • RPC to share the message source
  • DBHelper to maintain a connection pool
  • Message consumers running on 4 machines
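
As a rough illustration of how the task queue and the multithreaded consumers fit together, here is a minimal sketch, not the project's actual code. It builds list-page URLs from the day-based pattern shown above and drains them with worker threads. The names `list_page_urls`, `run_workers`, and `pages_per_day` are hypothetical, and an in-process `queue.Queue` stands in for Redis so that the example runs without a server. The real number of pages per day is not stated here, so it is passed in as a parameter.

```python
import queue
import threading
from datetime import date, timedelta

# URL pattern taken from the example link in this README.
BASE_URL = "http://club.xywy.com/keshi/{day}/{page}.html"


def list_page_urls(start, end, pages_per_day):
    """Yield list-page URLs for every day from start to end, inclusive.

    pages_per_day is an assumption; the real page count varies by day.
    """
    day = start
    while day <= end:
        for page in range(1, pages_per_day + 1):
            yield BASE_URL.format(day=day.isoformat(), page=page)
        day += timedelta(days=1)


def run_workers(urls, handler, num_threads=4):
    """Drain a task queue with worker threads and return the handler results."""
    tasks = queue.Queue()  # in-process stand-in for the Redis task queue
    for url in urls:
        tasks.put(url)

    results = []
    lock = threading.Lock()  # protects the shared results list

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue is empty, so this worker is done
            try:
                outcome = handler(url)
                with lock:
                    results.append(outcome)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    urls = list(list_page_urls(date(2017, 2, 9), date(2017, 2, 10), pages_per_day=2))
    # The handler here only echoes the URL; a real one would fetch and parse the page.
    print(run_workers(urls, handler=lambda url: url))
```

In a distributed setup, the in-process queue would be replaced by a shared Redis list so that consumers on several machines can pull from the same source, and each machine could run several such worker pools in separate processes.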

FAQ

Feel free to contact me at hit_oak_tree@126.com to discuss any questions.