/PyCrawler

A python web crawler

Primary LanguagePython

Setup

  • Open settings.py and adjust database settings
  • DATABASE_ENGINE can either be "mysql" or "sqlite"
  • For sqlite only DATABASE_HOST is used, and it should begin with a '/'
  • All other DATABASE_* settings are required for mysql
  • DEBUG mode causes the crawler to output some stats that are generated as it goes, and other debug messages
  • LOGGING is a dictConfig dictionary to log output to the console and a rotating file, and works out-of-the-box, but can be modified

Current State

  • mysql engine untested
  • Issue in some situations where the database is locked and queries cannot execute. Presumably an issue only with sqlite's file-based approach

Logging

  • DEBUG+ level messages are logged to the console, and INFO+ level messages are logged to a file.
  • By default, the file for logging uses a TimedRotatingFileHandler that rolls over at midnight
  • Setting DEBUG in the settings toggles wether or not DEBUG level messages are output at all
  • Setting USE_COLORS in the settings toggles whether or not messages output to the console use colors depending on the level.

Misc

  • Designed to be able to run on multiple machines and work together to collect info in central DB
  • Queues links into the database to be crawled. This means that any machine running the crawler with the central db can grab from the same queue. Reduces crawling redundancy.
  • Thread pool apprach to analyzing keywords in text.