
GenUrls

A Python tool to crawl a website for all linked URLs, similar to Screaming Frog but without the bloat.

It crawls the website and follows links found within the HTML. A config file sets the allowed domains, robots.txt compliance, blacklisted URLs/snippets, and the output file.

There is no guarantee of finding every page/post/URL on a website; if a page is not linked anywhere on the site, this bot will not find it (targeted marketing landing pages, for example).

There are no recursion limits, and CONCURRENT_REQUESTS is set to 32 by default.
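
These defaults map onto Scrapy's standard settings; the sketch below shows how they could be tuned, assuming you override the project's Scrapy settings (DEPTH_LIMIT and CONCURRENT_REQUESTS are Scrapy's own setting names; the values are illustrative):

    # Scrapy settings equivalent to this project's defaults.
    # DEPTH_LIMIT = 0 is Scrapy's "no recursion limit".
    DEPTH_LIMIT = 0
    # How many requests may be in flight at once; lower to be gentler on a site.
    CONCURRENT_REQUESTS = 32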

Requirements


  • Python3
  • venv :: sudo apt install python3.8-venv
  • Scrapy :: handled in requirements.txt

Install


Windows users are assumed to be using WSL

  • $ git clone git@github.com:techb/GenUrls.git
  • $ cd GenUrls
  • $ python3 -m venv venv
  • $ source venv/bin/activate
  • (venv)$ pip install -r requirements.txt

Settings


  • ./src/GenUrls/crawl_config.py (an example sketch follows this list)
    • DOMAINS :: allowed domains to crawl
    • ENTRYURLS :: URLs where the spider starts
      • the sitemap and/or homepage is a good starting point
    • DENY :: strings, URLs, or paths to blacklist
    • LOG_LEVEL :: level of terminal output while actively crawling
    • FEED_FORMAT :: results file format, csv or json
    • FEED_URI :: path and file name of the results; defaults to the same dir with RESULTS as the name
    • OBEY :: obey the robots.txt file; True obeys it, False ignores it
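
Putting those together, a crawl_config.py could look like the sketch below. The variable names come from the list above; the domains, paths, and values are placeholders, not project defaults:

    # ./src/GenUrls/crawl_config.py -- example values only
    DOMAINS = ["example.com", "www.example.com"]  # allowed domains to crawl
    ENTRYURLS = ["https://example.com/"]          # spider entry point(s)
    DENY = ["logout", "/wp-admin"]                # blacklisted strings/paths
    LOG_LEVEL = "INFO"                            # terminal verbosity while crawling
    FEED_FORMAT = "csv"                           # or "json"
    FEED_URI = "RESULTS.csv"                      # results path/file name
    OBEY = True                                   # honor robots.txt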

Run


  • (venv)$ cd src
  • (venv)$ vim GenUrls/crawl_config.py
    • update the config file, save and exit
  • (venv)$ python run_spider.py
  • Output will be in your chosen format (a sample follows this list)
    • start_url :: the found URL; in the case of a redirect, this is the URL that was redirected
    • redirect :: the final redirected URL; null/None if no redirect happened
    • status :: HTTP status code of the start_url
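
For example, with FEED_FORMAT set to csv, a results file could look like this (URLs and statuses invented for illustration):

    start_url,redirect,status
    https://example.com/old-page,https://example.com/new-page,301
    https://example.com/about,,200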

Dev


Besides config, most dev happens in ./src/GenUrls/spiders/genurls_bot.py.

GenUrlsBot

Driving class of the crawling spider; a skeletal sketch follows the property list below.

Properties

  • name :: The spider's name
  • allowed_domains :: List of domains that the spider can crawl.
    • ./src/GenUrls/crawl_config.py :: DOMAINS
  • start_urls :: List of entry point URLs to start the crawl.
    • ./src/GenUrls/crawl_config.py :: ENTRYURLS
  • handle_httpstatus_list :: List of HTTP status codes to watch for.
  • crawled_data :: List of dicts holding all crawled URLs and class data.
  • rules :: List of Scrapy Rules containing a LinkExtractor, callback, follow, and deny.
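
The sketch below shows how those properties typically hang together in a Scrapy CrawlSpider. It is for orientation only, not the actual source: the spider name, callback name, status list, and redirect bookkeeping are assumptions (Scrapy's RedirectMiddleware records the redirect chain in response.meta["redirect_urls"]):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    from GenUrls import crawl_config


    class GenUrlsBot(CrawlSpider):
        name = "genurls"                               # spider's name (assumed value)
        allowed_domains = crawl_config.DOMAINS         # domains the spider may crawl
        start_urls = crawl_config.ENTRYURLS            # entry points for the crawl
        handle_httpstatus_list = [301, 302, 404, 500]  # statuses to watch (assumed list)
        crawled_data = []                              # dicts for every crawled URL

        rules = (
            Rule(
                LinkExtractor(deny=crawl_config.DENY),  # skip blacklisted patterns
                callback="parse_item",                  # assumed callback name
                follow=True,                            # keep following discovered links
            ),
        )

        def parse_item(self, response):
            # Record the found URL, any redirect target, and the HTTP status.
            redirects = response.meta.get("redirect_urls")
            self.crawled_data.append({
                "start_url": redirects[0] if redirects else response.url,
                "redirect": response.url if redirects else None,
                "status": response.status,
            })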