A Python tool to crawl a website for all linked URLs, similar to ScreamingFrog but without the bloat.
It crawls the website and follows links found within the HTML. A config file sets the allowed domains, whether to obey robots.txt, blacklisted URLs/snippets, and the output file.
There is no guarantee of finding every page/post/URL on a website: if a page is not linked anywhere on the site, this bot will not find it (targeted marketing landing pages, for example).
There are no recursion limits, and CONCURRENT_REQUESTS is set to 32 by default.
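In Scrapy terms those two defaults correspond to the standard settings sketched below; this is only an illustration of the Scrapy setting names, not a copy of this project's settings.

CONCURRENT_REQUESTS = 32   # Scrapy's built-in default is 16
DEPTH_LIMIT = 0            # 0 means no depth/recursion limit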
- Python3
- venv :: sudo apt install python3.8-venv
- Scrapy :: handled in requirements.txt
Windows users are assumed to be in WSL
$ git clone git@github.com:techb/GenUrls.git
$ cd GenUrls
$ python3 -m venv venv
$ source venv/bin/activate
(venv)$ pip install -r requirements.txt
./src/GenUrls/crawl_config.py
- DOMAINS :: allowed domains to crawl
- ENTRYURLS :: URLs where the spider starts
  - the sitemap and/or homepage is a good starting point
- DENY :: strings, URLs, or paths to blacklist
- LOG_LEVEL :: level of output in the terminal while actively crawling
- FEED_FORMAT :: results file format, csv or json
- FEED_URI :: path and file name of the results; defaults to the same directory with RESULTS as the name
- OBEY :: obey the robots.txt file; False will ignore it, True will obey
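A filled-in crawl_config.py might look like the sketch below. The option names are the ones listed above; every value (domains, URLs, paths) is only a placeholder, not taken from the project.

DOMAINS = ["example.com", "www.example.com"]      # allowed domains to crawl
ENTRYURLS = ["https://example.com/"]              # homepage and/or sitemap work well
DENY = ["logout", "/wp-admin/", "?replytocom="]   # blacklisted strings/URLs/paths
LOG_LEVEL = "INFO"                                # terminal verbosity while crawling
FEED_FORMAT = "csv"                               # csv or json
FEED_URI = "RESULTS.csv"                          # output path and file name
OBEY = True                                       # True obeys robots.txt, False ignores it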
(venv)$ cd src
(venv)$ vim GenUrls/crawl_config.py
- update the config file, save and exit
(venv)$ python run_spider.py
- Output will be in your chosen format
- start_url :: the found URL; in case of a redirect, this is the URL that was redirected
- redirect :: the final redirected URL; null/None if no redirect happened
- status :: HTTP status code of the start_url
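For example, with FEED_FORMAT set to json, two result records might look like this (the URLs are placeholders):

[
  {"start_url": "https://example.com/old-page", "redirect": "https://example.com/new-page", "status": 301},
  {"start_url": "https://example.com/about", "redirect": null, "status": 200}
]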
Besides the config, most dev happens in ./src/GenUrls/spiders/genurls_bot.py.
Driving class of the crawling spider.
- name :: the spider's name
- allowed_domains :: List of domains that the spider can crawl; set from DOMAINS in ./src/GenUrls/crawl_config.py
- start_urls :: List of entry-point URLs to start the crawl; set from ENTRYURLS in ./src/GenUrls/crawl_config.py
- handle_httpstatus_list :: List of HTTP status codes to watch for.
- crawled_data :: List of dicts holding all crawled URLs and class data.
- rules :: List of Scrapy Rules containing a LinkExtractor, callback, follow, and deny.
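A minimal sketch of how such a CrawlSpider could be wired together is below. The attribute names match the list above; the class name, status codes, config import path, and callback body are assumptions for illustration, not the actual genurls_bot.py.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from GenUrls import crawl_config  # assumed import path


class GenUrlsBot(CrawlSpider):
    name = "genurls_bot"
    allowed_domains = crawl_config.DOMAINS          # DOMAINS from the config
    start_urls = crawl_config.ENTRYURLS             # ENTRYURLS from the config
    handle_httpstatus_list = [301, 302, 404, 500]   # example status codes to watch for
    crawled_data = []                               # dicts of crawled URL data

    rules = (
        Rule(
            LinkExtractor(deny=crawl_config.DENY),  # skip blacklisted strings/paths
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Illustrative only: because redirect codes are in handle_httpstatus_list,
        # a 301/302 response reaches the spider directly and its Location header
        # holds the redirect target.
        location = response.headers.get("Location")
        item = {
            "start_url": response.url,
            "redirect": location.decode() if location else None,
            "status": response.status,
        }
        self.crawled_data.append(item)
        yield item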