/SitemapCreator

A tool that scrapes a website for all links (local to the domain) and stores them in a sitemap file, ready to be uploaded to Google

Primary LanguagePython

SitemapCreator - a web page scraper in Python

About

SitemapCreator is a Python tool to scrape a domain for all active links that points to the same domain. All URLs are written to a sitemap file, which in turn is ready to be uploaded to Google for faster indexing of the domain.

Usage

Clone the repository:

$ git clone https://github.com/WelcomWeb/SitemapCreator.git

Run SitemapCreator and write all domain-local URLs of http://www.MyDomain.com to a file called MySitemap.txt:

$ cd SitemapCreator
$ python sitemap.creator.py http://www.mydomain.com MySitemap.txt

Argument list

sitemap.creator.py <HOST> <FILENAME> [TIMEOUT=False] [VERBOSE=False]

The optional arguments

Some servers are configured to block a high amount of hits from one and the same IP (to prevent DoS attacks), so by sending SitemapCreator a timeout value we can delay each request to prevent the tool from being blocked by the server. The specified timeout should be in seconds, so with the following parameters SitemapCreator delays each request by 0.5 seconds (500ms):

$ python sitemap.creator.py http://www.mydomain.com MySitemap.txt 0.5

By default SitemapCreator is silent, and doesn't tell you what it's doing - but by adding a parameter we can tell it to be verbose;

$ python sitemap.creator.py http://www.mydomain.com MySitemap.txt 0.5 true

githalytics.com alpha