Create a web crawler that generates a sitemap for a given domain.
Create an application that takes a domain name as a parameter (e.g. google.com) and crawls the site, producing an XML sitemap.
Requirements:
- Written in Ruby, Python or Clojure.
- Visit each page only once.
- Extract links from ‘href’ attributes on ‘a’ tags.
- Ignore links to other domains.
- Ignore links to subdomains other than ‘www’.
- Produce an output file called ‘sitemap.xml’ that contains one entry per crawled page (see http://www.sitemaps.org/protocol.html).
- Each entry should contain a `<loc>` and a `<priority>` element.
- The `<loc>` tag should contain the absolute URL of the page.
- The `<priority>` tag should contain how many times the page was linked to by other pages, scaled to fit in the range 0 to 1 and rounded to 2 decimal places (see the sketch after this list).
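One way to satisfy the scaling requirement is to divide each page's inbound-link count by the largest count seen, which matches the example output below. This is only a sketch, assuming Python 2.7 and an `inbound` dict built during the crawl; the names are illustrative, not the crawler's actual API.

```python
# Sketch: scale inbound-link counts to priorities in the range 0..1.
# `inbound` maps each crawled URL to the number of pages linking to it.
def priorities(inbound):
    top = max(inbound.values()) or 1            # guard against all-zero counts
    return dict((url, round(count / float(top), 2))
                for url, count in inbound.items())

# e.g. priorities({'http://www.google.com/': 34,
#                  'http://www.google.com/about': 33})
# -> {'http://www.google.com/': 1.0, 'http://www.google.com/about': 0.97}
```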
Example Output
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.google.com/</loc>
<priority>1</priority>
</url>
<url>
<loc>http://www.google.com/about</loc>
<priority>0.97</priority>
</url>
.....
</urlset>
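Output in this shape can be produced with the standard library's `xml.etree.ElementTree`. Below is a minimal sketch (Python 2.7); `pages` (a mapping from absolute URL to computed priority) and `write_sitemap` are illustrative names, not part of the crawler package.

```python
# Sketch: write sitemap.xml using only the standard library.
import xml.etree.ElementTree as ET

def write_sitemap(pages, path='sitemap.xml'):
    urlset = ET.Element('urlset',
                        xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for url, priority in sorted(pages.items()):
        entry = ET.SubElement(urlset, 'url')
        ET.SubElement(entry, 'loc').text = url
        ET.SubElement(entry, 'priority').text = str(priority)
    ET.ElementTree(urlset).write(path, encoding='UTF-8', xml_declaration=True)
```

ElementTree writes the document on a single line rather than indented; the element content is the same as in the example above.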
No third-party libraries are allowed for the main functionality (web crawling, HTML parsing, writing XML).
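The standard library is enough for this: for instance, `HTMLParser` and `urlparse` (Python 2.7 module names) can handle link extraction and domain filtering. The class and function below are an illustrative sketch, not the crawler's actual API.

```python
# Sketch: extract <a href="..."> links and keep only same-site ones,
# using standard-library modules only.
from HTMLParser import HTMLParser
from urlparse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(urljoin(self.base_url, value))

def same_site(url, domain):
    """Accept URLs on the bare domain or its 'www' subdomain only."""
    host = urlparse(url).netloc.lower()
    return host in (domain, 'www.' + domain)
```

Feeding a page's HTML into `LinkExtractor.feed()` and filtering the collected links with `same_site` covers the href, other-domain and non-www subdomain requirements above; the pages themselves can be fetched with `urllib2`, which is also part of the standard library.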
Requires Python 2.7. Install the dependencies with:
pip install -r requirements.txt

Then, from the directory that contains this README.md:
python -m crawler --help
python -m crawler <domain> > sitemap.xml

Example:
python -m crawler http://proxybay.info/ > /tmp/sitemap.xml

To run the tests:
python -m crawler.tests