The sample demonstrates how to use Python to crawl Website pages via sitemap.xml and check broken links for every page.
pip install beautifulsoup4
- Read sitemap.xml of the target Website.
- Search for attribute 'href' to get all valid links in every page.
- Check link response code.
- Dump 404 error URLs to a text file.
- Specify sitemap:
request = build_request("http://kb.dynamsoft.com/sitemap.xml")
. python broken_links.py
.- Use
ctrl+c
to stop the program.