This is a Python library for parsing XML sitemaps and sitemap index files from a given URL. It supports both standard XML sitemaps (which list URLs) and sitemap index files (which link to other sitemaps), and is useful for extracting data such as URLs and modification dates from website sitemaps.
This is a fork of Dave O'Connor's site-map-parser. I couldn't have done this without his original work.
- Sitemap Parsing: Extract URLs from standard sitemaps.
- Sitemap Index Parsing: Extract links to other sitemaps from sitemap index files.
- Caching: Uses Hishel to cache responses and reduce redundant requests.
- Handles Large Sitemaps: Capable of parsing large sitemaps and sitemap indexes efficiently.
- Customizable Caching Options: Enable or disable caching when downloading sitemaps.
You can install the library with Poetry or pip.
poetry add git+https://github.com/TheLovinator1/sitemap-parser.git
pip install git+https://github.com/TheLovinator1/sitemap-parser.git
The library provides a SiteMapParser class that can be used to parse sitemaps and sitemap indexes. You can pass a URL or raw XML data to the parser to extract the URLs or links to other sitemaps.
from sitemap_parser import SitemapIndex, SiteMapParser, UrlSet
url = "https://www.webhallen.com/sitemap.xml" # Sitemap index
# url = "https://www.webhallen.com/sitemap.infoPages.xml" # Sitemap with URLs
parser = SiteMapParser(source=url)
if parser.has_sitemaps():
    sitemaps: SitemapIndex = parser.get_sitemaps()
    for sitemap in sitemaps:
        print(sitemap)
elif parser.has_urls():
    urls: UrlSet = parser.get_urls()
    for url in urls:
        print(url)
from sitemap_parser import SiteMapParser, UrlSet
xml_data = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2023-09-27</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2023-09-27</lastmod>
<changefreq>daily</changefreq>
<priority>0.8</priority>
</url>
</urlset>
"""
parser = SiteMapParser(source=xml_data, is_data_string=True)
urls: UrlSet = parser.get_urls()
for url in urls:
    print(url)
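To make the difference between the two document types concrete, here is a minimal standard-library sketch of how a sitemap index can be told apart from a URL sitemap by its root element. This is not the library's implementation: `classify_sitemap`, `extract_locs`, and the sample documents are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_sitemap(xml_text: str) -> str:
    """Return 'index', 'urlset', or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        return "index"
    if root.tag == f"{{{SITEMAP_NS}}}urlset":
        return "urlset"
    return "unknown"


def extract_locs(xml_text: str) -> list[str]:
    """Collect the text of every <loc> element, whatever the document type."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]


# Hypothetical sample documents (not taken from a real site)
URLSET_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-09-27</lastmod></url>
  <url><loc>https://example.com/about</loc><lastmod>2023-09-27</lastmod></url>
</urlset>"""

INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

for document in (URLSET_XML, INDEX_XML):
    print(classify_sitemap(document), extract_locs(document))
```

In a URL sitemap each `<loc>` points at a page, while in a sitemap index each `<loc>` points at another sitemap that has to be fetched and parsed in turn.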
You can export the parsed sitemap data as a JSON string using the JSONExporter class.
import json
from pprint import pprint
from sitemap_parser import JSONExporter, SiteMapParser
parser = SiteMapParser(source="https://www.webhallen.com/sitemap.infoPages.xml")
exporter = JSONExporter(data=parser)
if parser.has_urls():
    # export_urls() returns a JSON string; decode it for pretty-printing
    urls_json: str = exporter.export_urls()
    pprint(json.loads(urls_json))
if parser.has_sitemaps():
    sitemaps_json: str = exporter.export_sitemaps()
    pprint(json.loads(sitemaps_json))
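The example above prints the decoded JSON. If the goal is a file on disk, the exported string can be written out with the standard library. The sketch below assumes only that the exporter returns a valid JSON string; `save_json`, the sample payload, and the output file name are hypothetical, and the real field names produced by the exporter may differ.

```python
import json
import tempfile
from pathlib import Path


def save_json(json_text: str, destination: Path) -> Path:
    """Validate a JSON string and write it to destination, pretty-printed."""
    parsed = json.loads(json_text)  # raises json.JSONDecodeError on invalid input
    destination.write_text(
        json.dumps(parsed, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return destination


# Stand-in for the string returned by exporter.export_urls();
# the actual structure is an assumption made for this sketch.
sample_json = '[{"loc": "https://example.com/", "lastmod": "2023-09-27"}]'

out_path = save_json(sample_json, Path(tempfile.gettempdir()) / "sitemap_urls.json")
print(out_path.read_text(encoding="utf-8"))
```

Decoding before writing costs a little extra work, but it catches malformed output early and produces an indented file that is easier to diff.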
The parser uses the Hishel library to cache responses by default. To turn caching off, pass should_cache=False when creating the SiteMapParser instance.
from sitemap_parser import SiteMapParser
sitemap_url = "https://www.webhallen.com/sitemap.xml"
parser = SiteMapParser(sitemap_url, should_cache=False)
Caching: You can also keep caching enabled and choose the directory where cached responses are stored.
Example:
from pathlib import Path
parser = SiteMapParser(sitemap_url, cache_dir=Path("/path/to/cache"))
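If you want to silence the library's log output, raise the level of the sitemap_parser logger to logging.CRITICAL. This suppresses every message below the CRITICAL level; to suppress CRITICAL messages as well, use a level above it, such as logging.CRITICAL + 1.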
Here's an example of how to do this:
import logging
# Raise the level to CRITICAL so that lower-severity messages are suppressed
logging.getLogger("sitemap_parser").setLevel(logging.CRITICAL)
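To see what each level actually filters, the following standard-library sketch attaches a small in-memory handler to the same logger name and emits a few test messages. `ListHandler` is a hypothetical helper written for this demonstration, and the messages are emitted by the demo itself rather than by the library.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("sitemap_parser")
handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo messages away from the root logger

# With the level at CRITICAL, lower-severity messages are dropped...
logger.setLevel(logging.CRITICAL)
logger.warning("dropped")
logger.critical("still emitted")

# ...and a level above CRITICAL silences the logger entirely.
logger.setLevel(logging.CRITICAL + 1)
logger.critical("dropped too")

print(handler.messages)  # ['CRITICAL: still emitted']
```

The level is set on the named logger only, so the application's own loggers keep their configured verbosity.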
Contributions are welcome! If you'd like to improve this project, feel free to submit a pull request. Please follow the guidelines below:
- Fork the Repository
- Create a New Branch
- Submit a Pull Request
This project is licensed under the MIT License.
If you have any questions or suggestions, please open an issue on the GitHub repository. You can also reach me via email at tlovinator@gmail.com or on Discord at TheLovinator#9276.
Happy parsing!