This Python library is designed to scrape sitemaps from websites, providing a simple and efficient way to gather information about the structure of a website.

Sitemap Parser

This is a Python library designed to parse XML sitemaps and sitemap index files from a given URL. It supports both standard XML sitemaps (which contain URLs) and sitemap index files (which contain links to other sitemaps). This tool is useful for extracting data such as URLs and modification dates from website sitemaps.
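
To make the difference between the two document types concrete, here is a minimal sketch that uses only the standard library. It is not how this library works internally; the helper name classify_sitemap and the example.com documents are made up for illustration. It only inspects the root element, which is `<sitemapindex>` for an index and `<urlset>` for a regular sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_sitemap(xml_text: str) -> str:
    """Return "index", "urlset", or "unknown" based on the root element.

    Hypothetical helper for illustration, not part of sitemap_parser.
    """
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as "{namespace}localname".
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        return "index"
    if local_name == "urlset":
        return "urlset"
    return "unknown"


index_xml = (
    f'<sitemapindex xmlns="{SITEMAP_NS}">'
    "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
    "</sitemapindex>"
)
urlset_xml = (
    f'<urlset xmlns="{SITEMAP_NS}">'
    "<url><loc>https://example.com/</loc></url>"
    "</urlset>"
)

print(classify_sitemap(index_xml))
print(classify_sitemap(urlset_xml))
```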

Acknowledgments

This is a fork of Dave O'Connor's site-map-parser. I couldn't have done this without his original work.

Features

  • Sitemap Parsing: Extract URLs from standard sitemaps.
  • Sitemap Index Parsing: Extract links to other sitemaps from sitemap index files.
  • Supports Caching: Use Hishel for caching responses and reducing redundant requests.
  • Handles Large Sitemaps: Capable of parsing large sitemaps and sitemap indexes efficiently.
  • Customizable Caching Options: Option to enable or disable caching while downloading sitemaps.

Installation

You can install the library directly from GitHub with Poetry or pip.

poetry add git+https://github.com/TheLovinator1/sitemap-parser.git
pip install git+https://github.com/TheLovinator1/sitemap-parser.git

Usage

The library provides a SiteMapParser class that can be used to parse sitemaps and sitemap indexes. You can pass a URL or raw XML data to the parser to extract the URLs or links to other sitemaps.

Parsing a Sitemap from a URL

from sitemap_parser import SitemapIndex, SiteMapParser, UrlSet

url = "https://www.webhallen.com/sitemap.xml" # Sitemap index
# url = "https://www.webhallen.com/sitemap.infoPages.xml" # Sitemap with URLs
parser = SiteMapParser(source=url)

if parser.has_sitemaps():
    sitemaps: SitemapIndex = parser.get_sitemaps()
    for sitemap in sitemaps:
        print(sitemap)

elif parser.has_urls():
    urls: UrlSet = parser.get_urls()
    # Use a separate loop variable so the `url` string above is not overwritten
    for entry in urls:
        print(entry)
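
A sitemap index only lists other sitemaps, so collecting page URLs usually means following those links in turn. The sketch below shows that traversal with the standard library alone. It is an illustration under stated assumptions, not this library's implementation: iter_page_urls and the fetch callable are hypothetical names, and fetch is injected so the example runs against an in-memory fake site instead of the network.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from collections.abc import Callable, Iterator

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_page_urls(
    start: str,
    fetch: Callable[[str], str],
    seen: set[str] | None = None,
) -> Iterator[str]:
    """Yield page URLs reachable from `start`, following sitemap indexes.

    `fetch` is a hypothetical callable that returns the XML text for a URL.
    """
    seen = set() if seen is None else seen
    if start in seen:  # guard against indexes that reference each other
        return
    seen.add(start)

    root = ET.fromstring(fetch(start))
    if root.tag.endswith("}sitemapindex"):
        # Index file: recurse into every child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                yield from iter_page_urls(child, fetch, seen)
    else:
        # Regular sitemap: emit the page URLs.
        for loc in root.findall("sm:url/sm:loc", NS):
            page = (loc.text or "").strip()
            if page:
                yield page


# In-memory stand-in for a website (made-up URLs).
xmlns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
fake_site = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {xmlns}>"
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/pages.xml": (
        f"<urlset {xmlns}>"
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    ),
}

pages = list(iter_page_urls("https://example.com/sitemap.xml", fake_site.__getitem__))
print(pages)
```

Passing the fetch function in as an argument keeps the traversal logic separate from HTTP and caching concerns, which also makes it easy to test.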

Parsing a Raw XML String

from sitemap_parser import SiteMapParser, UrlSet

xml_data = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>https://example.com/about</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>
"""
parser = SiteMapParser(source=xml_data, is_data_string=True)
urls: UrlSet = parser.get_urls()
for url in urls:
    print(url)
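
For reference, the fields in such a document can also be read with the standard library. The following is a minimal sketch, not this library's API: read_entries is a hypothetical helper, and it assumes lastmod starts with a YYYY-MM-DD date, as in the sample above.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml_data = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>https://example.com/about</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>
"""


def read_entries(xml_text: str) -> list[dict]:
    """Return one dict per <url> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for url in root.findall("sm:url", NS):
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        priority = url.findtext("sm:priority", namespaces=NS)
        entries.append(
            {
                "loc": url.findtext("sm:loc", namespaces=NS),
                # Keep only the date part; lastmod may also carry a time.
                "lastmod": date.fromisoformat(lastmod[:10]) if lastmod else None,
                "changefreq": url.findtext("sm:changefreq", namespaces=NS),
                "priority": float(priority) if priority else None,
            }
        )
    return entries


entries = read_entries(xml_data)
for entry in entries:
    print(entry["loc"], entry["lastmod"], entry["priority"])
```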

Exporting Sitemap Data to JSON

You can export the parsed sitemap data as a JSON string using the JSONExporter class.

import json
from pprint import pprint

from sitemap_parser import JSONExporter, SiteMapParser

parser = SiteMapParser(source="https://www.webhallen.com/sitemap.infoPages.xml")
exporter = JSONExporter(data=parser)

if parser.has_urls():
    # export_urls() returns a JSON string; decode it for pretty-printing
    urls_json: str = exporter.export_urls()
    pprint(json.loads(urls_json))

if parser.has_sitemaps():
    # export_sitemaps() returns a JSON string as well
    sitemaps_json: str = exporter.export_sitemaps()
    pprint(json.loads(sitemaps_json))

Additional Features

Caching

The parser uses the hishel library for caching by default. You can disable caching if needed by passing the should_cache=False flag when creating the SiteMapParser instance.

sitemap_url = "https://www.webhallen.com/sitemap.xml"
parser = SiteMapParser(sitemap_url, should_cache=False)

Configuration

Caching: Responses are cached with Hishel. Besides turning caching off, you can choose the directory where cached responses are stored by passing cache_dir.

Example:

from pathlib import Path
parser = SiteMapParser("https://www.webhallen.com/sitemap.xml", cache_dir=Path("/path/to/cache"))

Disabling Logging

If you want to silence the library's log output, raise the level of the sitemap_parser logger to logging.CRITICAL. This suppresses every message below the CRITICAL level; CRITICAL messages themselves are still emitted.

Here's an example of how to do this:

import logging

# Raise the level so that only CRITICAL messages from the library get through
logging.getLogger("sitemap_parser").setLevel(logging.CRITICAL)
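
If CRITICAL messages should be hidden as well, one option is a level above logging.CRITICAL. The sketch below relies only on the standard logging module and the sitemap_parser logger name used above. The in-memory handler and the log messages are made up for the demonstration; they are not output produced by the library.

```python
import io
import logging

logger = logging.getLogger("sitemap_parser")

# Capture output in memory so the effect of each level is easy to inspect.
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False  # keep the demo output away from the root logger

logger.setLevel(logging.CRITICAL)
logger.error("dropped: ERROR is below CRITICAL")
logger.critical("still emitted")

# Any level above CRITICAL (50) silences the logger completely.
logger.setLevel(logging.CRITICAL + 1)
logger.critical("dropped as well")

captured = stream.getvalue()
print(captured)
```

Setting logger.disabled = True is another way to get the same result.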

Contributing

Contributions are welcome! If you'd like to improve this project, feel free to submit a pull request. Please follow the guidelines below:

  1. Fork the Repository
  2. Create a New Branch
  3. Submit a Pull Request

License

This project is licensed under the MIT License.

Contact

If you have any questions or suggestions, please open an issue on the GitHub repository. You can also reach me via email at tlovinator@gmail.com or on Discord at TheLovinator#9276.

Happy parsing!