MarginaliaSearch/MarginaliaSearch

(crawler) Implement sitemap support

Closed · 1 comment

Sitemaps are currently not supported. Implementing sitemap support might help the crawler with URL discovery on some sites.

There are some risks, though. Some sitemaps are huge: Neocities' sitemap, for example, covers all of Neocities. This needs to be dealt with gracefully, probably with some sort of fast-failing upper limit so the crawler isn't exposed to OOM problems.
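A fast-failing limit could look roughly like the sketch below. It is not taken from the crawler's code: `BoundedSitemapReader`, `readBounded`, and the `maxBytes` parameter are hypothetical names, and the right limit is an open question. The idea is to stop reading as soon as the byte budget is exceeded rather than buffering the whole response first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Marginalia's codebase.
public class BoundedSitemapReader {

    /**
     * Reads the stream into memory, but fails as soon as more than
     * maxBytes have been read, so an oversized sitemap is rejected
     * before it can exhaust the heap.
     */
    public static byte[] readBounded(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                // Fail fast: nothing past the limit is ever buffered.
                throw new IOException("Sitemap exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

For compressed sitemaps, the limit would have to apply to the decompressed stream (for example by wrapping the input in a `GZIPInputStream` before calling `readBounded`), since a small download can expand to a very large document.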

Some sitemaps also contain URLs for other domains. Since Marginalia's crawler is designed to operate on one domain at a time, these may need to be ignored initially.
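Ignoring off-domain entries could be as simple as the following sketch. The names `SitemapDomainFilter` and `keepSameDomain` are made up for illustration, and it assumes an exact, case-insensitive host match is what "same domain" should mean; subdomain handling would need a separate decision.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Marginalia's codebase.
public class SitemapDomainFilter {

    /**
     * Returns only the URLs whose host matches the domain being crawled.
     * Malformed URLs and URLs without a host are dropped silently.
     */
    public static List<String> keepSameDomain(List<String> urls, String domain) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            try {
                String host = new URI(url).getHost();
                if (host != null && host.equalsIgnoreCase(domain)) {
                    kept.add(url);
                }
            } catch (URISyntaxException e) {
                // Skip entries that are not valid URIs.
            }
        }
        return kept;
    }
}
```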

Maybe look at Google's specs? https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap
Also: https://www.sitemaps.org/protocol.html

Implemented.