raviqqe/muffet

Add support for multiple URLs

cipriancraciun opened this issue · 2 comments

Assuming #38 is solved (i.e. muffet doesn't fetch the same URL twice), it would be useful to allow muffet to take multiple URLs.

For example, say one has both a www and a blog site that happen to share some resources. If one could list both sites in the same muffet invocation, the shared URLs would be checked only once.

A different use case would be to combine --one-page-only (i.e. turning recursion off) with listing all known URLs on the command line.
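Hypothetically, such an invocation might look like muffet --one-page-only https://www.example.com/ https://www.example.com/about https://blog.example.com/ (the example.com URLs are placeholders).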


Complementary to multiple URLs, a separate option to read these URLs from a file would allow even more flexibility.

For example, one could take the sitemap.xml, process that to extract the URLs that search engines would actually crawl, put these URLs in a file, one per line, and instruct muffet to execute only on those URLs.

For example, muffet --one-page-only --urls ./sitemap.txt would check all links listed in sitemap.txt without recursing.

Meanwhile, muffet --urls ./sitemap.txt would check all listed links and recurse for each one, but without crossing outside that URL's domain.
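For context, a rough sketch of how such a sitemap.txt could be produced (assuming a flat sitemap.xml with plain <loc> entries at https://www.example.com/sitemap.xml; a sitemap index file would need an extra pass):

curl -s https://www.example.com/sitemap.xml | grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||' > ./sitemap.txt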

Is it possible to simply use a shell script or another scripting language for this use case? It would be just one line of code, I guess.

For example,

for url in $(cat urls.txt); do muffet "$url"; done
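And for the --one-page-only case over a sitemap file, a similar sketch (assuming sitemap.txt contains one URL per line):

while read -r url; do muffet --one-page-only "$url"; done < ./sitemap.txt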

How big is your URL list in your use case?

As said previously, the main use case is when a set of different domains shares a lot of common URLs; checking each of them independently generates needless traffic and could get one throttled, especially by GitHub or CloudFlare.

The proposal of reading the list of URLs from a file would provide an alternative way of supporting multiple domains, but would also allow checking only a smaller part of a larger site, without having to resort to complex exclusion regular expressions.

Or, for example, one could use muffet to crawl an entire site, extract the pages that have broken links, fix only those, and re-run muffet only on those pages with --one-page-only, again with shared resources fetched only once.
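A rough sketch of that last workflow, assuming muffet prints each page that contains broken links as an unindented URL line with the individual broken links indented beneath it:

# collect the pages that currently have broken links (example.com is a placeholder)
muffet https://www.example.com/ | grep -v -e '^[[:space:]]' -e '^$' > ./broken-pages.txt
# ... fix the reported pages, then re-check only those, without recursing ...
while read -r url; do muffet --one-page-only "$url"; done < ./broken-pages.txt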