svenstaro/miniserve

[Feature Request] Block crawler bots by default with robots.txt

hashFactory opened this issue · 2 comments

Hi all, I have a public-facing instance of miniserve running at home and I've noticed that I often get flooded with requests for random files in my directories by Googlebot (including requests for zipped downloads of entire directories).

I would love it if miniserve, either by default or through a switch, served a static robots.txt that disallows crawling by bots.

Google Developers has an example of how to formulate a robots.txt that disallows crawling here.
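For reference, the minimal form of that is a wildcard user-agent with a blanket disallow, something like:

```
User-agent: *
Disallow: /
```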

If there's some interest, I don't mind trying to implement it myself, but it would have to be in a few days.

Open to suggestions!

Hm I guess that'd make sense. How about a switch --allow-crawlers that disables the automatic robots.txt?

Hmm, this sounds like a reasonable, easy-to-implement thing! If you have a robots.txt yourself, it would just always serve that, but if you don't, it would simply serve a static robots.txt that disallows crawlers (this needs to work with random path generation). But then I think I'd rename the flag to something like --no-robots-txt.
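As a rough sketch of what that could look like (the handler name and flag wiring here are placeholders, not miniserve's actual code), an actix-web route could serve the static content at the server root so it still applies when files live under a randomly generated route:

```rust
use actix_web::{web, App, HttpResponse, HttpServer, Responder};

// Static robots.txt that disallows all crawlers.
const ROBOTS_TXT: &str = "User-agent: *\nDisallow: /\n";

// Hypothetical handler; in practice it would only be registered when the
// proposed flag (e.g. --no-robots-txt / --allow-crawlers) hasn't disabled it
// and the served directory doesn't already contain a robots.txt.
async fn robots_txt() -> impl Responder {
    HttpResponse::Ok()
        .content_type("text/plain; charset=utf-8")
        .body(ROBOTS_TXT)
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // robots.txt must be reachable at the site root, so the route is
    // registered outside of any random route prefix.
    HttpServer::new(|| App::new().route("/robots.txt", web::get().to(robots_txt)))
        .bind(("127.0.0.1", 8080))?
        .run()
        .await
}
```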

The only concern I have is that running miniserve as a permanent web server isn't exactly its intended purpose and would be better left to something like Nginx, right? But despite that, it seems like a very reasonable, small feature to add!