Obeying the robots.txt file

Question

TechnologyClassroom opened this issue a year ago · 1 comment

TechnologyClassroom commented a year ago

I recently found this project while reading server logs. Someone is scraping one of the sites that I help administer, supposedly using AHC/2.1, and they are not obeying the robots.txt file. There should be several seconds of delay between requests, but the scraper appears to be sending about 1 request/second. Is this normal behavior for AHC, or is it a user misconfiguration of some kind? If it is normal, could support for robots.txt Crawl-delay values be added and enabled by default?

Answer 1 · 2024-08-31T19:10:02.000Z

The user must have configured their application to send a request to your web server once per second. AHC is an HTTP client library, so it is up to the user how they use it, including how quickly they send requests. There are also no plans to support robots.txt at the moment.
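
Since AHC leaves pacing to the application, anyone writing a crawler on top of it would have to read Crawl-delay themselves. Below is a minimal sketch of how that might look using only the JDK. The `CrawlDelay` class and its `parse` method are hypothetical names made up for this example and are not part of AHC. It also assumes a simple substring match between the robots.txt `User-agent` value and the client's user-agent string, and treats the Crawl-delay value as seconds. Note that Crawl-delay is a non-standard extension (it is not defined in RFC 9309), so crawlers interpret it differently.

```java
import java.time.Duration;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper (not part of AHC): reads Crawl-delay from robots.txt text. */
final class CrawlDelay {

    private CrawlDelay() {}

    /**
     * Returns the Crawl-delay of the first group whose User-agent value is a
     * substring of userAgent (case-insensitive), falling back to the "*" group.
     * Returns an empty Optional when no applicable Crawl-delay is present.
     */
    static Optional<Duration> parse(String robotsTxt, String userAgent) {
        String wanted = userAgent.toLowerCase(Locale.ROOT);
        Duration specific = null;        // delay from a group naming this agent
        Duration wildcard = null;        // delay from the "*" group
        boolean matchesAgent = false;    // current group names this agent
        boolean matchesWildcard = false; // current group names "*"
        boolean inAgentLines = false;    // reading consecutive User-agent lines

        for (String raw : robotsTxt.split("\\R")) {
            // Drop trailing comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    // First User-agent line after a rule: a new group starts.
                    matchesAgent = false;
                    matchesWildcard = false;
                    inAgentLines = true;
                }
                if (value.equals("*")) {
                    matchesWildcard = true;
                } else if (!value.isEmpty()
                        && wanted.contains(value.toLowerCase(Locale.ROOT))) {
                    matchesAgent = true;
                }
                continue;
            }

            inAgentLines = false;
            if (!field.equals("crawl-delay")) {
                continue; // Allow, Disallow, Sitemap, etc. are ignored here
            }
            try {
                double seconds = Double.parseDouble(value);
                if (!Double.isFinite(seconds) || seconds < 0) {
                    continue; // ignore nonsensical values
                }
                Duration delay = Duration.ofMillis(Math.round(seconds * 1000));
                if (matchesAgent && specific == null) {
                    specific = delay;
                }
                if (matchesWildcard && wildcard == null) {
                    wildcard = delay;
                }
            } catch (NumberFormatException ignored) {
                // Unparseable Crawl-delay value: skip it.
            }
        }
        return Optional.ofNullable(specific != null ? specific : wildcard);
    }
}
```

The application would fetch `/robots.txt` with whatever client it already uses, pass the body to `CrawlDelay.parse(body, "AHC/2.1")`, and then wait for the returned duration between requests to that host (for example with `Thread.sleep(delay.toMillis())` in a simple sequential crawler). None of this happens automatically in AHC, so it only helps if the person running the scraper chooses to add it.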