Fetch Googlebot IP ranges from their published JSON resource
Google publishes the current IP ranges for Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot#automatic
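For reference, here is a minimal sketch of pulling and parsing that resource with the Ruby standard library. The exact JSON location and the `ipv4Prefix`/`ipv6Prefix` key names reflect what the page above points to today, so treat them as assumptions that may change:

```ruby
# Rough sketch only: the JSON URL and key names are assumptions based on
# the page linked above and may change.
require "net/http"
require "json"
require "ipaddr"

GOOGLEBOT_JSON = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def googlebot_ranges
  body = Net::HTTP.get(URI(GOOGLEBOT_JSON))
  JSON.parse(body)
      .fetch("prefixes", [])
      .filter_map { |p| p["ipv4Prefix"] || p["ipv6Prefix"] }
      .map { |cidr| IPAddr.new(cidr) }
end

# Usage: googlebot_ranges.any? { |range| range.include?("66.249.66.1") }
```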
Of course, Legitbot could fetch them with `fetch:url`, similarly to how it works for Ahrefs (legitbot/lib/legitbot/ahrefs.rb, lines 6 to 7 in e5c8923).
But we don't know the cadence of changes to this list, and `fetch:url` updates the Legitbot sources. Even with the automatic detection in place, the change would have to wait until the next release.
In order to dynamically fetch Googlebot IP ranges from their published JSON, the `ip_ranges` block can be used, similarly to how it works for Facebook (legitbot/lib/legitbot/facebook.rb, lines 10 to 19 in e5c8923).
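A dynamic matcher might then look roughly like the sketch below. The class name is made up, and the `BotMatch` / `ip_ranges` / `rule` usage is modelled on how the existing matchers appear to be put together, so this is a sketch rather than a drop-in implementation:

```ruby
# frozen_string_literal: true
# Sketch only: the class name is hypothetical, and the BotMatch, ip_ranges
# and rule usage is assumed from the existing matchers. The JSON URL and
# key names are assumptions as well.
require "net/http"
require "json"

module Legitbot
  # https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot#automatic
  class GoogleIpRanges < BotMatch
    ip_ranges do
      body = Net::HTTP.get(
        URI("https://developers.google.com/static/search/apis/ipranges/googlebot.json")
      )
      # Return the CIDR strings for the matcher to check IPs against.
      JSON.parse(body)
          .fetch("prefixes", [])
          .filter_map { |p| p["ipv4Prefix"] || p["ipv6Prefix"] }
    end
  end

  rule Legitbot::GoogleIpRanges, %w[Googlebot]
end
```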
We probably need `fetch:url` factored out from the RuboCop cop sources, though, so that it is easily accessible.
Though I have to add that I am against making pre-fetching the IP ranges list the default behaviour.
The currently implemented DNS-based approach is superior because it relies on DNS caching (including eviction). Only the first request may be slow; all subsequent requests will utilise the cache. The somewhat increased latency of the first request is not a big deal for web crawlers, and it does not affect human visitors.
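To illustrate what that DNS-based check amounts to (reverse lookup, domain check, forward confirmation), independent of Legitbot's internals:

```ruby
# Illustration of the DNS-based verification described above, not Legitbot's
# own code. Caching happens in whatever resolver the host is configured to use.
require "resolv"

def verified_googlebot?(ip)
  name = Resolv.getname(ip) # reverse lookup, e.g. crawl-66-249-66-1.googlebot.com
  return false unless name.end_with?(".googlebot.com", ".google.com")

  # Forward-confirm the name maps back to the same IP.
  Resolv.getaddresses(name).include?(ip)
rescue Resolv::ResolvError
  false
end

# verified_googlebot?("66.249.66.1") # => true when run against live DNS
```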
Conversely, if someone wants to fetch IP ranges from an external resource, they would also be responsible for refreshing that list regularly using `reload_ips`.
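For anyone going that route, a refresh could be as simple as the sketch below; whether `reload_ips` is exposed on the matcher class exactly like this is an assumption on my part:

```ruby
# Hypothetical periodic refresh. The receiver and the exact reload_ips name
# are assumptions; check the actual Legitbot API before copying this.
Thread.new do
  loop do
    sleep 24 * 60 * 60 # e.g. once a day
    begin
      Legitbot::GoogleIpRanges.reload_ips
    rescue StandardError => e
      warn "Googlebot IP range refresh failed: #{e.message}"
    end
  end
end
```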