alaz/legitbot

Fetch Googlebot IP ranges from their published JSON resource

Closed this issue · 1 comments

alaz commented

Google publishes the current IP ranges for Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot#automatic

Of course Legitbot could fetch them with fetch:url, similarly to how it works for Ahrefs:

# @fetch:url https://api.ahrefs.com/v3/public/crawler-ip-ranges?output=json
# @fetch:jsonpath $.prefixes[*].ipv4Prefix

But we don't know the cadence of changes to this list, and fetch:url only updates the Legitbot sources. Even with automatic detection in place, any change would have to wait until the next release.

To fetch the Googlebot IP ranges dynamically from the published JSON, an ip_ranges block can be used, similarly to how it works for Facebook:

ip_ranges do
  client = Irrc::Client.new
  client.query :radb, AS, source: :radb
  results = client.perform
  %i[ipv4 ipv6].map do |family|
    results[AS][family][AS]
  end.flatten
end
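For Googlebot, the body of such a block could fetch and parse the published JSON at runtime. The following is only a sketch: the URL and the "ipv4Prefix"/"ipv6Prefix" key names follow the format of Google's published googlebot.json, and the helper name is hypothetical, not part of Legitbot.

```ruby
require 'json'
require 'net/http'
require 'uri'

# URL of Google's published Googlebot ranges (as documented on the
# "Verifying Googlebot" page linked above).
GOOGLEBOT_JSON = 'https://developers.google.com/static/search/apis/ipranges/googlebot.json'

# Hypothetical helper: pull CIDR strings out of the "prefixes" array.
# Each entry carries either an "ipv4Prefix" or an "ipv6Prefix" key.
def googlebot_prefixes(json)
  JSON.parse(json)['prefixes'].flat_map do |prefix|
    prefix.values_at('ipv4Prefix', 'ipv6Prefix').compact
  end
end

# Inside a Legitbot bot class this would then become roughly:
#   ip_ranges { googlebot_prefixes(Net::HTTP.get(URI(GOOGLEBOT_JSON))) }
```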

We probably need fetch:url factored out of the RuboCop cop sources though, so it is easily accessible.

alaz commented

Though I have to add that I am against making pre-fetching of the IP ranges list the default behaviour.

The currently implemented DNS-based approach is superior because it relies on DNS caching (including eviction). Only the first request may be slow; all subsequent requests utilise the cache. The slightly increased latency of that first request is not a big deal for web crawlers, and it does not affect human visitors.
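For reference, the DNS-based verification boils down to a reverse PTR lookup followed by a forward confirmation. A minimal sketch using Ruby's stdlib resolver (simplified; the real implementation handles caching and error cases more carefully):

```ruby
require 'resolv'

# Reverse-then-forward DNS check, as recommended by Google:
# 1. resolve the visitor IP to a host name (PTR),
# 2. require a googlebot.com / google.com host name,
# 3. resolve that name back and confirm it includes the original IP.
def googlebot_via_dns?(ip)
  name = Resolv.getname(ip)
  return false unless name.end_with?('.googlebot.com', '.google.com')

  Resolv.getaddresses(name).include?(ip)
rescue Resolv::ResolvError
  false
end
```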

By contrast, if someone wants to fetch IP ranges from an external resource, they would also be responsible for refreshing the list regularly using reload_ips.
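One way to do that refresh is a background thread. The wrapper below is illustrative only; the interval and the exact reload_ips receiver are assumptions, not something Legitbot prescribes.

```ruby
# Hypothetical helper: invoke a reload callable every `interval` seconds
# on a background thread. In an app this might be scheduled as, say:
#   start_ip_refresher(24 * 3600) { Legitbot::Google.reload_ips }
# (receiver assumed; adapt to however reload_ips is exposed).
def start_ip_refresher(interval, &reload)
  Thread.new do
    loop do
      sleep interval
      reload.call
    end
  end
end
```

A cron job or your framework's job scheduler would work just as well; the point is only that the refresh responsibility moves to the application.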