everypolitician/scraped

AbsoluteUrls: print warning when URL can't be parsed

Problem

At the moment, if AbsoluteUrls can't parse a URL, the rescue block in absolute_url silently swallows the exception and returns the URL unchanged. This makes it tricky to debug problems such as the one we hit in everypolitician-scrapers/denmark-folketing#3, where image URLs containing a space character weren't being parsed.
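
For illustration, the underlying failure can be reproduced with Ruby's standard library alone. This is a minimal sketch: the base URL and image path are made-up placeholders, not the actual URLs from the Denmark scraper.

```ruby
require 'uri'

# Hypothetical base URL and relative image path (placeholders for illustration).
base = 'http://example.com/members/'
relative = 'images/jane doe.jpg'

begin
  URI.join(base, relative)
rescue URI::InvalidURIError => e
  # The unescaped space makes the relative URL unparseable, so URI.join raises.
  puts "#{e.class}: #{e.message}"
end
```

Because AbsoluteUrls rescues this exception without reporting it, nothing in the scraper's output hints that the URL was left relative.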

Proposed solution

Something like this:

diff --git a/lib/scraped/response/decorator/absolute_urls.rb b/lib/scraped/response/decorator/absolute_urls.rb
index 268a695..66902af 100644
--- a/lib/scraped/response/decorator/absolute_urls.rb
+++ b/lib/scraped/response/decorator/absolute_urls.rb
@@ -16,7 +16,8 @@ module Scraped
 
         def absolute_url(relative_url)
           URI.join(url, relative_url) unless relative_url.to_s.empty?
-        rescue URI::InvalidURIError
+        rescue URI::InvalidURIError => e
+          warn "Could not make #{relative_url.inspect} absolute: #{e.message}" if ENV['VERBOSE']
           relative_url
         end
       end
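
To try the proposed behaviour outside the gem, here is a standalone sketch of the patched method. It assumes a plain top-level method that takes the base URL as an argument (the real method is an instance method that reads `url` from the response), and the URLs are hypothetical.

```ruby
require 'uri'

# Standalone stand-in for the decorator method in the diff above.
# Assumption: base_url is passed in explicitly instead of coming from the response.
def absolute_url(base_url, relative_url)
  URI.join(base_url, relative_url) unless relative_url.to_s.empty?
rescue URI::InvalidURIError => e
  # Only report the failure when the VERBOSE environment variable is set.
  warn "Could not make #{relative_url.inspect} absolute: #{e.message}" if ENV['VERBOSE']
  relative_url
end

# Hypothetical URLs for illustration: the space triggers the rescue branch.
absolute_url('http://example.com/members/', 'images/jane doe.jpg')
```

Running this with `VERBOSE=1` set in the environment should print the warning to stderr; without it, the method still falls back to returning the original string, so existing scrapers keep their current output.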