everypolitician/scraped

"invalid byte sequence in UTF-8" from CleanURLs decorator

Closed this issue · 4 comments

When parsing http://www.duma.gov.ru/structure/deputies/1756684/, with the CleanURLs decorator turned on, I get a run-time error:

ArgumentError: invalid byte sequence in UTF-8
from /Users/tony/.rbenv/versions/2.3.3/lib/ruby/2.3.0/uri/rfc2396_parser.rb:305:in `gsub'

This comes from the URL:

=> "http://www.cir.ru/duma/servlet/is4.wwwmain?FormName=ProcessQuery&Action=RunQuery&PDCList=*&QueryString=/GD_\xC4\xC5\xCF\xD3\xD2\xC0\xD2=\"\xC0\xC1\xD0\xC0\xCC\xCE\xC2+\xC8.\xCD.\""

It took quite a bit of digging to even discover that this was coming from the decorator. This decorator should never raise a run-time error like this: if it fails to convert a URL, it should gracefully move on.

Minimal test case showing the problem:

URI.encode(URI.decode('http://cir.duma.gov.ru/duma/servlet/is4.wwwmain?FormName=ProcessQuery&Action=RunQuery&PDCList=*&QueryString=%2FGD_%C4%C5%CF%D3%D2%C0%D2%3D%22%C0%C1%D0%C0%CC%CE%C2+%C8.%CD.%22'))
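# raises ArgumentError: invalid byte sequence in UTF-8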

Looks like the fix might be to rescue StandardError rather than URI::Error, as we currently do.
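A minimal sketch of that change (my illustration; the helper name is hypothetical and the real decorator's internals may differ):

require 'uri'

# Hypothetical helper standing in for the decorator's URL conversion.
def clean_url(url)
  URI.encode(URI.decode(url))
rescue StandardError
  # Covers ArgumentError ("invalid byte sequence in UTF-8") as well as
  # URI::Error: if the URL can't be re-encoded, pass it through unchanged.
  url
end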

mhl commented

I'm not sure how much this matters, since @tmtmtmtm suggests that the idea is that any error should cause processing to continue without an exception (or maybe it's obvious anyway), but: the percent-encoded bytes look like they are from the Windows-1251 Cyrillic character encoding, whereas URLs should only use percent-encoded UTF-8 byte sequences. For me that means there's no reasonable way to fix this kind of problem automatically (guessing character sets is error-prone and something you'd only want to do if there's really no alternative), so skipping such cases seems quite reasonable.
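As a quick check (my own illustration, not from the original thread): reinterpreting the offending bytes as Windows-1251 yields plausible Cyrillic text, which supports that guess.

# The first run of bytes from the URL above, decoded as Windows-1251:
"\xC4\xC5\xCF\xD3\xD2\xC0\xD2".force_encoding('Windows-1251').encode('UTF-8')
# => "ДЕПУТАТ" (Russian for "deputy"), i.e. legacy-encoded Cyrillic text
# rather than the percent-encoded UTF-8 that URI expects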

tmtmtmtm commented

> the idea is that any error should cause processing to continue without an exception

Yes, I think the nature of pretty much any Scraped Decorator should be to do its best to make the underlying HTML more like what the user would like, but if it's unable to improve on what's currently there, it should gracefully move on, rather than preventing the scraper from even running.

In this case, I suspect that the failing link isn't one the scraper is even interested in, so otherwise the choice in the scraper would be either to simply turn off the decorator (and then manually convert the links that are interesting), or to write another decorator that strips out the broken URLs before they reach this one (sketched below).
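A rough sketch of that second option (my illustration, using Nokogiri; the function name is hypothetical and this isn't the gem's actual decorator API): remove any href that can't round-trip through decode/encode, so CleanURLs never sees it.

require 'nokogiri'
require 'uri'

# Hypothetical pre-filter: drop hrefs that can't be re-encoded,
# leaving the link text itself in place.
def strip_undecodable_links(html)
  doc = Nokogiri::HTML(html)
  doc.css('a[href]').each do |link|
    begin
      URI.encode(URI.decode(link['href']))
    rescue StandardError
      link.remove_attribute('href')
    end
  end
  doc.to_html
end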

(As a slight aside: if I'm understanding this particular failure case correctly, we're already dealing with an absolute URL, so the only issue is that we can't convert it to a properly encoded one. I suspect there's also a failure mode where the URL in question is a relative one, in which case we'd probably want to do our best to convert it to an absolute one even if we can't properly encode it; but we can wait to see whether such a case arises before actually trying to handle it.)