description tag should be in UTF-8 encoding but it is in ASCII-8BIT
Tried this also:
l.description.force_encoding('UTF-8').encode!('UTF-8',:invalid => :replace,:replace => '')
But still ending up with:
Uncaught exception: invalid byte sequence in UTF-8
Using
l.description.to_s.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
solves the issue, but we lose the original Unicode character that was in the source.
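For illustration, here is a minimal sketch of why the transcoding call drops the character. It assumes the bytes coming back from the parser are valid UTF-8 that has merely been tagged ASCII-8BIT; the variable names are made up for the example, and a feed whose bytes are not valid UTF-8 would need different handling.

```ruby
# The curly quote U+2019 is the three bytes E2 80 99 in UTF-8.
# String#b returns a copy tagged ASCII-8BIT, mimicking the mis-tagged string.
raw = "It\u2019s".b

# Transcoding from ASCII-8BIT: every byte above 0x7F has no defined
# conversion, so with :undef => :replace and an empty :replace it is dropped.
lossy = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

# Re-tagging a copy: the bytes are untouched, only the encoding label changes.
kept = raw.dup.force_encoding('UTF-8')

puts lossy                 # Its
puts kept                  # It’s
puts kept.valid_encoding?  # true
```

Re-tagging only helps when the underlying bytes really are UTF-8; otherwise valid_encoding? returns false and the original error comes back.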
Got the same issue.
There is content.force_encoding('binary') in the if condition:
def unescape(content)
  if content.respond_to?(:force_encoding) && content.force_encoding("binary") =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
    CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
  else
    content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
  end
end
The force_encoding method changes the string's encoding in place, so every string returned by simple-rss ends up tagged as ASCII-8BIT...
I'd rewrite it the following way, but I'm unsure whether this is right for this 'if' as well, so I haven't made a pull request.
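A small sketch of the in-place behaviour, using made-up variable names. String#b (Ruby 2.0 and later) is shown as one non-mutating way to get a binary-tagged copy; it is an alternative for comparison, not something the gem uses.

```ruby
# force_encoding has no bang, yet it re-tags the receiver itself.
original = "It\u2019s".dup
original.force_encoding('binary')
puts original.encoding == Encoding::BINARY  # true: the caller's string changed

# String#b returns a binary-tagged copy and leaves the receiver alone.
fresh = "It\u2019s".dup
copy = fresh.b
puts fresh.encoding == Encoding::UTF_8      # true: still UTF-8
puts copy.encoding == Encoding::BINARY      # true: only the copy is binary
```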
def unescape(content)
  if content.respond_to?(:force_encoding) && encode_binary(content) =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
    CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
  else
    content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
  end
end

# Returns a binary-encoded copy for the regexp check; content itself is not modified.
def encode_binary(content)
  content.encode('binary', :invalid => :replace, :undef => :replace, :replace => '')
end
Hi @evgeniynickolaev, can you please test it with a feed that has non-Latin characters? Meanwhile, I will try to post a sample where it failed for me.
Yes, I've tested it with a feed containing the following Unicode character: \xE2\x80\x99.
But I'm not sure it is 100% correct, as I don't fully understand the logic of this unescaping.
Just as @evgeniynickolaev pointed out, the immediate source of the problem is force_encoding("binary"), which (even though the name does not end in a bang) mutates the string object in place. However, apparently the reason for adding the force_encoding was the "n" flag on the regexp within the conditional introduced in ac95fb4. It says that the regexp should be interpreted as binary (ASCII-8BIT) no matter what the source encoding is (see http://www.ruby-doc.org/core-2.1.3/Regexp.html#class-Regexp-label-Encoding).
I'll throw in a fix which simply removes all the fiddling with encodings. I can't figure out any reason why there would be any need for that.
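A rough sketch of what unescape might look like with the encoding handling taken out, that is, no force_encoding and no "n" flag. This is only an illustration based on the snippet quoted earlier in the thread, not the actual commit.

```ruby
require 'cgi'

# Same logic as the quoted method, minus force_encoding and the /n flag,
# so the caller's string keeps whatever encoding it arrived with.
def unescape(content)
  if content =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/
    CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/, '').strip
  else
    content.gsub(/(<!\[CDATA\[|\]\]>)/, '').strip
  end
end

input  = "<![CDATA[It\u2019s]]>"
output = unescape(input)
puts output                                # It’s
puts output.encoding == Encoding::UTF_8    # true
puts input.encoding == Encoding::UTF_8     # true: the input was not re-tagged
```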
I ran into the same problem. This gem is not well maintained, so I'm going with another gem.
@chengguangnan what other gem have you found that is well maintained?
Hi @jeremyhaile, I switched to feedjira.