cardmagic/simple-rss

description tag should be in UTF-8 encoding but it is in ASCII-8BIT

Tried this also:

l.description.force_encoding('UTF-8').encode!('UTF-8',:invalid => :replace,:replace => '')

But still ending up with:
Uncaught exception: invalid byte sequence in UTF-8

Using

l.description.to_s.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')

solves the issue, but we lose the original Unicode character that was in the source.
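
A minimal sketch of why the replace-based call drops characters. It assumes the description holds valid UTF-8 bytes that are merely tagged as ASCII-8BIT; the sample string and the variable names are made up for illustration. If the bytes are not valid UTF-8 in the first place, relabeling will not help.

```ruby
# Hypothetical description: valid UTF-8 bytes mislabeled as ASCII-8BIT.
raw = "It\xE2\x80\x99s".b

# Transcoding from ASCII-8BIT treats every high byte as undefined,
# so with :replace => '' those bytes are dropped.
lossy = raw.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')

# Relabeling a copy keeps the bytes and only changes the encoding tag.
relabeled = raw.dup.force_encoding('UTF-8')

puts lossy
puts relabeled
puts relabeled.valid_encoding?
```

Checking valid_encoding? after relabeling is a cheap way to tell whether the bytes really were UTF-8.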

I got the same issue.

There is content.force_encoding('binary') in the if condition:

  def unescape(content)
    if content.respond_to?(:force_encoding) && content.force_encoding("binary") =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
      CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
      content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

The force_encoding method changes the string's encoding in place, so every string returned by simple-rss ends up tagged as ASCII-8BIT.
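
A small sketch of that in-place behavior, using a made-up feed value (the variable names are only for illustration):

```ruby
# Hypothetical feed value; dup gives us a mutable UTF-8 string.
title = "caf\u00E9".dup
before = title.encoding

# force_encoding returns the receiver itself, not a copy.
result = title.force_encoding('binary')
same_object = result.equal?(title)

# The original variable now carries the binary tag as well.
after = title.encoding

puts "#{before} -> #{after}, same object: #{same_object}"
```

Because the receiver is returned, using the call inside a condition still changes the caller's string as a side effect.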

I'd rewrite it the following way, but I'm unsure about this 'if' condition as well, so I'm not making a pull request.

  def unescape(content)
    if content.respond_to?(:force_encoding) && encode_binary(content) =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
      CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
      content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

  # Returns a transcoded copy; the caller's string is left untouched.
  def encode_binary(content)
    content.encode('binary', :invalid => :replace, :undef => :replace, :replace => '')
  end
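
A quick check of the helper in isolation, as a sketch with a made-up input string. The point is that encode returns a new string, so the original keeps its encoding; the copy loses its non-ASCII characters, which should be acceptable here since it is only used for the regex test.

```ruby
# Same helper as proposed above, with the options passed as keywords.
def encode_binary(content)
  content.encode('binary', :invalid => :replace, :undef => :replace, :replace => '')
end

# Hypothetical input containing U+2019 (bytes \xE2\x80\x99 in UTF-8).
original = "It\u2019s".dup
probe = encode_binary(original)

puts probe.encoding
puts original.encoding
```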

Hi @evgeniynickolaev, can you please test it with a feed that has non-Latin characters? Meanwhile I will try to post a sample where it failed for me.

Yes, I've tested it with a feed containing the following Unicode symbol: \xE2\x80\x99.
But I'm not sure it is 100% correct, as I don't fully understand the logic of this unescaping.

Just as @evgeniynickolaev pointed out, the immediate source of the problem is force_encoding("binary"), which (even though the name does not end in a bang) mutates the string object in place. However, apparently the reason for adding the force_encoding was the "n" flag in the regexp within the conditional introduced in ac95fb4. The flag says that the regex should be interpreted as binary (ASCII-8BIT) no matter what the source encoding is (see http://www.ruby-doc.org/core-2.1.3/Regexp.html#class-Regexp-label-Encoding).

I'll throw in a fix which simply removes all the fiddling with encodings. I can't figure out any reason why there would be any need for that.
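
Roughly, what such a change could look like. This is a hypothetical sketch, not the committed fix: it keeps the same pattern, drops the "n" flag and the force_encoding call, and assumes the content arrives with a valid encoding (a string with invalid bytes may still raise on the match).

```ruby
require 'cgi'

# Hypothetical rewrite: no encoding changes, so the caller's string keeps
# whatever encoding it arrived with.
def unescape(content)
  if content =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/
    CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/, '').strip
  else
    content.gsub(/(<!\[CDATA\[|\]\]>)/, '').strip
  end
end

# Made-up sample value with a non-ASCII character inside a CDATA wrapper.
sample = "<![CDATA[It\u2019s here]]>".dup
cleaned = unescape(sample)
puts cleaned.encoding
```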

I ran into the same problem. This gem is not well maintained. I'm going with other gems.

@chengguangnan what other gem have you found that is well maintained?

Hi @jeremyhaile, I switched to feedjira.