description tag should be in UTF-8 encoding but it is in ASCII-8BIT

Question

description tag should be in UTF-8 encoding but it is in ASCII-8BIT

Opened this issue 11 years ago · 9 comments

I also tried this:

l.description.force_encoding('UTF-8').encode!('UTF-8',:invalid => :replace,:replace => '')

But I still end up with:
Uncaught exception: invalid byte sequence in UTF-8

Answer 1 · 2014-01-20T12:56:35.000Z

Using

l.description.to_s.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')

solves the issue, but we lose the original Unicode characters that were in the source.
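
A minimal sketch of why the characters disappear, assuming the description arrives tagged as ASCII-8BIT but actually holds valid UTF-8 bytes. The sample string and variable names are made up for illustration and are not taken from simple-rss:

```ruby
# Hypothetical sample: the UTF-8 bytes for "it's" with a curly apostrophe
# (U+2019), tagged as ASCII-8BIT the way the gem returns them.
raw = "it\xE2\x80\x99s".b

# Transcoding from ASCII-8BIT treats every byte above 0x7F as undefined,
# so the three bytes of U+2019 are replaced with '' and the character is lost.
lossy = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

# Relabelling a copy keeps the bytes as they are; valid_encoding? then tells
# us whether they really are well-formed UTF-8.
relabelled = raw.dup.force_encoding('UTF-8')

puts lossy
puts relabelled
puts relabelled.valid_encoding?
```

Relabelling only helps when the underlying bytes are already UTF-8. If the feed uses another encoding, the relabelled string will report an invalid encoding and a real transcoding step is still needed.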

Answer 2 · 2014-02-24T07:02:07.000Z

I got the same issue.

Answer 3 · 2014-02-24T07:35:47.000Z

There is content.force_encoding('binary') in the if condition:

  def unescape(content)
    if content.respond_to?(:force_encoding) && content.force_encoding("binary") =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
      CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
      content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

The force_encoding method changes the string's encoding in place, so every string returned by simple-rss ends up tagged as ASCII-8BIT.
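
The in-place behaviour can be checked in isolation. This is only a sketch with a made-up sample string, not code from the gem:

```ruby
# Hypothetical UTF-8 sample; dup gives us an unfrozen string to work with.
original = "caf\u00E9".dup

# force_encoding returns its receiver, so the "result" is the same object.
returned = original.force_encoding("binary")
same_object = returned.equal?(original)
encoding_after = original.encoding

# Undo the relabelling, then relabel a copy instead of the original.
original.force_encoding("UTF-8")
probe = original.dup.force_encoding("binary")

puts same_object
puts encoding_after
puts original.encoding
puts probe.encoding
```

Calling dup before force_encoding is one way to get a binary view for a test without touching the string that is handed back to the caller.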

I'd rewrite it the following way, but I'm not sure about this 'if' condition either, so I haven't made a pull request.

  def unescape(content)
    if content.respond_to?(:force_encoding) && encode_binary(content) =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
      CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
      content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

  # Returns a binary-encoded copy for the regexp test; content itself is left untouched.
  def encode_binary(content)
    content.encode('binary', :invalid => :replace, :undef => :replace, :replace => '')
  end
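
A quick way to see what encode_binary hands to the regexp, sketched outside the gem with a made-up input string:

```ruby
# Hypothetical feed fragment: a curly apostrophe (U+2019) plus a percent escape.
content = "it\u2019s%20here"

# Same call as encode_binary above: encode builds a new string, and characters
# with no binary equivalent are dropped from that copy only.
probe = content.encode('binary', invalid: :replace, undef: :replace, replace: '')

puts probe
puts probe.encoding
puts content
puts content.encoding
```

The copy loses its non-ASCII characters, but it is only used for the percent-escape test, so the string passed on to CGI.unescape or gsub keeps its original encoding and content.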

Answer 4 · 2014-02-24T07:43:43.000Z

Hi @evgeniynickolaev, can you please test it with a feed that has non-Latin characters? Meanwhile I will try to post a sample where it failed for me.

Answer 5 · 2014-02-24T07:47:21.000Z

Yes, I've tested it with a feed containing the following Unicode symbol: \xE2\x80\x99 (the UTF-8 bytes of U+2019, the right single quotation mark).
But I'm not sure it is 100% correct, as I don't fully understand the logic of this unescaping.

Answer 6 · 2014-10-02T22:17:06.000Z

Just as @evgeniynickolaev pointed out, the immediate source of the problem is force_encoding("binary"), which (even though the name does not end in a bang) mutates the string object in place. However, apparently the reason for adding the force_encoding was the "n" flag on the regexp within the conditional, introduced in ac95fb4. That flag says the regexp should be interpreted as binary (ASCII-8BIT) no matter what the source encoding is (see http://www.ruby-doc.org/core-2.1.3/Regexp.html#class-Regexp-label-Encoding).

I'll throw in a fix which simply removes all the fiddling with encodings. I can't figure out any reason why there would be any need for that.

Answer 7 · 2015-07-04T05:12:14.000Z

I ran into the same problem. This gem is not well maintained. I'm going with other gems.

Answer 8 · 2015-07-09T15:19:05.000Z

@chengguangnan what other gem have you found that is well maintained?

Answer 9 · 2015-07-10T06:31:45.000Z

Hi @jeremyhaile, I switched to feedjira.