jekyll/jekyll

Ruby 1.9 character encoding changes

Closed this issue · 15 comments

With Ruby 1.8, incorrectly encoded UTF-8 characters are silently ignored. If you have a post with invalid UTF-8 byte sequences in the content body, they will show up in your rendered page as question marks (unknown characters).

A user upgrading from Ruby 1.8 to Ruby 1.9 whose site seemed to be working fine would get a confusing error when trying to render their site (assuming it had incorrectly encoded UTF-8 characters):

/Users/blake/projects/jekyll/lib/jekyll/convertible.rb:26:in `read_yaml': invalid byte sequence in UTF-8 (ArgumentError)
        from /Users/blake/projects/jekyll/lib/jekyll/post.rb:39:in `initialize'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:110:in `new'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:110:in `block in read_posts'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:108:in `each'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:108:in `read_posts'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:169:in `read_directories'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:79:in `read'
        from /Users/blake/projects/jekyll/lib/jekyll/site.rb:71:in `process'
        from ../jekyll/bin/jekyll:150:in `'

This doesn't really help the user fix the problem post. This commit will at least display the problem post so that the user knows what needs to be fixed for the site to render successfully.
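The idea of pointing at the offending file can be sketched roughly as below. This is only an illustration, not the code from the commit: `find_badly_encoded` is a hypothetical helper, and it assumes every post is meant to be UTF-8.

```ruby
# Hypothetical helper (not part of Jekyll): returns the paths of files
# whose bytes are not valid UTF-8, so the user knows which post to fix.
def find_badly_encoded(paths)
  paths.reject do |path|
    # Read the raw bytes, then label them as UTF-8 without transcoding.
    content = File.open(path, "rb") { |f| f.read }
    content.force_encoding("UTF-8").valid_encoding?
  end
end
```

Something like `find_badly_encoded(Dir.glob("_posts/*"))` would then list the posts that need to be re-saved, assuming the posts live in `_posts`.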

This is mainly an issue of how Ruby decides to handle String encodings by default. You can read more about it here: http://blog.grayproductions.net/articles/ruby_19s_string
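As a small illustration of that behaviour (a sketch, not taken from Jekyll): in Ruby 1.9 every String carries an encoding label, and the same bytes can be valid under one label and invalid under another. The assumption here is that `read_yaml` fails because it runs a character-aware operation, such as a regexp match, on the file content.

```ruby
# The same four bytes, labelled with two different encodings.
bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")   # "caf" followed by byte 0xE9

as_latin1 = bytes.dup.force_encoding("ISO-8859-1")
as_utf8   = bytes.dup.force_encoding("UTF-8")

latin1_ok = as_latin1.valid_encoding?   # true: 0xE9 is a character in ISO-8859-1
utf8_ok   = as_utf8.valid_encoding?     # false: 0xE9 alone is not valid UTF-8

# Operations that have to interpret characters, such as a regexp match,
# can raise on the invalid string instead of skipping the bad byte.
match_error = begin
  as_utf8 =~ /caf/
  nil
rescue ArgumentError => e
  e.message
end
```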

In my case I was getting the following error:

/usr/local/rvm/gems/ruby-1.9.1-p378/gems/jekyll-0.7.0/lib/jekyll/convertible.rb:26:in `read_yaml': invalid byte sequence in US-ASCII (ArgumentError)
    from /usr/local/rvm/gems/ruby-1.9.1-p378/gems/jekyll-0.7.0/lib/jekyll/page.rb:24:in `initialize'
    from /usr/local/rvm/gems/ruby-1.9.1-p378/gems/jekyll-0.7.0/lib/jekyll/site.rb:185:in `new'
    from /usr/local/rvm/gems/ruby-1.9.1-p378/gems/jekyll-0.7.0/lib/jekyll/site.rb:185:in `block in read_directories'
    from /usr/local/rvm/gems/ruby-1.9.1-p378/gems/jekyll-0.7.0/lib/jekyll/site.rb:175:in `each'

I solved the problem by declaring the following locale in my shell:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
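The locale matters because Ruby 1.9 derives `Encoding.default_external`, the encoding it assumes for files it reads, from the environment. As a sketch of an alternative that does not depend on the shell (an assumption for illustration, not what Jekyll did at the time), a script can name the encoding itself:

```ruby
require "tempfile"

# Write a small UTF-8 file to read back (a stand-in for a post).
file = Tempfile.new("post")
file.binmode
file.write("caf\u00E9\n")
file.close

# Option 1: name the encoding for a single read, regardless of locale.
content = File.read(file.path, encoding: "UTF-8")

# Option 2: change the process-wide default for subsequent reads.
Encoding.default_external = Encoding::UTF_8
content_default = File.read(file.path)
```

Option 1 is the narrower change; option 2 affects every later read in the process, which may or may not be what you want.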

Just got bitten by this after recently switching to 1.9 as my default Ruby. Thanks for the patch.

I think I'm running into this problem, but only when running the jekyll command via SSH, not if I run jekyll directly on the host machine. Jekyll also runs without errors on the client machine — it's only over SSH that I encounter this problem:

/usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/convertible.rb:26:in `read_yaml': invalid byte sequence in US-ASCII (ArgumentError)
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/post.rb:39:in `initialize'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:119:in `new'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:119:in `block in read_posts'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:117:in `each'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:117:in `read_posts'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:211:in `read_directories'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:88:in `read'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/lib/jekyll/site.rb:79:in `process'
    from /usr/local/lib/ruby/gems/1.9.1/gems/jekyll-0.10.0/bin/jekyll:164:in `<top (required)>'
    from /usr/local/bin/jekyll:19:in `load'
    from /usr/local/bin/jekyll:19:in `<main>'

I haven't tried lmmendes' fix yet (sorry, how/where do I declare those locales, and just on the host machine, or both?) but does anybody have any ideas why SSH is creating these problems?

Thanks.

Put these two lines in your .bashrc:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Thanks Kwpolska.

I ended up having to put those lines in my .profile, but they did the trick.
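One plausible explanation (an assumption; nothing in this thread confirms it) is that the SSH session starts with different locale variables than a local login, so Ruby picks a different default encoding. A small diagnostic like the following, run both locally and over SSH, would show whether that is the case. `encoding_report` is a hypothetical name.

```ruby
# Hypothetical diagnostic: collect the settings that influence how Ruby
# labels the files it reads, so two environments can be compared.
def encoding_report(env = ENV)
  {
    "LC_ALL"           => env["LC_ALL"],
    "LANG"             => env["LANG"],
    "default_external" => Encoding.default_external.name
  }
end

encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

If the SSH run reports `US-ASCII` as `default_external` while the local run reports `UTF-8`, the locale exports need to go in whichever startup file the SSH session actually reads.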

I just got a similar error.
My environment is Windows XP with Ruby 1.9.2.
Any recommendations for Windows?

Thanks.

@dengwh, for Windows set the same environment variables. In your cmd.exe, type

set LC_ALL=en_US.UTF-8
set LANG=en_US.UTF-8

@dengwh, for Windows you can use

chcp 65001  

seems connected to #117

I'm trying to get a post-receive hook to work on Arch Linux with Ruby 1.9 and I'm getting this ASCII error. I've tried adding the UTF-8 settings to my .profile, but I'm still getting the error. I assume the git hook doesn't use my .profile, though. Any further suggestions?

EDIT: I just applied the patch to this file and it works fine now. Duh... and thank you!

connected to #226, #201


This fix worked for me, whereas the others didn't: http://stackoverflow.com/a/8274677/1303499

I had a text file with a ü, but had accidentally saved it with ANSI encoding. Changing the encoding to UTF-8 fixed it for me. @stereobooster's patch would be very helpful though.
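If re-saving in an editor is not convenient, the same conversion can be sketched in Ruby. This assumes "ANSI" means Windows-1252, which is common on Western-European Windows setups but depends on the system code page; `ansi_to_utf8` is a hypothetical helper.

```ruby
# Hypothetical helper: reinterpret bytes that were saved as Windows-1252
# and transcode them to UTF-8.
def ansi_to_utf8(bytes)
  bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
end

ansi_bytes = [0x66, 0xFC, 0x72].pack("C*")  # "für" as saved in Windows-1252
fixed = ansi_to_utf8(ansi_bytes)
```

Writing `fixed` back to the file in binary mode would give a post that Ruby 1.9 can read as UTF-8.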

I'm still getting errors, and they started out of nowhere:

/Users/kevinsuttle/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/jekyll-0.11.2/lib/jekyll/convertible.rb:29:in `read_yaml': invalid byte sequence in UTF-8 (ArgumentError)

This isn't new by the way. See issues 117, 188, 493, 135.

Merged in #718.

Liquid Exception: invalid byte sequence in UTF-8 in index.html