Raise on character encoding errors
burlesona opened this issue · 2 comments
I've been using Reverse Markdown and it works great most of the time. I've run into one issue that I thought I'd get your opinion on.
Sometimes the HTML documents I'm converting have character encoding problems, leading to the dreaded ArgumentError: invalid byte sequence in UTF-8.
In other places I'm fixing this by coercing the lines of a file to UTF-8 as I read them. I've discovered that when you parse a line you can generally just call force_encoding on it, and that will handle typographic marks and whatnot pretty well, but occasionally you'll run into issues where it's not enough and you have to be more aggressive, i.e. the following:
def clean_line(line)
  # The encoding must be UTF-8; if non-UTF-8 bytes are encountered we remove them.
  # force_encoding only relabels the string and never validates it, so a bad
  # string doesn't blow up until you call something else on it...
  line.force_encoding("UTF-8").strip # strip will make this raise if it didn't work
rescue
  # ... in that case we want to selectively remove the offending characters.
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end
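For comparison, here is a minimal sketch of the same idea built on String#scrub (available since Ruby 2.1). The helper name scrub_line is hypothetical and not part of ReverseMarkdown. It checks validity explicitly with valid_encoding? instead of relying on a later call to raise, and it assumes the input is meant to be UTF-8.

```ruby
# Hypothetical helper, not part of ReverseMarkdown: returns a valid UTF-8
# copy of +line+ and leaves the caller's string untouched.
def scrub_line(line)
  # Relabel a copy as UTF-8; force_encoding changes the label, not the bytes.
  utf8 = line.dup.force_encoding(Encoding::UTF_8)
  # Nothing to repair if the bytes already form valid UTF-8.
  return utf8 if utf8.valid_encoding?
  # Drop only the invalid byte sequences; valid multibyte characters survive.
  utf8.scrub('')
end

# "\xC3\xA9" is a valid UTF-8 "é"; "\xFF" can never appear in UTF-8.
cleaned = scrub_line("caf\xC3\xA9 \xFF ok".b)
puts cleaned.valid_encoding?
```

One difference from clean_line worth noting: transcoding from 'binary' treats every byte above 0x7F as undefined, so the rescue branch there also removes valid non-ASCII characters such as "é", whereas scrub removes only the bytes that are actually invalid.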
I end up using this same code to scrub HTML before I enter it into ReverseMarkdown, but it would probably be more efficient to handle it inside the gem - and would save other people from this same headache.
Are you interested in handling encoding errors inside the gem? If yes, you can use that code, or I can try to circle back with a PR. If not, no worries, just thought it might be worth considering.
Thanks for a great gem!
Hey @burlesona,
Sorry for the late response!
It sure does sound like an interesting issue and might be worth solving within the gem, maybe with a flag to trigger it. Can you provide an example document that triggers the problem?
Thanks,
Jo
Hello @burlesona,
Please have a look at the PR and let me know what you think!