Character encoding detection library for JRuby, built on ICU4J.
- JRuby in 1.9 mode (or above)
First, you'll need to require it:
require 'charlock_holmes'
contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100}
# optionally there will be a :language key as well, but
# that's mostly only returned for legacy encodings like ISO-8859-1
NOTE: CharlockHolmes::EncodingDetector.detect returns nil if it is unable to detect an encoding.
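Because detect can return nil, callers usually need a guard before reading :encoding. The following is a minimal sketch of one way to do that. The helper name encoding_or_default, the 'UTF-8' fallback, and the confidence threshold of 50 are assumptions made for illustration; none of them are part of charlock_holmes.

```ruby
# Hypothetical helper, not part of charlock_holmes.
# `detection` is whatever EncodingDetector.detect returned:
# a Hash such as {:encoding => 'UTF-8', :confidence => 100}, or nil.
def encoding_or_default(detection, default = 'UTF-8', min_confidence = 50)
  # detect returned nil: no encoding could be found
  return default if detection.nil?
  # Treat low-confidence guesses as "unknown" (the threshold is arbitrary)
  return default if detection[:confidence].to_i < min_confidence

  detection[:encoding]
end

encoding_or_default(nil)
# => "UTF-8"
encoding_or_default({ :encoding => 'ISO-8859-1', :confidence => 80 })
# => "ISO-8859-1"
```

In real use, the argument would be the result of CharlockHolmes::EncodingDetector.detect(contents).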
Alternatively, you can call the detect_encoding method on the String class:
require 'charlock_holmes/string'
contents = File.read('test.xml')
detection = contents.detect_encoding
# => {:encoding => 'UTF-8', :confidence => 100}
# this will detect and set the encoding of `contents`, then return self
contents.detect_encoding!
Detecting the encoding of arbitrary content is useful, but you will usually also want to transcode that content into the encoding your application uses.
content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
The first parameter is the content to transcode, the second is the source encoding (the encoding the content is assumed to be in), and the third parameter is the destination encoding.
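Detection and conversion can be combined into one step that also handles a nil detection result. The sketch below is only an illustration: the to_utf8 helper and its detector: and converter: keyword arguments are hypothetical. They are passed in so the logic can be exercised without JRuby. It relies only on the detect and convert calls shown above.

```ruby
# Hypothetical helper, not part of charlock_holmes.
# `detector` must respond to detect(content) and return a Hash or nil;
# `converter` must respond to convert(content, source, destination).
def to_utf8(content, detector:, converter:)
  detection = detector.detect(content)
  # No encoding found, so there is no source encoding to convert from
  return nil if detection.nil?

  # Argument order: content, source encoding, destination encoding
  converter.convert(content, detection[:encoding], 'UTF-8')
end

# Intended usage under JRuby with the gem installed:
#   require 'charlock_holmes'
#   utf8 = to_utf8(File.read('test2.txt'),
#                  detector: CharlockHolmes::EncodingDetector,
#                  converter: CharlockHolmes::Converter)
```

Returning nil when detection fails leaves the choice of fallback (skip the file, treat it as binary, assume a default encoding) to the caller.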
gem install charlock_holmes-jruby
The API and documentation largely mimic the C-extension version of CharlockHolmes.
Copyright 2013 Francis Chong
Licensed under the MIT License.