Decodetect is a text encoding detection library designed to support encodings that many other libraries don't. It contains the infrastructure to train and test custom models, and everything is written in pure Java to maximize portability.
Models encode byte bigram frequency counts. At runtime, input data is converted to this same byte bigram frequency format and compared with the trained models via cosine similarity.
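To make that concrete, here is a minimal sketch of the general approach described above: counting byte bigrams and scoring a candidate model with cosine similarity. The class and method names are illustrative only, not Decodetect's internal API.

```java
import java.util.HashMap;
import java.util.Map;

final class BigramSketch {
    // Count adjacent byte pairs; each key packs two unsigned bytes into one int.
    static Map<Integer, Integer> bigramCounts(byte[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < data.length; i++) {
            int key = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two sparse frequency vectors.
    static double cosine(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            Integer match = b.get(e.getKey());
            if (match != null) {
                dot += (double) e.getValue() * match;
            }
            normA += (double) e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Each trained model is such a frequency vector; the input's vector is scored against every model, and the encoding with the highest cosine similarity wins.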
The training data that creates the distributed model is gathered from Wikipedia (see the `train` module). However, it is also possible to supply your own training data and train a more specialized model.
Decodetect is available on Maven Central.
Using Decodetect involves simply creating an instance of `Decodetect` and then passing a `byte[]` to `getResults()`:
```java
import java.nio.charset.Charset;
import java.nio.file.Files;

byte[] rawBytes = Files.readAllBytes(somePath);
Decodetect decodetect = new Decodetect();
// Results are ordered by confidence, so the first entry is the best match.
DecodetectResult topResult = decodetect.getResults(rawBytes).get(0);
Charset detectedCharset = topResult.getEncoding();
String decoded = new String(rawBytes, detectedCharset);
```
Each `DecodetectResult` contains a confidence value in addition to the `Charset` itself. The confidence measures how closely the input bytes match the model trained for that encoding. For most use cases, you can simply take the first item in the result list.
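The exact accessor for the confidence value is an assumption below (the source only says each result carries a confidence number), so check `DecodetectResult`'s actual API; a sketch of inspecting the full result list:

```java
for (DecodetectResult result : decodetect.getResults(rawBytes)) {
    // getConfidence() is assumed here; the real accessor name may differ.
    System.out.printf("%s -> %.3f%n", result.getEncoding(), result.getConfidence());
}
```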
Decodetect supports a wide range of encodings for many languages. The bundled model has language-specific encodings for each language, but all languages also support the following encodings:
- UTF-7
- UTF-8
- UTF-16 BE
- UTF-16 LE
- UTF-32 BE
- UTF-32 LE
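As a quick illustration of the Unicode coverage above, detection works the same way on, say, UTF-16 LE input. The sample text here is arbitrary, and the exact top result will depend on the input and the model; this only shows the API shape:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Encode sample text as UTF-16LE, then detect from the raw bytes alone.
byte[] utf16Bytes = "några svenska tecken".getBytes(StandardCharsets.UTF_16LE);
Charset detected = new Decodetect().getResults(utf16Bytes).get(0).getEncoding();
```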
For more information on the encodings and languages supported by Decodetect, see `Encodings.java`.
Decodetect can be built simply with Maven. The modules are as follows:

- `core`: contains the runtime code
- `train`: for downloading training data and training models
Runtime:

Training:

- gson for parsing JSON to extract text from Wikipedia (Apache 2.0)

Testing:
Decodetect was written by Ethan Roseman and is released under the MIT license. See the license for more information.