Java text encoding detection library

Decodetect

Decodetect is a text encoding detection library designed to support encodings that many other libraries don't. It contains the infrastructure to train and test custom models, and everything is written in pure Java to maximize portability.

Models encode byte bigram frequency counts. At runtime, input data is converted to this same byte bigram frequency format and compared with the trained models via cosine similarity.
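To make the comparison concrete, here is a minimal sketch of byte bigram counting and cosine similarity. It is not Decodetect's actual implementation: the class name `BigramCosineSketch`, the method names, and the dense 65,536-slot count array are all assumptions chosen for illustration, and the "models" are built from a single sentence rather than trained data.

```java
import java.nio.charset.StandardCharsets;

public class BigramCosineSketch {

    /** Counts every adjacent byte pair; the array index is (first << 8) | second. */
    public static long[] bigramCounts(byte[] data) {
        long[] counts = new long[256 * 256];
        for (int i = 0; i + 1 < data.length; i++) {
            int first = data[i] & 0xFF;       // mask so the byte is treated as unsigned
            int second = data[i + 1] & 0xFF;
            counts[(first << 8) | second]++;
        }
        return counts;
    }

    /** Cosine similarity of two count vectors; returns 0.0 if either vector is all zeros. */
    public static double cosineSimilarity(long[] a, long[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy "models": bigram counts of one sentence in two encodings (illustrative only)
        String text = "the quick brown fox jumps over the lazy dog";
        long[] utf8Model = bigramCounts(text.getBytes(StandardCharsets.UTF_8));
        long[] utf16Model = bigramCounts(text.getBytes(StandardCharsets.UTF_16LE));

        // Input of unknown encoding, converted to the same bigram format
        long[] input = bigramCounts("the lazy dog".getBytes(StandardCharsets.UTF_8));

        System.out.println("UTF-8 model:    " + cosineSimilarity(input, utf8Model));
        System.out.println("UTF-16LE model: " + cosineSimilarity(input, utf16Model));
    }
}
```

In this toy setup the UTF-8 input shares bigrams only with the UTF-8 model, because every UTF-16LE bigram of ASCII text contains a zero byte. The encoding whose model scores highest would be reported as the best match.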

The training data for the distributed model is gathered from Wikipedia (see the train module). You can also supply your own training data to train a more specialized model.

Usage

Decodetect is available on Maven Central.

Using Decodetect involves simply creating an instance of Decodetect and then passing a byte[] to getResults():

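import java.nio.charset.Charset;
import java.nio.file.Files;

// Read the raw, undecoded bytes of the file
byte[] rawBytes = Files.readAllBytes(somePath);

// Results are ordered best match first, so take the top one
Decodetect decodetect = new Decodetect();
DecodetectResult topResult = decodetect.getResults(rawBytes).get(0);
Charset detectedCharset = topResult.getEncoding();

// Decode the bytes using the detected charset
String decoded = new String(rawBytes, detectedCharset);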

Each DecodetectResult contains a confidence number in addition to the Charset itself. This is a measure of how closely the input bytes resemble the model trained on that encoding. For most use cases, one can just use the first item in the result list.
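When the top result's confidence is low, a caller may prefer a default charset over a weak guess. The sketch below illustrates that pattern under stated assumptions: `Candidate` is a hypothetical stand-in for a detection result (the accessor Decodetect uses for the confidence number is not shown here), the list is assumed to be ordered best match first, and the threshold of 0.5 is an arbitrary value, not a recommendation from the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ConfidenceFallbackSketch {

    /** Hypothetical stand-in for a detection result: a charset plus a confidence score. */
    public static final class Candidate {
        final Charset charset;
        final double confidence;

        public Candidate(Charset charset, double confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the top candidate's charset if its confidence reaches minConfidence,
     * otherwise the fallback. Assumes the list is ordered best match first.
     */
    public static Charset pickCharset(List<Candidate> ranked, double minConfidence, Charset fallback) {
        if (ranked.isEmpty() || ranked.get(0).confidence < minConfidence) {
            return fallback;
        }
        return ranked.get(0).charset;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only
        List<Candidate> ranked = Arrays.asList(
                new Candidate(StandardCharsets.UTF_16LE, 0.91),
                new Candidate(StandardCharsets.UTF_8, 0.12));

        Charset chosen = pickCharset(ranked, 0.5, StandardCharsets.UTF_8);
        System.out.println("Chosen charset: " + chosen);
    }
}
```

A suitable threshold depends on the input: short inputs yield fewer bigrams and so tend to produce less reliable scores.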

Supported Encodings

Decodetect supports a myriad of encodings for many languages. The bundled model has specific encodings for each language, but all languages support the following encodings as well:

  • UTF-7
  • UTF-8
  • UTF-16 BE
  • UTF-16 LE
  • UTF-32 BE
  • UTF-32 LE

For more information on the encodings and languages supported by Decodetect, see Encodings.java.

Project Structure

Decodetect can be built with Maven. The modules are as follows:

  • core: Contains runtime dependencies
  • train: For downloading training data and training models

Dependencies

Runtime:

Training:

Testing:

About

Decodetect was written by Ethan Roseman and uses the MIT license. See the license for more information.