encoding-recurrent: A Python repository from robert-d-schultz

PyTorch LSTM neural networks that classify byte sequences as either UTF-8 or Windows-1252.

A training and an evaluation set was created from the News Crawl 2009 corpus (available http://www.statmt.org/wmt11/translation-task.html). This corpus is created from English-language newswire data.

Chardet (version 4.0.0), an existing Python character encoding detector, was used for comparison.

Select results:

32 hidden unit model accuracy: 0.9770

Chardet accuracy: 0.7340

robert-d-schultz/encoding-recurrent