Add encoding detection to WET text extraction

Question

Closed this issue 8 years ago · 1 comment

The WET text extraction always assumes UTF-8; it should rely on robust charset detection instead. See the discussions ".wet file encoding" and "problem with East European encodings in WET files".
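
To illustrate the idea, here is a minimal sketch of decoding with a fallback instead of assuming UTF-8. It uses Python and only the standard library. The names `decode_payload` and `FALLBACK_ENCODINGS`, and the order of the candidates, are hypothetical and not taken from the actual WET extractor. A real implementation would more likely use a statistical charset detector than plain trial decoding.

```python
import codecs
from typing import Optional, Tuple

# Hypothetical candidate order (an assumption for this sketch): strict UTF-8
# first, then two common single-byte legacy encodings.
FALLBACK_ENCODINGS = ("utf-8", "windows-1250", "windows-1252")


def decode_payload(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Decode raw page bytes and return (text, encoding_used)."""
    # 1. A byte order mark is the strongest signal, so check it first.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig"), "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"

    # 2. Try the declared charset (e.g. from an HTTP header or <meta> tag),
    #    then the fallback candidates, decoding strictly each time.
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Bytes invalid for this encoding, or unknown charset name.
            continue

    # 3. Last resort: keep what can be decoded and mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Trial decoding cannot tell single-byte encodings apart reliably (almost any byte sequence decodes as windows-1250 or windows-1252), which is why a proper detector that scores candidates is the better choice for mixed-language crawl data.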

Answer 1 · 2016-11-24T11:01:39.000Z

Tested with a set of sample WARC files (wet_encoding_test.zip) containing Japanese, Polish, Czech, Hungarian, Russian, Turkish, and German content in various encodings.