commoncrawl/ia-web-commons

Add encoding detection to WET text extraction

Closed · 1 comment

The WET text extraction always assumes UTF-8; it should rely on robust charset detection instead. See the discussions ".wet file encoding" and "problem with East European encodings in WET files".
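To illustrate why a hard-coded UTF-8 assumption breaks on legacy encodings, here is a minimal sketch using only the JDK. It is not the detection code used in ia-web-commons: the class `CharsetFallbackSketch` and its methods are hypothetical names, and the strategy (strict UTF-8 first, then a declared or default charset) is a simplification. A real detector would also need to tell single-byte encodings apart, which this sketch does not attempt.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class CharsetFallbackSketch {

    // Decode strictly: throw instead of silently inserting U+FFFD.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Try UTF-8 first; if the bytes are not valid UTF-8, fall back to the
    // charset declared by the page (e.g. from HTTP headers or <meta>),
    // or to ISO-8859-1 when nothing was declared (an assumed default).
    static String decode(byte[] bytes, Charset declared) {
        try {
            return decodeStrict(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: continue with the fallback below.
        }
        Charset fallback = (declared != null) ? declared : StandardCharsets.ISO_8859_1;
        return new String(bytes, fallback);
    }

    public static void main(String[] args) {
        // Polish sample text written with escapes to avoid source-encoding issues.
        String text = "za\u017C\u00F3\u0142\u0107";
        Charset latin2 = Charset.forName("ISO-8859-2");
        byte[] raw = text.getBytes(latin2);

        // Blind UTF-8 decoding garbles the text; the fallback recovers it.
        System.out.println(new String(raw, StandardCharsets.UTF_8).equals(text));
        System.out.println(decode(raw, latin2).equals(text));
    }
}
```

Valid UTF-8 input is returned unchanged by the first branch, so the fallback only applies to pages that would otherwise be garbled.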

Tested with a set of sample WARC files (wet_encoding_test.zip) covering Japanese, Polish, Czech, Hungarian, Russian, Turkish, and German in various encodings.