Add encoding detection to WET text extraction
Closed this issue · 1 comment
sebastian-nagel commented
The WET text extraction always assumes UTF-8. It should rely on robust charset detection instead. See the discussions ".wet file encoding" and "problem with East European encodings in WET files".
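To illustrate the idea, here is a minimal sketch in Python of decoding a payload without assuming UTF-8. It is not the extractor's actual code: the names `candidate_charsets` and `decode_payload`, the order of the checks (BOM, HTTP header, HTML meta tag, strict UTF-8) and the `windows-1252` fallback are all assumptions made for this example. It only honours declared charsets; a robust solution would additionally run a statistical detector over the bytes, since declarations are often missing or wrong.

```python
import codecs
import re

# Hypothetical helpers for illustration; not taken from the WET extractor.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)
_HEADER_CHARSET = re.compile(
    r'charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

# Byte order marks and the codec that consumes them.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def _known(label):
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def candidate_charsets(payload, content_type=None):
    """Yield charsets to try, most trustworthy first (assumed order)."""
    for bom, name in _BOMS:
        if payload.startswith(bom):
            yield name
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match and _known(match.group(1)):
            yield _known(match.group(1))
    # Only scan the start of the document for a <meta> declaration.
    match = _META_CHARSET.search(payload[:4096])
    if match and _known(match.group(1).decode("ascii")):
        yield _known(match.group(1).decode("ascii"))
    yield "utf-8"


def decode_payload(payload, content_type=None, fallback="windows-1252"):
    """Decode bytes to text, returning (text, charset_used)."""
    for charset in candidate_charsets(payload, content_type):
        try:
            return payload.decode(charset), charset
        except UnicodeDecodeError:
            continue
    # Last resort: never fail, replace undecodable bytes.
    return payload.decode(fallback, errors="replace"), fallback
```

For example, `decode_payload("zażółć".encode("iso-8859-2"), "text/html; charset=ISO-8859-2")` decodes the Polish text correctly, whereas a strict UTF-8 decode of the same bytes would fail.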
sebastian-nagel commented
Tested with a set of sample WARC files (wet_encoding_test.zip) containing Japanese, Polish, Czech, Hungarian, Russian, Turkish and German pages in various encodings.