palewire/savepagenow

Force encoding to skip chardet

nyanbinaryneko opened this issue · 1 comment

I'm working on a rather large archival project. One bottleneck I've hit is chardet trying to guess the encoding when all I need is the returned URL. One way around this is to optionally set the encoding up front via Requests, so no guessing is needed. Here's an example of the log spam it produces:

2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: utf-8  confidence = 0.7525
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: SHIFT_JIS Japanese confidence = 0.01
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: EUC-JP Japanese confidence = 0.01
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: GB2312 Chinese confidence = 0.01
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: EUC-KR Korean confidence = 0.01
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: CP949 Korean confidence = 0.01
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: Big5 Chinese confidence = 0.01
2019-04-09 22:33:27 [chardet.charsetprober] DEBUG: EUC-TW not active

If all someone needs is the returned URL, the encoding-detection step in Requests can be skipped.
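
A rough sketch of what I mean, not a patch against savepagenow itself. Requests only falls back to charset detection when `response.encoding` is `None` at the moment `response.text` is read, so pinning the encoding first should avoid the chardet pass. The helper names `force_encoding` and `quiet_chardet` are made up for illustration, and the `SimpleNamespace` object is just a stand-in for a real `requests.Response` so the snippet runs without a network call. I haven't checked whether savepagenow reads `response.text` at all.

```python
import logging
from types import SimpleNamespace


def force_encoding(response, encoding="utf-8"):
    """Pin the encoding on a Requests-style response object.

    Hypothetical helper: with `encoding` already set, reading
    `response.text` should not need to guess the charset.
    """
    response.encoding = encoding
    return response


def quiet_chardet(level=logging.WARNING):
    """Raise the level of the `chardet` logger.

    Child loggers such as `chardet.charsetprober` inherit this level,
    so their DEBUG probe lines are dropped.
    """
    logging.getLogger("chardet").setLevel(level)


# Stand-in for a requests.Response with no declared charset (illustration only).
fake_response = SimpleNamespace(encoding=None, url="https://example.com/")

force_encoding(fake_response)
quiet_chardet()
```

With a real response, the same idea would be `response.encoding = "utf-8"` right after the request returns and before anything touches `response.text`. The logger tweak only hides the DEBUG lines; it does not remove the detection cost on its own.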

I don't know how to do this so I'm going to close the ticket. If you have a pull request you can make I'd consider it. Thanks.