jawah/charset_normalizer

Q&A: truncated binary as input to detection

Closed this issue · 5 comments

Hi, congrats on a great library; it is a very nice improvement in both accuracy and performance over the classic chardet.

This is not a bug nor a feature request; it is more of a "usage question".

Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform the inference. This avoids unnecessary I/O on file sniffing before the real file load. As an example:

import charset_normalizer

# Sniff only the first N bytes instead of reading the whole file.
with open(file_path, 'rb') as raw_data:
    bin_data = raw_data.read(n_bytes_to_sniff_encoding)

best_detection_result = charset_normalizer.from_bytes(bin_data).best()
encoding = best_detection_result.encoding

The question is simple, and I did not manage to find any reference regarding it:

What happens if charset_normalizer.from_bytes is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character?

It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).

Also, is there a threshold on the number of bytes used for inference beyond which accuracy is not consistently improved, and which could be set as "optimal" for this encoding detection?

Hello,

Glad to hear it is satisfactory.

Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform the inference. This avoids unnecessary I/O on file sniffing before the real file load.

Good assertion.

What happens if charset_normalizer.from_bytes is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character?

Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without leaving the end user to handle a UnicodeDecodeError.

Short answer: it will say "not UTF-8", or return nothing.

We do not run the detection using all the bytes; the main algorithm runs on smaller chunks, so the performance concern should be minimal.
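
As a minimal sketch (reusing file_path from the example above; steps and chunk_size are the keyword arguments that drive the chunk sampling, and the values shown are believed to be the defaults, spelled out here only for illustration):

import charset_normalizer

# Read the whole file; the detection samples it in chunks rather than
# scanning every byte, so this stays cheap even for large payloads.
with open(file_path, 'rb') as raw_data:
    full_payload = raw_data.read()

# steps / chunk_size control the sampling; the values below are believed to
# be the library defaults and are shown only for illustration.
results = charset_normalizer.from_bytes(full_payload, steps=5, chunk_size=512)
best_detection_result = results.best()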

It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).

Yes, there is room for improvement in the following case: an incomplete byte sequence (truncated at the end).

Also, is there a threshold on the number of bytes used for inference beyond which accuracy is not consistently improved, and which could be set as "optimal" for this encoding detection?

For now, I recommend passing the whole content to avoid a broken byte sequence and keeping the default kwargs in from_bytes(...).

Do this instead:

from charset_normalizer import from_path

# Let the library read and sample the file itself.
guesses = from_path(file_path)

if guesses:
    best_detection_result = guesses.best()
    encoding = best_detection_result.encoding

    # The original bytes and the decoded text of the best match.
    payload = best_detection_result.raw
    string = str(best_detection_result)

Hope that answers your questions.

Thank you very much for the complete answer.

Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without leaving the end user to handle a UnicodeDecodeError.

For what it’s worth, one way to handle this (inside the library) might be to do all the speculative decoding like this:

from typing import Union


def decode_bytes(sequences: Union[bytes, bytearray], encoding: str, complete: bool):
    try:
        return str(sequences, encoding=encoding)
    except UnicodeDecodeError as error:
        # If the error is in the final code point, it might be incomplete.
        # Note: you could also check `"incomplete" in error.reason` to really know if
        # this is about an incomplete code point, but I think that might be likely to
        # break in other Python runtimes or future Python versions.
        if not complete and len(sequences) - error.start < max_bytes_per_point(encoding):
            # Forgive a truncated final code point: decode everything before it.
            return str(sequences[:error.start], encoding=encoding)
        # Anything else is a genuine decoding failure.
        raise

Or, a little fancier, integrated into Python's decoding system, and probably more performant:

import codecs

def ignore_incomplete_final_code_point(error):
    # Same notes as above about optionally checking `error.reason`.
    if (
        isinstance(error, UnicodeDecodeError)
        and len(error.object) - error.start < max_bytes_per_point(error.encoding)
    ):
        return ('', error.end)

    raise error

codecs.register_error('ignore_incomplete_final_code_point', ignore_incomplete_final_code_point)

# Now to decode a buffer that might end in the middle of a code point:
str(sequences, encoding=encoding, errors='ignore_incomplete_final_code_point')

Both of those assume you have a function called max_bytes_per_point() that gets the largest possible number of bytes per code point in a given encoding (e.g. max_bytes_per_point('big5') == 2, max_bytes_per_point('utf-8') == 4), but you could also replace that function call with 4 (it’d be slightly less accurate, but probably good enough).
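
For concreteness, here is a minimal sketch of what that hypothetical helper could look like, assuming a small hand-maintained table (illustrative only, not exhaustive) with 4 as the conservative fallback:

import codecs

# Hypothetical helper: largest number of bytes a single code point can occupy
# in a given encoding. The table is illustrative, not exhaustive; unknown
# codecs fall back to 4 as a conservative bound.
_MAX_BYTES_PER_POINT = {
    'ascii': 1,
    'iso8859-1': 1,   # latin-1
    'big5': 2,
    'gb2312': 2,
    'shift_jis': 2,
    'utf-16': 4,
    'utf-8': 4,
}

def max_bytes_per_point(encoding: str) -> int:
    # Normalize aliases (e.g. 'UTF8' -> 'utf-8') before looking up the table.
    canonical = codecs.lookup(encoding).name
    return _MAX_BYTES_PER_POINT.get(canonical, 4)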

Ousret commented

Yes, you've got part of the thinking right, but unfortunately it will require a lot more work.
We are working on a solution, but it takes time; it's halfway there.

It's been a while. After consideration, I cannot pursue this implementation due to the lack of available time.
If anyone wants to tackle this, I am okay with reviewing a PR that solves it.