cChardet returns {'encoding': None, 'confidence': None} on very large file

Question

AEHamrick opened this issue 4 years ago · 1 comment

AEHamrick commented 4 years ago

OS/Arch

Python version

cChardet version

What is the problem?

When passing a very large file to the .feed() method line by line, cchardet is unable to determine the encoding.

head -X of my file shows that it's tab-delimited text, while file -i reports application/octet-stream; charset=binary. This may be my issue, but I've read that file is bad at determining encoding, so I'm not sure.
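Since head and file disagree here, it can help to look at the raw leading bytes directly. The following is a minimal sketch, not part of cchardet: sniff_head is a hypothetical helper that reports whether a byte-order mark is present and how many NUL and non-ASCII bytes appear. NUL bytes in otherwise readable text are one common reason a tool classifies a file as binary (UTF-16 text without a BOM, for example), though that is only an assumption about this particular file.

```python
import codecs

# Byte-order marks from the standard library. UTF-32 LE is listed before
# UTF-16 LE because the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_head(head: bytes) -> dict:
    """Summarize encoding hints in the first bytes of a file (hypothetical helper)."""
    bom = next((name for mark, name in BOMS if head.startswith(mark)), None)
    return {
        "bom": bom,                                      # matching BOM name, or None
        "nul_bytes": head.count(b"\x00"),                # > 0 often means UTF-16/32 or true binary
        "non_ascii_bytes": sum(b > 0x7F for b in head),  # 0 means the sample is pure ASCII
    }


# Usage, with `target` being the path from the report:
#   with open(target, "rb") as f:
#       print(sniff_head(f.read(4096)))
```

A sample with no BOM, no NUL bytes, and no non-ASCII bytes would suggest plain ASCII at the start of the file, in which case the binary classification probably comes from bytes further in.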

Expected behavior

cchardet should return some non-None, non-Unicode result (e.g., ascii or win-1252), either before the end of the file or once it finishes.

Actual behavior

cchardet seemingly consumes the whole file but in the end returns {'encoding': None, 'confidence': None}.

Steps to reproduce the behavior

Sample data cannot be provided due to PII in the file, but I'm using this form:

import pathlib
import cchardet as chardet

target = pathlib.Path('\\\\path\\to\\file')

detector = chardet.UniversalDetector()
detector.reset()

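with open(target, "rb") as f:
    print(f'Reading {target}')
    # Feed the file to the detector one line at a time.
    for i, row in enumerate(f, start=1):
        detector.feed(row)
        # Report progress every 10,000 lines.
        if i % 10000 == 0:
            print(f'Line {i}')
        # Stop early once the detector reports it is done.
        if detector.done:
            break
    print(detector.result)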
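A detail worth checking in the snippet above: detector.close() is never called before detector.result is read. In the documented usage of chardet's UniversalDetector, close() follows the last feed() call and finalizes the result. The sketch below assumes cchardet's UniversalDetector behaves the same way, which may or may not explain the None result for this particular file. The detect_encoding helper and its chunk_size parameter are hypothetical names, and reading fixed-size chunks instead of lines is a swapped-in choice so the loop does not depend on newline bytes being present.

```python
def detect_encoding(path, detector, chunk_size=1 << 20):
    """Feed the file at `path` to `detector` in chunks, finalize it, and return its result.

    `detector` is any object with feed(), close(), done, and result,
    such as a UniversalDetector instance (assumed, not verified here).
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:            # end of file
                break
            detector.feed(chunk)
            if detector.done:        # detector is already confident
                break
    # Finalize the detector before reading the result (assumption: as in
    # chardet, the result may stay empty until close() has been called).
    detector.close()
    return detector.result


# Usage, assuming cchardet is installed and `target` is the path from the report:
#   import cchardet as chardet
#   print(detect_encoding(target, chardet.UniversalDetector()))
```

If the result is still None after close(), the input itself is the more likely cause, and the file may genuinely contain bytes that the detector cannot match to any text encoding.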

Answer 1 · 2022-04-17T06:12:43.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.