PyYoshi/cChardet

cChardet returns {'encoding': None, 'confidence': None} on very large file

AEHamrick opened this issue · 1 comments

OS/Arch


Python version

cChardet version

What is the problem?

When a very large file is passed to .feed() line by line, cchardet is unable to determine the encoding.

head -X on my file shows that it's tab-delimited text, while file -i reports application/octet-stream; charset=binary. That may be my real issue, but I've read that file is unreliable at determining encodings, so I'm not sure.
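
To narrow down why file might call this binary, it can help to count which kinds of bytes appear near the start of the file. A minimal sketch, assuming that NUL or other control bytes are what triggers the binary classification (a guess about file's heuristics, not something confirmed here); summarize_bytes is a hypothetical helper name:

```python
from collections import Counter


def summarize_bytes(sample: bytes) -> dict:
    """Count NUL, other control, high (>= 0x80), and plain text bytes."""
    counts = Counter()
    for b in sample:
        if b == 0x00:
            counts["nul"] += 1
        elif b < 0x20 and b not in (0x09, 0x0A, 0x0D):
            # Control bytes other than tab, LF, and CR
            counts["control"] += 1
        elif b >= 0x80:
            counts["high"] += 1
        else:
            counts["text"] += 1
    return {key: counts[key] for key in ("nul", "control", "high", "text")}
```

Calling it on the first few kilobytes read in "rb" mode gives a rough picture: many NUL bytes next to readable text might suggest a two-byte encoding, while a handful of stray control bytes might be enough on their own to make file report binary.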

Expected behavior

cchardet should return some non-None, non-Unicode result (e.g., ascii or win-1252), either before the end of the file or once it finishes.

Actual behavior

cchardet seemingly consumes the whole file but in the end returns {'encoding': None, 'confidence': None}.

Steps to reproduce the behavior

Sample data cannot be provided due to PII in the file, but I'm using this form:

import pathlib
import cchardet as chardet

target = pathlib.Path('\\\\path\\to\\file')

detector = chardet.UniversalDetector()
detector.reset()

with open(target, "rb") as f:
    print(f'Reading {target}')
    # Feed the file line by line, reporting progress every 10,000 lines.
    for i, row in enumerate(f, start=1):
        detector.feed(row)
        if i % 10000 == 0:
            print(f'Line {i}')
        if detector.done:
            break
    print(detector.result)
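
One thing that may be worth ruling out: the snippet above reads detector.result without calling detector.close() first. The usage example in the cChardet README calls close() after the feed loop, and as far as I can tell the detector only finalizes its guess at that point unless it becomes confident earlier. A minimal sketch of that variant follows. The helper names detect_stream and detect_file are hypothetical, and whether this resolves the result for this particular file is untested.

```python
from typing import Iterable


def detect_stream(detector, chunks: Iterable[bytes]) -> dict:
    """Feed chunks to a UniversalDetector-style object and return its result.

    `detector` is assumed to expose feed(), close(), done, and result,
    as cchardet.UniversalDetector does.
    """
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:  # the detector is already confident; stop early
            break
    detector.close()  # finalize the guess before reading the result
    return detector.result


def detect_file(path) -> dict:
    """Hypothetical convenience wrapper; requires cchardet to be installed."""
    import cchardet  # imported lazily so detect_stream works without it

    with open(path, "rb") as f:
        return detect_stream(cchardet.UniversalDetector(), f)
```

detect_stream takes the detector as an argument so the loop can be exercised with a stand-in object, and detect_file(target) would be the drop-in replacement for the loop above.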

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.