cChardet returns {'encoding': None, 'confidence': None} on very large file
AEHamrick opened this issue · 1 comment
OS/Arch
Python version
cChardet version
What is the problem?
When passing a very large file to the .feed() function line by line, cchardet is unable to determine the encoding.
head -X of my file shows that it's tab-delimited text, while file -i reports application/octet-stream; charset=binary. This may be my issue, but I've read that the file command is bad at determining encoding, so I'm not sure.
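One possible reason for a binary verdict on a file that looks like text is a byte-order mark or embedded NUL bytes near the start. The sketch below is only a quick diagnostic under that assumption; sniff_head and its size parameter are made-up names for illustration, not part of cchardet.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_head(path, size=4096):
    """Report any BOM and the NUL-byte count in the first `size` bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(size)
    # First matching BOM wins; None means no known BOM was found.
    bom = next((name for mark, name in _BOMS if head.startswith(mark)), None)
    return {"bom": bom, "nul_bytes": head.count(b"\x00"), "bytes_read": len(head)}
```

A non-zero nul_bytes count or a UTF-16/UTF-32 BOM would suggest the file is not a single-byte encoding, which might be relevant to the result expected below.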
Expected behavior
cchardet should return some non-None, non-Unicode result (e.g., ascii or win-1252), either before the end of the file or once it finishes.
Actual behavior
cchardet seemingly consumes the whole file but in the end returns {'encoding': None, 'confidence': None}.
Steps to reproduce the behavior
Sample data cannot be provided due to PII in the file, but I'm using this form:
import pathlib

import cchardet as chardet

target = pathlib.Path('\\\\path\\to\\file')
detector = chardet.UniversalDetector()
detector.reset()
i = 0
with open(target, "rb") as f:
    print(f'Reading {target}')
    for row in f:
        result = detector.feed(row)
        i += 1
        if i % 10000 == 0:
            print(f'Line {i}')
        if detector.done:
            break
print(detector.result)
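One thing that may be worth checking in the form above: with chardet's incremental interface, the result is finalized by calling close() after the last feed(), and the snippet reads .result without doing so. A minimal sketch of that variant follows, assuming cchardet's UniversalDetector mirrors chardet's feed/close/done/result interface. detect_encoding and chunk_size are hypothetical names, and the detector is passed in so that any object with that interface can be used.

```python
def detect_encoding(path, detector, chunk_size=64 * 1024):
    """Feed a file to a chardet-style detector in chunks and return its result.

    `detector` is assumed to expose feed(), close(), done and result,
    e.g. an instance of cchardet.UniversalDetector().
    """
    with open(path, "rb") as f:
        # Fixed-size chunks avoid relying on newline positions, which a
        # binary-looking file may not contain at regular intervals.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:
                break
    # close() signals end of input so the detector can settle on a guess;
    # reading .result before this point may yield an empty result.
    detector.close()
    return detector.result

# Example call (assumes cchardet is installed):
#   import cchardet
#   print(detect_encoding(target, cchardet.UniversalDetector()))
```

If the result is still None after close(), the detector may simply have no candidate for the data, which would point back at the file contents rather than at the calling code.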
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.