[BUG] Basic encoding detection fails: standard Spanish text is detected as Chinese, Croatian and others
nelsoni-talentu opened this issue · 2 comments
Describe the bug
When passed a byte string encoded in a charset other than UTF-8, detect reports the wrong charset and the wrong language.
The behavior was found during QA testing of a book text extractor module.
To Reproduce
from charset_normalizer import detect
bytes_str = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último".encode("CP1252")
detect(bytes_str)
Output:
## First attempt
{'encoding': 'Big5', 'language': 'Chinese', 'confidence': 1.0}
## Third attempt
{'encoding': 'mac_latin2', 'language': 'Croatian', 'confidence': 1.0}
Expected behavior
{'encoding': 'ISO-8859-1', 'language': 'Spanish', 'confidence': 1.0}
Desktop (please complete the following information):
- OS: Windows 10 22H2
- Python version: 3.10
- Package version: 3.3.0
You mention a book but only gave us a single sentence.
The program does not yield different results across runs on the exact same bytes, so I suspect that you used different "sentences" or extracts from the same book.
For now, I suggest that you feed the detector a larger part of said book.
We cannot accurately propose an update without a sizeable sample file.
Without proper data, I have to close this issue in its current state.
Feel free to answer back when you can.
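One way to see why a larger sample was requested: a single short sentence can be a valid byte sequence under several codecs at once, so it gives a detector little to separate them. The sketch below uses only the Python standard library and does not call charset_normalizer. The helper name codecs_that_decode and the candidate list are assumptions made for illustration; they are not part of the library and say nothing about how it ranks candidates internally.

```python
# Hypothetical helper for illustration only; not part of charset_normalizer.
def codecs_that_decode(payload: bytes, candidates: list[str]) -> dict[str, str]:
    """Return the candidate codecs that decode `payload` without error."""
    decoded = {}
    for name in candidates:
        try:
            decoded[name] = payload.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence, so it is not a candidate.
            continue
    return decoded


text = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último"
payload = text.encode("cp1252")

# Candidate list is an assumption chosen to mirror the encodings named in the report.
candidates = ["cp1252", "latin_1", "mac_latin2", "big5", "utf_8"]
results = codecs_that_decode(payload, candidates)

for name, value in results.items():
    print(f"{name}: {value!r}")
```

For this sentence, cp1252 and latin_1 produce the same text because every accented character used here has the same byte value in both encodings, while utf_8 rejects the bytes outright. Any other codec that also decodes without error has to be told apart by how plausible its output looks, which is easier with more text.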