[BUG] Basic encoding detection failing: standard Spanish text detected as Chinese, Croatian and others

Question

nelsoni-talentu opened this issue a year ago · 2 comments

Describe the bug
When `detect` is passed a byte string encoded in a charset other than UTF-8, it reports the wrong charset and the wrong language.
The behavior was found during QA testing of a book text extractor module.

To Reproduce

from charset_normalizer import detect

# Spanish sample sentence, encoded as CP1252 rather than UTF-8
bytes_str = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último".encode("CP1252")
print(detect(bytes_str))

Output:

## First attempt
{'encoding': 'Big5', 'language': 'Chinese', 'confidence': 1.0} 

## Third attempt
{'encoding': 'mac_latin2', 'language': 'Croatian', 'confidence': 1.0}

Expected behavior

{'encoding': 'ISO-8859-1', 'language': 'Spanish', 'confidence': 1.0}
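
A note on the expected value: the sample is encoded as CP1252, while the expected result names ISO-8859-1. For this particular sentence the two are indistinguishable, because every accented character in it maps to the same byte in both charsets. The minimal sketch below uses only the standard library (not charset_normalizer) to check that; the variable names are illustrative.

```python
# The sentence from the report, encoded two ways.
text = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último"

cp1252_bytes = text.encode("cp1252")
latin1_bytes = text.encode("latin-1")  # Python's name for ISO-8859-1

# Count the bytes outside the ASCII range; these are the only bytes
# a detector can use to tell one single-byte charset from another.
non_ascii = sum(1 for b in cp1252_bytes if b >= 0x80)

print("identical bytes:", cp1252_bytes == latin1_bytes)
print("total bytes:", len(cp1252_bytes), "non-ASCII bytes:", non_ascii)
```

Only a handful of bytes in the sample carry any charset information, so either CP1252 or ISO-8859-1 would be a reasonable answer here.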

Desktop (please complete the following information):

OS: Windows 10 22H2
Python version: 3.10
Package version: 3.3.0

Answer 1 · 2023-10-19T05:58:58.000Z

You mention a book but only give us a single sentence.
The program does not yield different results across runs on the exact same bytes, so I suspect that you used different "sentences" or extracts from the same book.
For now, I suggest that you feed the detector a larger part of said book.

We cannot accurately propose an update without a significant file.
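
To illustrate why a single sentence is a weak basis for detection, here is a rough sketch using only Python's built-in codecs. It is not how charset_normalizer works internally; it only shows which charsets can decode the reported bytes without error. The candidate list and the helper name `decodable_with` are assumptions made for this example.

```python
# The bytes from the report: a short Spanish sentence encoded as CP1252.
SAMPLE = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último".encode("cp1252")

# Hypothetical candidate list, chosen to include the charsets named in this issue.
CANDIDATES = ["utf-8", "cp1252", "latin-1", "mac_latin2", "big5"]


def decodable_with(data, codecs):
    """Return {codec name: decoded text} for every codec that accepts the bytes."""
    accepted = {}
    for name in codecs:
        try:
            accepted[name] = data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects at least one byte sequence in the sample.
            pass
    return accepted


results = decodable_with(SAMPLE, CANDIDATES)
for name, decoded in results.items():
    print(f"{name:10} -> {decoded!r}")
```

Every charset that survives this check is a structurally valid reading of the bytes, so a detector has to rank them on statistical grounds, and a sample with only a few non-ASCII bytes gives it little to rank with. A longer extract from the book should narrow the field considerably.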

Answer 2 · 2023-10-31T20:20:01.000Z

Without proper data, I have to close this issue in its current state.
Feel free to reply when you can provide it.