jawah/charset_normalizer

[BUG] Basic encoding detection fails: standard Spanish text is detected as Chinese, Croatian and others

nelsoni-talentu opened this issue · 2 comments

Describe the bug
When passed a byte string encoded in a charset other than UTF-8, detect() reports the wrong charset and the wrong language.
This behavior was found during QA testing of a book text extractor module.

To Reproduce

from charset_normalizer import detect
bytes_str = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último".encode("CP1252")
detect(bytes_str)

Output:

## First attempt
{'encoding': 'Big5', 'language': 'Chinese', 'confidence': 1.0}

## Third attempt
{'encoding': 'mac_latin2', 'language': 'Croatian', 'confidence': 1.0}

Expected behavior

{'encoding': 'ISO-8859-1', 'language': 'Spanish', 'confidence': 1.0}

Desktop (please complete the following information):

  • OS: Windows 10 22H2
  • Python version: 3.10
  • Package version: 3.3.0

Ousret commented

You mention a book but only give us a single sentence.
The program does not yield different results across runs on the exact same bytes, so I suspect that you used different "sentences" or extracts from the same book.
For now, I suggest that you feed the detector a larger part of said book.
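
One way to check the point about repeated runs is to call the detector several times on the exact same bytes and count the distinct answers. The sketch below is only an illustration: `tally_results` is a hypothetical helper, not part of charset_normalizer, and it assumes the detector returns a dict with 'encoding' and 'language' keys, as in the output shown in the report.

```python
from collections import Counter
from typing import Callable


def tally_results(detect_fn: Callable[[bytes], dict], payload: bytes, runs: int = 5) -> Counter:
    """Run detect_fn on the same payload `runs` times and count each distinct answer.

    Hypothetical helper for reproducing the report; detect_fn is expected to
    return a dict with 'encoding' and 'language' keys.
    """
    seen: Counter = Counter()
    for _ in range(runs):
        result = detect_fn(payload)
        # Group by the (encoding, language) pair; confidence is ignored here.
        seen[(result.get("encoding"), result.get("language"))] += 1
    return seen
```

Calling `tally_results(detect, bytes_str)` with `detect` imported from `charset_normalizer` and `bytes_str` taken from the report should give a counter with a single key if the detector is stable on identical input. More than one key would point to the inputs differing between attempts.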

We cannot propose an accurate fix without a sufficiently large sample file.
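
To illustrate why a single sentence is a weak sample, the sketch below uses only the Python standard library (not charset_normalizer) to list which candidate codecs decode the reported bytes without error. The candidate list is an assumption chosen to match the encodings named in this issue, and `decodable_with` is a hypothetical helper.

```python
# The sentence from the report, encoded as in the reproduction steps.
SAMPLE = "Probar los siguientes términos: ñandú, ándale, ésta, índice, cargó, último".encode("cp1252")

# Assumed candidate list: the codecs mentioned in this issue, plus UTF-8.
CANDIDATES = ["cp1252", "latin_1", "mac_latin2", "big5", "utf_8"]


def decodable_with(payload: bytes, codecs: list) -> dict:
    """Return {codec_name: decoded_text} for every codec that decodes payload cleanly."""
    decoded = {}
    for name in codecs:
        try:
            decoded[name] = payload.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence, so it is not a candidate.
            continue
    return decoded


for name, text in decodable_with(SAMPLE, CANDIDATES).items():
    print(f"{name:>10}: {text}")
```

For this sentence, cp1252 and latin_1 produce identical text, because every accented letter in it has the same byte value in both encodings. Byte validity alone therefore cannot separate them, and a detector has to rely on statistics that become more reliable with a larger sample.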

Ousret commented

Without proper data, I have to close this issue in its current state.
Feel free to reply when you can.