Scan is not getting completed for file containing unicode characters like - 北京朝阳区

Question

KBiru opened this issue 2 years ago · 4 comments

[This is a follow-up to issue #626.]

Hi Team,

The scan crashes on some files without producing any error or scan report. I went through the file line by line and found that if it contains a string such as 北京朝阳区, detect-secrets does not scan the file; it exits without any error. Is there a plugin or filter I should use to avoid this?
[Note: the file in question is known to contain a secret.]

In other words, are Unicode strings handled properly? Also, if I want to apply the should_exclude_secret filter with certain Unicode regexes, how do I add it to the transient settings?
So far I have not been able to.

Using the detect-secrets Python package (Python 3.10)
detect-secrets version = 1.4.0
OS = Windows 10

Please let me know if you need any further information.

Thanks,
Bireswar

Answer 1 · 2023-03-22T20:17:00.000Z

@KBiru Hi. Thank you for reporting this. What is the file type of the file?

Answer 2 · 2023-03-23T03:55:50.000Z

The file is ordinary XML encoded as UTF-8. Any UTF-8 file containing characters like the ones I mentioned breaks the scan, because detect-secrets appears to decode files using only the default encoding of the OS it runs on. On Windows, for example, it tries to decode with cp1252 even though the file is UTF-8.
I think this is the cause: the tool does not detect the file's encoding before processing it and relies solely on the system default.
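The mismatch described above can be illustrated with plain Python, independent of detect-secrets. This is only a sketch of the suspected mechanism, not a confirmation of how the tool reads files; the helper name `try_decode` is made up for the example.

```python
# The address from the report, encoded the way a UTF-8 file stores it on disk.
raw = "北京朝阳区".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{encoding} failed at byte {exc.start}: {exc.reason}"


# UTF-8 round-trips cleanly.
print(try_decode(raw, "utf-8"))

# cp1252 has no character assigned to byte 0x9d, which is the last byte
# of 朝 (U+671D) in UTF-8, so decoding raises UnicodeDecodeError.
print(try_decode(raw, "cp1252"))
```

Any reader that opens the file in text mode without an explicit encoding on a cp1252 system would be expected to hit the same error.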

Answer 3 · 2023-03-29T16:33:18.000Z

@KBiru Can you share a snippet of this file? For example, trim it down and sanitize it so that it contains no sensitive information but still triggers the error, so that I can attempt to reproduce it.

Answer 4 · 2023-04-18T08:01:04.000Z

@jpdakran I tried to reproduce the earlier behavior, but it now gives the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 352: character maps to <undefined>

To be clear, the same content scans fine when the file encoding matches the system's default encoding, but fails when it differs.
To reproduce, paste the following content into a file saved as UTF-8 on a Windows system.
[Expected behavior: the scan fails, because the default encoding on Windows is cp1252, not UTF-8.]
```
+GSMSF18232735
  <td align="left" style="FONT-SIZE: 9px"><font size="-2"><a href="" target="URL">View URL</a></a></font></td>
  <td align="left"><font size="-1">勋</font></td>
  <td align="left"><font size="-1">张</font></td>
  <td align="left"><font size="-1">北京朝阳区</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1">8.0</font></td>
  <td align="left"><font size="-1">东莞市悠派智能展示科技有限公司</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1">茶山镇塘角工业区</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1">东莞市</font></td>
  <td align="left"><font size="-1">CN</font></td>
  <td align="left"><font size="-1">523382</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1">Professional Services</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1">Marketing</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1">No</font></td>
  <td align="left"><font size="-1">No</font></td>
  <td align="left"><font size="-1">No</font></td>
  <td align="left"><font size="-1">No</font></td>
  <td align="left"><font size="-1">Yes</font></td>
  <td align="left"><font size="-1">No</font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"> </font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  <td align="left"><font size="-1"></font></td>
  </tr>
```

What I wanted to ask is whether the tool detects file encoding dynamically, or whether this is not yet supported.
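
For reference, one common way to handle encoding more flexibly is to try UTF-8 first and fall back to the platform default only if that fails. The sketch below is a generic illustration of that approach, not detect-secrets code; the function name `read_text_utf8_first` is hypothetical.

```python
import locale
from pathlib import Path


def read_text_utf8_first(path: str) -> str:
    """Decode a file as UTF-8 (tolerating a BOM), else use the locale encoding."""
    data = Path(path).read_bytes()
    try:
        # "utf-8-sig" behaves like UTF-8 but also strips a leading BOM if present.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Fall back to the system default (cp1252 on many Windows setups).
        # errors="replace" keeps the scan going instead of raising again.
        fallback = locale.getpreferredencoding(False)
        return data.decode(fallback, errors="replace")
```

Separately, Python 3.7+ has a UTF-8 mode (`PYTHONUTF8=1` or `-X utf8`) that makes UTF-8 the default for `open()` on Windows. Whether enabling it avoids this particular failure in detect-secrets 1.4.0 is an assumption that would need to be tested.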