rjust/defects4j

Patches cannot be decoded using utf-8

Closed this issue · 2 comments

I'm not able to read patches for Jsoup 52, Compress 7, and Lang 25 using utf-8. I get the following error:

  File "/home/tschweiz/.pyenv/versions/3.8.15/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 13981: invalid continuation byte

This only happens with these three patches; no other bug patch has this issue. Is this the intended behavior because these patches contain special characters, or is the encoding incorrect?

Replication

  1. Install the Python package unidiff (https://pypi.org/project/unidiff/)
  2. Read a patch with unidiff: PatchSet.from_filename("path/to/patch")

jose commented

Hi @Thomsch,

The bug-mining procedure collects the diffs as they were in the history of the repository (git, svn). Whether unidiff is or is not able to parse those out-of-the-box is out of Defects4J's scope, in my opinion.

--
Best,
Jose

Hi @jose, thank you for the reply. I think that is a fair assessment 👍.

In case someone runs into this in the future: we solved it by reading the diffs with the latin-1 encoding instead of utf-8.
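
For anyone who wants to keep utf-8 as the default and only fall back for the few affected patches, here is a minimal sketch of the latin-1 workaround. It uses only the standard library; the helper name `read_patch_text` and its `encodings` parameter are made up for illustration and are not part of Defects4J or unidiff. Byte 0xe4 is "ä" in latin-1, which is consistent with the error in the traceback, but I have not checked which encoding the original commits actually used.

```python
from pathlib import Path


def read_patch_text(path, encodings=("utf-8", "latin-1")):
    """Return the text of a patch file, trying each encoding in order.

    Hypothetical helper for illustration. latin-1 maps every byte to a
    character, so with the default tuple the fallback always succeeds.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # Bytes such as 0xe4 followed by a non-continuation byte
            # are invalid utf-8; try the next encoding instead.
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

The returned string can then be handed to whatever diff parser you use. With unidiff, passing the encoding directly, as in `PatchSet.from_filename(path, encoding="latin-1")`, should have the same effect, assuming your unidiff version accepts the `encoding` argument.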