simonw/files-to-prompt

Don't crash on non-unicode files

simonw opened this issue · 3 comments

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 10

Got this error running against a folder with a binary in it.

>>> import pdb
>>> pdb.pm()
> /opt/homebrew/Caskroom/miniconda/base/lib/python3.10/codecs.py(322)decode()
-> (result, consumed) = self._buffer_decode(data, self.errors, final)
(Pdb) u
> /Users/simon/.local/pipx/venvs/files-to-prompt/lib/python3.10/site-packages/files_to_prompt/cli.py(66)process_path()
-> file_contents = f.read()
(Pdb) list
 61  	                ]
 62  	
 63  	            for file in files:
 64  	                file_path = os.path.join(root, file)
 65  	                with open(file_path, "r") as f:
 66  ->	                    file_contents = f.read()
 67  	
 68  	                click.echo(file_path)
 69  	                click.echo("---")
 70  	                click.echo(file_contents)
 71  	                click.echo()

Easiest option: skip files that cannot be decoded as UTF-8 (maybe showing a warning).
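A minimal sketch of that skip-and-warn approach, using only the standard library. The helper names `read_text_or_skip` and `collect_readable_files` are hypothetical and the warning wording is an assumption; in the real CLI the warning would presumably go through `click.echo(..., err=True)` rather than `print`.

```python
import os
import sys
import tempfile


def read_text_or_skip(file_path):
    """Return the file's text, or None if it is not valid UTF-8 (hypothetical helper)."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        # Warn on stderr so the warning never mixes with the prompt output on stdout.
        # With click this would be click.echo(message, err=True).
        print(
            f"Warning: Skipping file {file_path} due to UnicodeDecodeError",
            file=sys.stderr,
        )
        return None


def collect_readable_files(path):
    """Walk a directory and return (file_path, contents) pairs for decodable files."""
    results = []
    for root, dirs, files in os.walk(path):
        for file in sorted(files):
            file_path = os.path.join(root, file)
            contents = read_text_or_skip(file_path)
            if contents is not None:
                results.append((file_path, contents))
    return results


# Demo: one text file and one file that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "ok.txt"), "w", encoding="utf-8") as f:
        f.write("hello")
    with open(os.path.join(tmp, "binary.bin"), "wb") as f:
        f.write(b"\xff")
    found = collect_readable_files(tmp)
    names = [os.path.basename(p) for p, _ in found]
```

Passing `encoding="utf-8"` explicitly is a choice made for this sketch so that the behaviour does not depend on the platform's default encoding.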

But what if users want to run this against files with different encodings? For the moment I'll leave them to convert those files themselves. Future releases might add some kind of option for specifying an encoding.

Easy way to replicate this problem in the files-to-prompt checkout itself:

python -m pip install build
python -m build
files-to-prompt .

It crashes on the binary wheel that was built and dropped into dist/.
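The same class of error can also be triggered without building a wheel. This is a sketch with made-up bytes (not the actual wheel contents), arranged so that the invalid byte 0xdd lands at position 10 as in the traceback above:

```python
# Ten ASCII bytes, then 0xdd. 0xdd announces a two-byte UTF-8 sequence,
# but the "A" that follows is not a continuation byte, so decoding fails.
data = b"0123456789\xddA"

error = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    error = e  # keep a reference; "e" itself is cleared after the except block

print(error)
```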

files-to-prompt files_to_prompt/cli.py | llm -m opus --system \
  'catch unicodedecodeerror reading the file and output a click warning about the file, skipping it and moving on'

Took a few follow-ups:

llm -c 'remember to use err=True on those click echo lines'
llm -c 'How would I show those in a different color?'

https://gist.github.com/simonw/9b83f42a1b87d3fcb3b4b8e6f482af38

Then to get it to write the tests:

git diff > diff.txt
files-to-prompt diff.txt tests/test_files_to_prompt.py | llm -m opus -s \
  'output one more test that can exercise the new code that writes warnings about binary files'
llm -c 'modify that test to capture stdout and stderr separately and check for the message in stderr'
llm -c 'ValueError: stderr not separately captured'
llm -c "TypeError: CliRunner.__init__() got an unexpected keyword argument 'stderr'"
# I had to give it a clue:
llm -c 'Use CliRunner(mix_stderr=False)'

https://gist.github.com/simonw/511e1dbede6aba25b2d7027c55cdf759

The test it added failed: the binary string it had tried writing, b"\x00\x01\x02\x03\x04\x05", consists entirely of ASCII control characters, so it decodes as UTF-8 without error. I switched that out for b"\xff" instead, which is never valid UTF-8.
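A quick way to check which byte strings make useful test fixtures here (the helper name `is_valid_utf8` is hypothetical):

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes 0x00-0x05 are ASCII control characters, and ASCII is a subset of UTF-8.
low_bytes = b"\x00\x01\x02\x03\x04\x05"
# 0xff can never appear in well-formed UTF-8.
high_byte = b"\xff"

print(is_valid_utf8(low_bytes))
print(is_valid_utf8(high_byte))
```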