Don't crash on non-unicode files
simonw opened this issue · 3 comments
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 10
Got this error running against a folder with a binary in it.
>>> import pdb
>>> pdb.pm()
> /opt/homebrew/Caskroom/miniconda/base/lib/python3.10/codecs.py(322)decode()
-> (result, consumed) = self._buffer_decode(data, self.errors, final)
(Pdb) u
> /Users/simon/.local/pipx/venvs/files-to-prompt/lib/python3.10/site-packages/files_to_prompt/cli.py(66)process_path()
-> file_contents = f.read()
(Pdb) list
 61             ]
 62
 63         for file in files:
 64             file_path = os.path.join(root, file)
 65             with open(file_path, "r") as f:
 66  ->             file_contents = f.read()
 67
 68             click.echo(file_path)
 69             click.echo("---")
 70             click.echo(file_contents)
 71             click.echo()
Easiest option: silently ignore files that cannot be treated as UTF-8 (maybe showing a warning).
But what if users want to run this against files with different encodings? For the moment I'll leave them to convert those files themselves; future releases might add some kind of encoding option.
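A rough sketch of the "skip and warn" idea, using only the standard library so it runs on its own. The helper names `read_text_or_none` and `process_directory` are made up for illustration and are not the names used in cli.py, and the warning text is only an example. The real tool would presumably use click.echo for output rather than print.

```python
import os
import sys
import tempfile


def read_text_or_none(file_path):
    """Return the file's text, or None if it cannot be decoded as UTF-8."""
    try:
        # Explicit encoding so the result does not depend on the platform default.
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        # Warn on stderr so the warning never ends up inside the prompt on stdout.
        print(f"Warning: skipping non-UTF-8 file {file_path}", file=sys.stderr)
        return None


def process_directory(directory):
    """Walk a directory and return (path, contents) pairs for decodable files."""
    results = []
    for root, _dirs, files in os.walk(directory):
        for file in sorted(files):
            file_path = os.path.join(root, file)
            contents = read_text_or_none(file_path)
            if contents is not None:
                results.append((file_path, contents))
    return results


if __name__ == "__main__":
    # Small demo: one text file and one file containing an invalid UTF-8 byte.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "ok.txt"), "w", encoding="utf-8") as f:
            f.write("hello")
        with open(os.path.join(tmp, "blob.bin"), "wb") as f:
            f.write(b"\xff")
        for path, contents in process_directory(tmp):
            print(path)
            print("---")
            print(contents)
```

Returning None for undecodable files keeps the loop simple: the caller skips anything that is not text, and the rest of the output is unchanged.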
Easy way to replicate this problem in the files-to-prompt checkout itself:
python -m pip install build
python -m build
files-to-prompt .
It crashes on the binary wheel that was built and dropped into dist/.
files-to-prompt files_to_prompt/cli.py | llm -m opus --system \
'catch unicodedecodeerror reading the file and output a click warning about the file, skipping it and moving on'
Took a few follow-ups:
llm -c 'remember to use err=True on those click echo lines'
llm -c 'How would I show those in a different color?'
https://gist.github.com/simonw/9b83f42a1b87d3fcb3b4b8e6f482af38
Then to get it to write the tests:
git diff > diff.txt
files-to-prompt diff.txt tests/test_files_to_prompt.py | llm -m opus -s \
'output one more test that can exercise the new code that writes warnings about binary files'
llm -c 'modify that test to capture stdout and stderr separately and check for the message in stderr'
llm -c 'ValueError: stderr not separately captured'
llm -c "TypeError: CliRunner.__init__() got an unexpected keyword argument 'stderr'"
# I had to give it a clue:
llm -c 'Use CliRunner(mix_stderr=False)'
https://gist.github.com/simonw/511e1dbede6aba25b2d7027c55cdf759
The test it added failed, because it turned out it had tried writing a binary string b"\x00\x01\x02\x03\x04\x05" which decoded as utf-8. I switched that out for \xff instead.
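A quick way to check which byte strings make usable "binary" test fixtures. Bytes in the range 0x00 to 0x7F are each a valid single-byte UTF-8 sequence, so low control bytes decode without error, while 0xFF never appears in valid UTF-8. The helper `is_valid_utf8` is a name made up for this sketch.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Low control bytes are plain ASCII, so they decode fine and trigger no warning.
assert is_valid_utf8(b"\x00\x01\x02\x03\x04\x05")

# 0xFF is not a legal byte anywhere in UTF-8, so it reliably raises the error.
assert not is_valid_utf8(b"\xff")
```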