basnijholt/rsync-time-machine.py

UTF-8 encoding error on MacOS

uglygus opened this issue · 9 comments

Python 3.11.3
MacOS 13.3.1 (22E261) running on M1 chip

A source folder containing a single file with a utf-8 character Mîso.rtf

$ ls TEST-UTF8
Mîso.rtf

$ mkdir -p -- "DEST" ; touch "DEST/backup.marker"

$ ./rsync-time-machine.py-main/rsync_time_machine.py   TEST-UTF8 DEST
rsync-time-machine.py: No previous backup - creating new one.
rsync-time-machine.py: Creating destination DEST/2023-05-13-104204
rsync-time-machine.py: Starting backup...
rsync-time-machine.py: From: TEST-UTF8/
rsync-time-machine.py: To:   DEST/2023-05-13-104204/
rsync-time-machine.py: Running command:
rsync-time-machine.py: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/tbatters/.rsync-time-backup/2023-05-13-104204.log'  -- 'TEST-UTF8/' 'DEST/2023-05-13-104204/'
Traceback (most recent call last):
  File "/Users/tbatters/Downloads/./rsync-time-machine.py-main/rsync_time_machine.py", line 851, in <module>
    main()
  File "/Users/tbatters/Downloads/./rsync-time-machine.py-main/rsync_time_machine.py", line 833, in main
    backup(
  File "/Users/tbatters/Downloads/./rsync-time-machine.py-main/rsync_time_machine.py", line 794, in backup
    log_file = start_backup(
               ^^^^^^^^^^^^^
  File "/Users/tbatters/Downloads/./rsync-time-machine.py-main/rsync_time_machine.py", line 704, in start_backup
    run_cmd(cmd)
  File "/Users/tbatters/Downloads/./rsync-time-machine.py-main/rsync_time_machine.py", line 368, in run_cmd
    result = subprocess.run(
             ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 550, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1207, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 2113, in _communicate
    stdout = self._translate_newlines(stdout,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1084, in _translate_newlines
    data = data.decode(encoding, errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 45: invalid continuation byte

Thank you so much for trying this package and reporting an issue.

I have opened #2 with a test.

Unfortunately I cannot reproduce your problem at the moment om my Mac:

rsync-time-machine.py on  fix-non-utf8 via 🐍 v3.11.0
❯ ls test/TEST-UTF8
Mîso.rtf

rsync-time-machine.py on  fix-non-utf8 via 🐍 v3.11.0
❯ ./rsync_time_machine.py test backup_local
rsync-time-machine.py: Previous backup found - doing incremental backup from /Users/basnijholt/Code/rsync-time-machine.py/backup_local/2023-05-13-104921
rsync-time-machine.py: Creating destination backup_local/2023-05-13-104951
rsync-time-machine.py: Expiring backup_local//2023-05-11-230856
rsync-time-machine.py: Starting backup...
rsync-time-machine.py: From: test/
rsync-time-machine.py: To:   backup_local/2023-05-13-104951/
rsync-time-machine.py: Running command:
rsync-time-machine.py: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/basnijholt/.rsync-time-backup/2023-05-13-104951.log' --link-dest='/Users/basnijholt/Code/rsync-time-machine.py/backup_local/2023-05-13-104921' -- 'test/' 'backup_local/2023-05-13-104951/'
rsync-time-machine.py: Backup completed without errors.

rsync-time-machine.py on  fix-non-utf8 via 🐍 v3.11.0
❯ ls backup_local/2023-05-13-104951/TEST-UTF8
Mîso.rtf

I will keep investigating.

Could you retry with the latest main (I merged #2)?

Could you also run with -v (the verbose flag) then we will see the command it executes.

I can only recreate it when I create the ‘Mîso’ file using the Finder. The encoding is different when creating the file in the Finder vs the commandline. Detailed explanation in the first answer to this stack overflow question.

I have no idea why it is tripping up python though. This gist test-utf8.sh is a bash script that demonstrates the problem. It gives the following output:

./test-utf8.sh
n** First run with mkdir created folder only backups rsync_time_machine.py source test-utf8.sh
rsync-time-machine.py: No previous backup - creating new one.
rsync-time-machine.py: Creating destination backups/2023-05-17-003858
rsync-time-machine.py: Starting backup...
rsync-time-machine.py: From: source/
rsync-time-machine.py: To:   backups/2023-05-17-003858/
rsync-time-machine.py: Running command:
rsync-time-machine.py: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/tbatters/.rsync-time-backup/2023-05-17-003858.log'  -- 'source/' 'backups/2023-05-17-003858/'
rsync-time-machine.py: Backup completed without errors.
In the finder window create a new folder
and name it Mîso2 by typing! Using option-i,i
for the second character.
Press any key...

rsync-time-machine.py: Running remote command: test -e 'source'
rsync-time-machine.py: Running remote command: find 'backups/backup.marker'
rsync-time-machine.py: Command output:
backups/backup.marker

rsync-time-machine.py: Running remote command: find 'backups/' -maxdepth 1 -type d -name '????-??-??-??????' -prune | sort -r
rsync-time-machine.py: Command output:
backups//2023-05-17-003858

rsync-time-machine.py: Running remote command: find 'backups/backup.inprogress'
rsync-time-machine.py: Command stderr:
find: backups/backup.inprogress: No such file or directory

rsync-time-machine.py: Running remote command: df -T 'source'
rsync-time-machine.py: Running remote command: df -T 'backups'
rsync-time-machine.py: Running remote command: cd 'backups//2023-05-17-003858';pwd
rsync-time-machine.py: Command output:
/Users/tbatters/BIG-STUFF/utf8-test/backups/2023-05-17-003858

rsync-time-machine.py: Previous backup found - doing incremental backup from /Users/tbatters/BIG-STUFF/utf8-test/backups/2023-05-17-003858
rsync-time-machine.py: Running remote command: find 'backups/2023-05-17-003910'
rsync-time-machine.py: Command stderr:
find: backups/2023-05-17-003910: No such file or directory

rsync-time-machine.py: Creating destination backups/2023-05-17-003910
rsync-time-machine.py: Running remote command: mkdir -p -- 'backups/2023-05-17-003910'
rsync-time-machine.py: Running remote command: find 'backups/' -maxdepth 1 -type d -name '????-??-??-??????' -prune | sort -r
rsync-time-machine.py: Command output:
backups//2023-05-17-003910
backups//2023-05-17-003858

rsync-time-machine.py: Starting backup...
rsync-time-machine.py: From: source/
rsync-time-machine.py: To:   backups/2023-05-17-003910/
rsync-time-machine.py: Running command:
rsync-time-machine.py: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/tbatters/.rsync-time-backup/2023-05-17-003910.log' --link-dest='/Users/tbatters/BIG-STUFF/utf8-test/backups/2023-05-17-003858' -- 'source/' 'backups/2023-05-17-003910/'
rsync-time-machine.py: Running remote command: echo 83170 > backups/backup.inprogress
rsync-time-machine.py: Running remote command: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/tbatters/.rsync-time-backup/2023-05-17-003910.log' --link-dest='/Users/tbatters/BIG-STUFF/utf8-test/backups/2023-05-17-003858' -- 'source/' 'backups/2023-05-17-003910/'
Traceback (most recent call last):
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 847, in <module>
    main()
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 829, in main
    backup(
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 790, in backup
    log_file = start_backup(
               ^^^^^^^^^^^^^
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 700, in start_backup
    run_cmd(cmd)
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 365, in run_cmd
    log_info(f"Command output:\n{style(result.stdout, 'magenta', bold=True)}")
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 56, in log_info
    log(message, "info")
  File "/Users/tbatters/BIG-STUFF/utf8-test/./rsync_time_machine.py", line 51, in log
    print(f"{style(APPNAME, bold=True)}: {levels[level]}{message}", file=output)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 81: surrogates not allowed
RESULT=1
^^^^ this should have failed
rsync-time-machine.py: Running remote command: test -e 'source'
rsync-time-machine.py: Running remote command: find 'backups/backup.marker'
rsync-time-machine.py: Command output:
backups/backup.marker

rsync-time-machine.py: Running remote command: find 'backups/' -maxdepth 1 -type d -name '????-??-??-??????' -prune | sort -r
rsync-time-machine.py: Command output:
backups//2023-05-17-003910
backups//2023-05-17-003858

rsync-time-machine.py: Running remote command: find 'backups/backup.inprogress'
rsync-time-machine.py: Command output:
backups/backup.inprogress

rsync-time-machine.py: Running remote command: cat backups/backup.inprogress
rsync-time-machine.py: Command output:
83170

rsync-time-machine.py: Running remote command: ps -p 83170 -o 'command' | grep 'rsync-time-machine.py'
rsync-time-machine.py: backups/backup.inprogress already exists - the previous backup failed or was interrupted. Backup will resume from there.
rsync-time-machine.py: Running remote command: mv -- backups//2023-05-17-003910 backups/2023-05-17-003911
rsync-time-machine.py: Running remote command: find 'backups/' -maxdepth 1 -type d -name '????-??-??-??????' -prune | sort -r
rsync-time-machine.py: Command output:
backups//2023-05-17-003911
backups//2023-05-17-003858

rsync-time-machine.py: Running remote command: echo 83189 > backups/backup.inprogress
rsync-time-machine.py: Running remote command: df -T 'source'
rsync-time-machine.py: Running remote command: df -T 'backups'
rsync-time-machine.py: Running remote command: cd 'backups//2023-05-17-003910';pwd
rsync-time-machine.py: Command output:
/Users/tbatters/BIG-STUFF/utf8-test

rsync-time-machine.py: Command stderr:
/bin/sh: line 0: cd: backups//2023-05-17-003910: No such file or directory

rsync-time-machine.py: Previous backup found - doing incremental backup from /Users/tbatters/BIG-STUFF/utf8-test
rsync-time-machine.py: Running remote command: find 'backups/2023-05-17-003911'
rsync-time-machine.py: Command output:
backups/2023-05-17-003911
backups/2023-05-17-003911/Mîso
backups/2023-05-17-003911/Mîso2

rsync-time-machine.py: Running remote command: find 'backups/' -maxdepth 1 -type d -name '????-??-??-??????' -prune | sort -r
rsync-time-machine.py: Command output:
backups//2023-05-17-003911
backups//2023-05-17-003858

rsync-time-machine.py: Starting backup...
rsync-time-machine.py: From: source/
rsync-time-machine.py: To:   backups/2023-05-17-003911/
rsync-time-machine.py: Running command:
rsync-time-machine.py: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/tbatters/.rsync-time-backup/2023-05-17-003911.log' --link-dest='/Users/tbatters/BIG-STUFF/utf8-test' -- 'source/' 'backups/2023-05-17-003911/'
rsync-time-machine.py: Running remote command: echo 83189 > backups/backup.inprogress
rsync-time-machine.py: Running remote command: rsync -D --numeric-ids --links --hard-links --one-file-system --itemize-changes --times --recursive --perms --owner --group --stats --human-readable --log-file '/Users/tbatters/.rsync-time-backup/2023-05-17-003911.log' --link-dest='/Users/tbatters/BIG-STUFF/utf8-test' -- 'source/' 'backups/2023-05-17-003911/'
rsync-time-machine.py: Command output:

Number of files: 3
Number of files transferred: 0
Total file size: 0 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 60
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 76
Total bytes received: 20

sent 76 bytes  received 20 bytes  192.00 bytes/sec
total size is 0  speedup is 0.00

rsync-time-machine.py: Backup completed without errors.
rsync-time-machine.py: Running remote command: rm -f -- 'backups/latest'
rsync-time-machine.py: Running remote command: ln -s -- '2023-05-17-003911' 'backups/latest'
rsync-time-machine.py: Running remote command: rm -f -- 'backups/backup.inprogress'
RESULT=0
^^^^ this should have succeeded

I forgot to add the gist repeats the final command twice to demonstrate the odd behavior that it alternates between succeeding and failing on subsequent runs. The actual backups are performed correctly in both cases though I believe.

Thanks a lot for replying back! I still am not able to reproduce the issue, even when manually generating the file.

ChatGPT tells me:

The error you're encountering seems to be related to how different systems handle Unicode characters in filenames. The filename "Mîso" can be represented in Unicode in two different ways:

  • A single Unicode character for "î" (U+00EE LATIN SMALL LETTER I WITH CIRCUMFLEX)
  • Two Unicode characters: "i" (U+0069 LATIN SMALL LETTER I) followed by "^" (U+0302 COMBINING CIRCUMFLEX ACCENT)

The Finder on macOS uses the second representation, which is called a decomposed form (NFD), where the base character and the accent are stored separately. This is different from the composed form (NFC) where the base character and the accent are stored together as a single Unicode character.

However, also that it might depend on the filesystem.

I am trying to reproduce the error here #9 but regardless of encoding, the file has the same name once saved.

I will keep looking.

It seems like the code is only failing in logging. Perhaps the string sanitation I added in #9 is sufficient to make this work?

That works.

OK great! Thanks so much for your patience and reporting back. I merged the PR :)