UnconnectedBedna/shrink-backup

Bug in shrink-backup

Dylanroden94 opened this issue · 11 comments

Hello!

We are running into an issue of some of our devices with a no space left on the device.

562.33M 6% 17.66MB/s 0:07:31 rsync: [receiver] write failed on "/tmp/backup-Qzjbebgh3H/var/log/lightdm/lightdm.log": No space left on device (28)

If I run a df command on the device I see a ton of free space.

/dev/root 30340960 7986796 21062172 28% /
devtmpfs 1678472 0 1678472 0% /dev
tmpfs 1942696 0 1942696 0% /dev/shm
tmpfs 777080 1072 776008 1% /run
tmpfs 5120 4 5116 1% /run/lock
/dev/mmcblk0p1 261108 32458 228650 13% /boot
tmpfs 388536 16 388520 1% /run/user/1000
/dev/loop0p2 7326828 8 6933232 1% /tmp/backup-Tg7jtvayTS

Any suggestions?

Thank you for making an issue.

I have ideas, but the easiest way forward is if you provide a complete log by running the script again and include the -l tag.
Remove all mounts and loops (if unsure how, reboot system and they will be gone).
You can either attach the shrink-backup.log file or just copy paste the logs from the latest backup.

The temp directories are not the same from the error and the df command.
The loop is ending with "p2" so I suspect you have a boot partition too, It might be that the backup is somehow running against your boot partition, and that is usually not very big.
All temp directories should also be unmounted and removed when the script fails, same with loops, so there is something fishy going on here.

/dev/root 30340960 7986796 21062172 28% / might also be a problem (/dev/mmcblk0p2 would be a more normal mount here), but we look into it after I have had a peek at the log.
What os are you using?

So please provide log and we move forward. 🙂

I appreciate the help! Please see below.

2024-02-29 22:21:30 [INFO]  - Debugging requested, writing to log file /home/pi/shrink-backup.log
2024-02-29 22:21:31 [INFO]  - ext4 root filesystem detected
2024-02-29 22:21:31 [DEBUG] - FSTYPE=ext4
2024-02-29 22:21:31 [DEBUG] - LOCAL_DEV_PTUUID=36f759f9 | LOCAL_DEV_PATH=/dev/mmcblk0
2024-02-29 22:21:31 [DEBUG] - PARTITION_TABLE=msdos
2024-02-29 22:21:31 [DEBUG] - Update existing img file, UPDATE=true
2024-02-29 22:21:31 [DEBUG] - Requesting size from resize2fs, RESIZE2FS_RUN=true
2024-02-29 22:21:31 [DEBUG] - Prompt for user confirmation, PROMPTS=false
2024-02-29 22:21:31 [DEBUG] - Auto expansion, AUTOEXPAND=true
2024-02-29 22:21:31 [INFO]  - -a selected by user
2024-02-29 22:21:31 [DEBUG] - ADDED_SPACE=0
2024-02-29 22:21:31 [INFO]  - -U sekected by user, executing do_backup function
------------------------------------------------------------------------------
2024-02-29 22:21:31 [INFO]  - Running function: get_shared_variables
2024-02-29 22:21:31 [DEBUG] - LOOP=/dev/loop1
2024-02-29 22:21:31 [INFO]  - Separate boot partition detected
2024-02-29 22:21:31 [DEBUG] - Running: cat /etc/fstab | grep '/boot' | awk '{print $2}'
2024-02-29 22:21:31 [DEBUG] - BOOT_PATH=/boot | IMG_DEV_BOOT_PATH=/dev/loop1p1 | IMG_DEV_ROOT_PATH=/dev/loop1p2
2024-02-29 22:21:31 [INFO]  - Running function: do_loop
2024-02-29 22:21:32 [DEBUG] - Running: losetup -P /dev/loop1 /mnt/raisin/backups/pi/10.0.139.144.img
2024-02-29 22:21:32 [INFO]  - Running function: get_img_variables
2024-02-29 22:21:32 [DEBUG] - Running: ls -l /mnt/raisin/backups/pi/10.0.139.144.img | awk '{print $5}'
2024-02-29 22:21:32 [DEBUG] - IMG_SIZE=8912166400 Bytes
2024-02-29 22:21:32 [INFO]  - Running function: get_dev_variables
2024-02-29 22:21:32 [INFO]  - Separate boot partition detected
2024-02-29 22:21:32 [DEBUG] - LOCAL_DEV_BOOT_PATH=/dev/mmcblk0p1 | LOCAL_DEV_ROOT_PATH=/dev/mmcblk0p2
2024-02-29 22:21:33 [DEBUG] - LOCAL_ROOT_PARTN=2
2024-02-29 22:21:33 [INFO]  - Calculating size for dd to cover bootsector in blocks (512B block size) and adding 256 blocks to overlap into root (only used in img creation)
2024-02-29 22:21:33 [DEBUG] - Running: fdisk -lo start /dev/mmcblk0 | tail -1
2024-02-29 22:21:33 [DEBUG] - LOCAL_ROOT_START=532480 Blocks | LOCAL_BOOTSECTOR=532479 Blocks | LOCAL_DDBOOTSECTOR=532735 Blocks
2024-02-29 22:21:33 [DEBUG] - LOCAL_ROOT_START=272629760 Bytes | LOCAL_BOOTSECTOR=272629248 Bytes
2024-02-29 22:21:33 [DEBUG] - LOCAL_RESIZE2FS_MIN=8698839040 Bytes
2024-02-29 22:21:33 [INFO]  - Setting TOTAL (space needed for files on root) to size calculated by resize2fs
2024-02-29 22:21:33 [DEBUG] - TOTAL=8698839040 Bytes
2024-02-29 22:21:33 [INFO]  - Calculating .img file size by adding LOCAL_BOOTSECTOR to TOTAL (only used in img creation)
2024-02-29 22:21:33 [DEBUG] - TRUNCATE_TOTAL=8971468288 Bytes
2024-02-29 22:21:33 [DEBUG] - Running: fdisk --bytes -lo size /dev/loop1 | tail -1
2024-02-29 22:21:33 [DEBUG] - IMG_ROOT_SIZE=8639536640 Bytes | LOCAL_RESIZE2FS_MIN=8698839040 Bytes | diff=56 MB
2024-02-29 22:21:39 [INFO]  - 6 seconds passed, user did not stop operation
2024-02-29 22:21:39 [INFO]  - Img root partition is <=256MB smaller or <=512MB bigger compared to auto calculated size, not resizing
2024-02-29 22:21:40 [INFO]  - Running function: do_mount
2024-02-29 22:21:41 [INFO]  - Creating temp directory
2024-02-29 22:21:41 [DEBUG] - Running: mktemp -d -t backup-XXXXXXXXXX
2024-02-29 22:21:41 [DEBUG] - TMP_DIR=/tmp/backup-Qzjbebgh3H
2024-02-29 22:21:43 [INFO]  - Mounting root partition from loop
2024-02-29 22:21:43 [DEBUG] - Running: mount /dev/loop1p2 /tmp/backup-Qzjbebgh3H
2024-02-29 22:21:44 [INFO]  - Separate boot partition detected, mounting boot inside root
2024-02-29 22:21:45 [DEBUG] - Running: mount /dev/loop1p1 /tmp/backup-Qzjbebgh3H/boot
2024-02-29 22:21:45 [INFO]  - Backing up files
2024-02-29 22:21:45 [INFO]  - Running function: do_rsync
2024-02-29 22:21:46 [INFO]  - Creating temporary file to store rsync output
2024-02-29 22:21:46 [DEBUG] - Backing up to /mnt/raisin/backups/pi/mnt/raisin/backups/pi/10.0.139.144.img
2024-02-29 22:21:46 [DEBUG] - Running: mktemp -t rsync-XXXXXXXXXX
2024-02-29 22:21:46 [DEBUG] - tmp_file=/tmp/rsync-sFc1CSRzUt
2024-02-29 22:21:46 [DEBUG] - Running: rsync -ahvHAX --exclude={/lost+found,/proc/*,/sys/*,/dev/*,/tmp/*,/run/*,/mnt/*,/media/*,/mnt/raisin/backups/pi/*} --info=progress2 --stats --delete --force --partial --delete-excluded / /tmp/backup-Qzjbeb>
------------------------------------------------------------------------------
2024-02-29 22:23:12 [ERROR] - RSYNC FAILED:
var/log/messages
^M          9.64M   0%  211.94kB/s    0:00:44 (xfr#17, to-chk=126/115421)
var/log/syslog
^M         11.17M   0%  245.46kB/s    0:00:44 (xfr#18, to-chk=121/115421)
var/log/wtmp
^M         11.20M   0%  246.15kB/s    0:00:44 (xfr#19, to-chk=111/115421)
var/log/journal/8c97a032a3234a01b7d117c1da498da0/system.journal
^M         36.37M   0%  795.37kB/s    0:00:44 (xfr#20, to-chk=56/115421)
var/log/journal/8c97a032a3234a01b7d117c1da498da0/user-1000.journal
^M         44.76M   0%  976.56kB/s    0:00:44 (xfr#21, to-chk=30/115421)
var/log/lightdm/
var/log/lightdm/lightdm.log
^M         44.89M   0%  961.65kB/s    2:30:19  ^M        109.54M   1%    2.24MB/s    1:02:30  ^M        147.36M   1%    3.07MB/s    0:45:28  ^M        190.02M   2%   36.65MB/s    0:03:47  ^M        234.13M   2%   42.71MB/s    0:03:13  ^M       >
rsync error: error in file IO (code 11) at receiver.c(378) [receiver=3.2.3]
^M        565.74M   6%   12.44MB/s    0:10:40
rsync: [sender] write error: Broken pipe (32)
------------------------------------------------------------------------------
2024-02-29 22:23:12 [ERROR] - Cleanup function called with non zero exit code: exit 1
2024-02-29 22:23:22 [DEBUG] - Unmounting boot partition in cleanup function: umount /tmp/backup-Qzjbebgh3H/boot
2024-02-29 22:23:24 [DEBUG] - Unmounting root partition in cleanup function: umount /tmp/backup-Qzjbebgh3H
2024-02-29 22:23:24 [DEBUG] - Removing loop in cleanup function: losetup -d /dev/loop1
2024-02-29 22:23:24 [DEBUG] - Removing temp directory in cleanup function: rm -rf /tmp/backup-Qzjbebgh3H
2024-02-29 22:23:24 [DEBUG] - Removing temp file in cleanup function: rm /tmp/rsync-sFc1CSRzUt
2024-02-29 22:23:24 [INFO]  - Elapsed time: 01.54
2024-02-29 22:23:24 [DEBUG] - Exiting script
##############################################################################

It is extremely unfortunate that this row is not complete:
2024-02-29 22:21:46 [DEBUG] - Running: rsync -ahvHAX --exclude={/lost+found,/proc/*,/sys/*,/dev/*,/tmp/*,/run/*,/mnt/*,/media/*,/mnt/raisin/backups/pi/*} --info=progress2 --stats --delete --force --partial --delete-excluded / /tmp/backup-Qzjbeb>

I need to know the exact path at the end so I see that it rsyncs to the correct location.
Did the logfile cut off there or was it something else going wrong in the copy/paste process?
But other than that, I do not see any errors other than spelling mistakes and the obvious "broken pipe".

This is an update though.
Could you make a new backup and see if that fails. Include -l to get the log and provide new info here.
Could simply be that the initial backup to this file you are updating to failed somehow.

You still have things mounted that can mess things up, since I see it provided you with loop1, not loop0 (unless you use snaps, or you have found a bug in the script, you can check with lsblk), so please remove these or reboot before running backup.
They SHOULD not mess things up, you should be able to have as many loops as you want, but when tracking for bugs initially, it's better with a clean environment because I suspect that loop0 is still locked to the img file somehow.

Hello!

I think it was just a copy and paste error, this is the whole line.

2024-02-29 22:21:46 [DEBUG] - Running: rsync -ahvHAX --exclude={/lost+found,/proc/*,/sys/*,/dev/*,/tmp/*,/run/*,/mnt/*,/media/*,/mnt/raisin/backups/pi/*} --info=progress2 --stats --delete --force --partial --delete-excluded / /tmp/backup-Qzjbebgh3H

We do have a network drive connected, but that is where the backups are located.

I did notice on the last successful backup (below) that we are using 96% of available blocks.
`2024-02-23 08:03:07 [DEBUG] - Running: e2fsck -p -f -v /dev/loop1p2
rootfs: /lost+found not found. CREATED.

  113837 inodes used (22.91%, out of 496784)
      95 non-contiguous files (0.1%)
     128 non-contiguous directories (0.1%)
         # of inodes with ind/dind/tind blocks: 0/0/0
         Extent depth histogram: 95590/35
 1895057 blocks used (96.35%, out of 1966768)
       0 bad blocks
       1 large file

   86637 regular files
    8793 directories
       0 character device files
       0 block device files
       0 fifos
    1175 links
   18397 symbolic links (18203 fast symbolic links)
       0 sockets
  115002 files

`

That would make sense, if you run the initial backup without adding any space or using -a, it would become a "minimal backup". You should have gotten a warning about this with (unless you ran it with -y you have to press y here to confirm, or it will continue after 5s, witch is actually 6s):

# WARNING!!! Manually added space is smaller than calculated recommended minimum
# This does NOT mean the backup WILL fail, but CAN fail due to lack of space
# Consider using the -a option or manually adding more space
# Requested root size: 4690MiB
# Calculated recommended minimum: 5110MiB

BUT the upgrade on that img with the tag -a should increase the size when ran, but the log you provided found the size difference too small so it did not.
Resize2fs calculated size vs the actual size:

IMG_ROOT_SIZE=8639536640 Bytes | LOCAL_RESIZE2FS_MIN=8698839040 Bytes | diff=56 MB

So with that 56MiB difference, it will not resize: (has to be +512MiB bigger OR +256MiB smaller)

Img root partition is <=256MB smaller or <=512MB bigger compared to auto calculated size, not resizing

So either the script is getting the size on the img root incorrectly or there is something else going on.

To be able to move forward, we have to make an initial backup and look at the log.
Then update that image by using -U and take a look at that log.
If we still fail, we have to analyze the img file istelf and also see if there is anything on your system that confuses the sizes in the script.

Edit
I just had a thought.
You don't have any other stuff mounted that gets included in the backup?
The network mount I assume is where you place the backups, /mnt/raisin/backups/pi/, and the entire /mnt is excluded if you run the default without an exclude.txt and -t, so that's not it.

The size is set by ONLY looking at root, so if you include other partitions you HAVE to add extra space manually because that data will also be included onto the root filesystem on the img file, but not included when calculating size.
--exclude={/lost+found,/proc/*,/sys/*,/dev/*,/tmp/*,/run/*,/mnt/*,/media/*,/mnt/raisin/backups/pi/*}

In the future, this might be different, because I am planning a functionality to include a separate /home partition.
But until then, only root is used with size calculation.

@Dylanroden94 I would really appreciate a followup on this or I will have to close it as wontfix, I can not move forward without further data. 😞

I am getting closer to a new release and it would be great to fix it before that.

If it was some kind of user error, no worries, just let me close the issue. 🙂

Hello!

Sorry for the late response.

Correct, our backup location is located in /mnt/ and looks to be excluded.

I have attached a firstrun text document with the log output from running a new backup.

I have also attached a second file with the log from running an -U backup.

The second backup did not fail.

Second Run.txt
FirstRun.txt

Thank you so much.

So with that working, one can assume it was some freak error (or user error) on the first file and it is not really a bug.

But to me it kinda still sounds like a bug because if the file was only too small it should have been expanded with -a in the update UNLESS my suspicion of something else getting included wos correct. So I am still a bit confused.

The only way to figure out the initial error is to analyze the failing img file, and I am not sure if you are willing to do that.
Would also have been interesting if you manually added space when updating the backup, would it still fail?

I will leave this thread open, and if you or someone responds, I will ofc address it, but I simply can not fix this without way more info, if it even is a bug to begin with.

There really is nothing I can do about this.

If anybody is reading this later.

DO NOT DO THIS TO A DEVELOPER, THERE IS NOTHING i CAN DO AND AM LEFT HANGING WITHOUT ANY INFORMATION!

If you make a report, and find out it was only a user error, PLEASE LET THE DEVELOPER KNOW!!!!!!

I bumped into this problem myself and finally figured it out.
Was a bit easier when I had an img file to work with.

This is what happens.
If the img root filesystem was close to full, and you remove files from you local filesystem to get space in the update, the update might fail like above anyway.
The reason is that I set rsync to delete files during the backup, not before.

Changing --delete to --delete-before fixes the issue and the backup succeeds without a "broken pipe" error.

The problem with this, always building a file list before starting the backup, is that it takes time, not a huge amount, but it takes time.

So I am working on a fix and will close this when the fix is implemented in the testing branch.

The 1.0 release is right around the corner, I just have some final fixes and tests before releasing it but this bugfix will also be included in that release in case you do not want to use the testing branch until the 1.0 release.

I'm really glad I managed to figure this out, it has been bothering me quite a lot. :)

The fix is now in testing branch.

If an error with "broken pipe" presents and you are sure there should be space on the img file, you can now rerun the same shrink-backup command but add --fix.
That way the img file is emptied of files that will be removed BEFORE the backup starts rather than during.
This means the rsync startup takes a little longer, hence why I do not put this into the standard rsync line.

You can also ofc just manually add extra space in the backup instead of relying on -a.

An img file with a "broken pipe" error does not in my experience "break" the img file.
I should probably add that information to README.MD