wal-e/wal-e

gpg problem while restoring encrypted backup

omega opened this issue · 6 comments

omega commented

While setting up encryption of our wal-e backups in our test environment, I stumbled upon a slight problem when trying to restore to a slightly different environment. I don't know if this comes down to different gpg versions, or something else, but here goes.

The server the backups are made from (tried both GCS and the file backend):

gpg --version: 1.4.20
wal-e version: 1.1.0
postgres 10.4

The machine I attempt to restore to:
gpg --version: 2.2.4
wal-e version: 1.1.0
postgres 10.5

I do the following (first a gpg round trip to verify the key works, then the restore):

gpg --recipient "<key_id>" --encrypt /tmp/tmp.18nEVROjB7
gpg --decrypt "/tmp/tmp.18nEVROjB7.gpg"

envdir restore/env /usr/local/bin/wal-e backup-fetch /var/lib/postgresql/10/tmprestore LATEST

The output from wal-e is then:

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-fetch".
        STRUCTURED: time=2018-08-29T06:32:46.298928-00 pid=14258
wal_e.worker.file.file_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000000.tar.lzo.
        HINT: The absolute file key is home/omega/tmp/koriander-pg-backup/basebackups_005/base_000000010000011D00000008_00059128/tar_partitions/part_00000000.tar.lzo.
        STRUCTURED: time=2018-08-29T06:32:46.413433-00 pid=14258
wal_e.worker.file.file_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000004.tar.lzo.
        HINT: The absolute file key is home/omega/tmp/koriander-pg-backup/basebackups_005/base_000000010000011D00000008_00059128/tar_partitions/part_00000004.tar.lzo.
        STRUCTURED: time=2018-08-29T06:32:46.420522-00 pid=14258
wal_e.worker.file.file_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000001.tar.lzo.
        HINT: The absolute file key is home/omega/tmp/koriander-pg-backup/basebackups_005/base_000000010000011D00000008_00059128/tar_partitions/part_00000001.tar.lzo.
        STRUCTURED: time=2018-08-29T06:32:46.426142-00 pid=14258
wal_e.worker.file.file_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000003.tar.lzo.
        HINT: The absolute file key is home/omega/tmp/koriander-pg-backup/basebackups_005/base_000000010000011D00000008_00059128/tar_partitions/part_00000003.tar.lzo.
        STRUCTURED: time=2018-08-29T06:32:46.431523-00 pid=14258
gpg: block_filter 0x0000564683c9aa80: read error (size=15700,a->size=15700)
gpg: block_filter 0x0000564683c9a060: read error (size=16187,a->size=16187)
gpg: WARNING: encrypted message has been manipulated!
gpg: block_filter: pending bytes!
gpg: block_filter: pending bytes!

The base backup files are copied with scp from the remote server, and manually decrypting them with gpg --decrypt works on all of them, seemingly without error, so it seems the files themselves are OK?

Not sure how or where to dig deeper into this, but please let me know if I can provide any additional information. I can't share this backup or the keys used, but if needed I can create another DB, back it up, and encrypt it with someone else's key, if that helps.

Check the status of the key (e.g., whether it has expired). In that case, wal-e does the push with an invalid key (and the result is probably not encrypted).
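
You can see the key's status with, e.g.:

gpg --list-keys "<key_id>"

(expired keys are flagged in the listing).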

omega commented

As I wrote towards the end, I can decrypt the files manually with gpg --decrypt just fine using the same key, and the key is not expired (I can also see it being used when wal-e does the push), so I'm a bit confused as to what could be causing this.
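
(By "manually" I mean a check along these lines, over the partition files I copied:

for f in tar_partitions/part_*.tar.lzo; do gpg --decrypt "$f" > /dev/null && echo "$f ok"; done

and every partition comes out ok.)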

I have the same issue with Azure Storage. Since @omega stated he tried both GCS and the file backend,
it looks like the issue is not related to the storage backend.
I also made the observation that using gpg on the command line works if the file was not downloaded by wal-e.

I don't fully understand the issue. However, I came across #132, where the root cause was described as a missing filesystem-level permission: "didn't have permission to create subfolder."

I've found the root cause of the issue (in my case). wal-e basically sets up a data-processing pipeline consisting of the following steps (roughly the shell pipeline sketched below):

  1. Azure File Storage -> download the files/binary data via the Azure Python SDK
  2. Pipe the downloaded data into gpg to decrypt the incoming stream (using the gpg CLI)
  3. Decompress the data stream using LZO (using the lzop CLI)
  4. Untar the single tar archive into the original Postgres data directory structure (using the Python tarfile module)
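
For illustration, steps 2-4 are roughly equivalent to this manual pipeline on a single partition (restore target path made up):

gpg --decrypt part_00000000.tar.lzo | lzop -dc | tar -xf - -C /path/to/datadir

except that wal-e performs the untar step with Python's tarfile module rather than the tar CLI, which is exactly where it went wrong for me: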

My backup contained a symlink, as the pg_wal directory is stored on a separate disk.
I didn't load the backup into an empty folder; I loaded it into a newly initialized data directory, so the pg_wal directory already existed.
The tarfile module in step 4 was evidently fine with an existing directory when the archive contained a corresponding directory entry, too. However, it failed on an existing directory when the archive contained a corresponding symlink.
This caused the error above.
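
A minimal sketch of that failure mode (the directory layout and link target are made up for illustration):

import io
import os
import tarfile

# Build an in-memory tar whose only member is a symlink: pg_wal -> wal_disk
# ("wal_disk" stands in for the separate-disk location)
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    member = tarfile.TarInfo("pg_wal")
    member.type = tarfile.SYMTYPE
    member.linkname = "wal_disk"
    tar.addfile(member)
buf.seek(0)

# A freshly initialized data directory already contains a real pg_wal dir
os.makedirs("datadir/pg_wal", exist_ok=True)

try:
    with tarfile.open(fileobj=buf) as tar:
        tar.extractall("datadir")
except OSError as err:
    # A directory member would be tolerated here, but the symlink member
    # collides with the existing datadir/pg_wal directory and fails.
    print("extraction failed:", err)

Pointed at an empty target directory, the same archive extracts cleanly.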
Hope this helps @omega to find your root cause

Hi guys!

@vipin030 and I might have found something.

The information below is based on the following system:
CPU cores: 64
Memory: 409 GB
GPG: 2.2.4
libgcrypt: 1.8.1
wal-e: 1.1.0
PostgreSQL: 11.10

When we were trying to restore a large wal-e backup (12+ TB, from a Google Cloud bucket), we discovered that setting --pool-size higher than 12 always gave us lots of decryption errors. Lowering it to 12 made the errors disappear, or so we thought: we still got one gpg error, exactly the kind in @omega's log above. We've verified that the file is in fact not corrupt. We'll try lowering the pool size to 10, restore again, and see if the error disappears.
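
For reference, that just means passing the flag to backup-fetch, something like (data directory path is illustrative):

envdir restore/env /usr/local/bin/wal-e backup-fetch --pool-size 10 /var/lib/postgresql/11/main LATEST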