jborg/attic

Attic on ARM: Data integrity error

akhayyat opened this issue · 30 comments

When I create an archive using an x86_64 machine, and try to do anything with it using an ARM machine (DreamPlug running Debian testing), Attic on ARM always says:

attic: Error: Data integrity error

This happens with the following commands, when run on ARM:

  1. attic list <repository>
  2. attic list <repository>::<archive>
  3. attic info <repository>::<archive>
  4. attic check --archives-only <repository> (not reproducible now. See below)

At first, when the repository size was small, repository checks appeared to complete successfully, while archive checks didn't. I can't run attic check on ARM anymore; it either segfaults or takes forever.

The repository resides on a disk attached to the ARM machine, and has two archives. One was created on the x86_64 machine, and rsynced to the ARM machine, while the other was created by the x86_64 machine over nfs to the same repository.

All operations on the repository by the x86_64 machine over nfs complete successfully.

Attic was installed via pip3 install --user attic on both machines (version 0.16).

Dependencies on both machines are installed from Debian testing archives:

  • python3-msgpack: 0.4.2-1

@akhayyat can you tell something more about the dreamplug's cpu?

32bits? little or big endian? which arm precisely?

Also, older msgpack versions tend to have issues, could you retry with >= 0.4.6?

CPU: Marvell Kirkwood, ARMv5, little-endian, 32-bit. Debian arch: armv5tel
Memory: 512 MiB 16bit DDR2-800 MHz

I installed attic again in a virtualenv on ARM, which brought in msgpack 0.4.6. attic list and attic info still say Data integrity error. This is on the repository/archives created by the old setup, though, i.e. msgpack 0.4.2 on x86_64.

Please do a full test cycle with recent msgpack. Older msgpack versions had corruption issues under some circumstances (there's another ticket about this).

I recreated the repository and one archive on an x86_64 machine with Attic 0.16 and msgpack 0.4.6 from PyPI in a virtualenv, and rsynced it to the ARM machine.

attic list and attic info on ARM with a matching setup (same Attic and msgpack versions in a virtualenv) still result in a Data integrity error.

This is a keyfile-encrypted repository, if it makes any difference.

A simpler failed test. Succeeds on x86_64. Fails on ARM..
Not sure if it's related to the initial problem..

$ attic init ~/tmp.attic
Initializing repository at ".../tmp.attic"
Encryption NOT enabled.
Use the "--encryption=passphrase|keyfile" to enable encryption.
Initializing cache...

$ attic create ~/tmp.attic::mydir-1 ~/mydir
Traceback (most recent call last):
  File ".../.virtualenvs/attic/bin/attic", line 3, in <module>
    main()
  File ".../.virtualenvs/attic/lib/python3.4/site-packages/attic/archiver.py", line 730, in main
    exit_code = archiver.run(sys.argv[1:])
  File ".../.virtualenvs/attic/lib/python3.4/site-packages/attic/archiver.py", line 720, in run
    return args.func(args)
  File ".../.virtualenvs/attic/lib/python3.4/site-packages/attic/archiver.py", line 130, in do_create
    archive.save()
  File ".../.virtualenvs/attic/lib/python3.4/site-packages/attic/archive.py", line 209, in save
    self.repository.commit()
  File ".../.virtualenvs/attic/lib/python3.4/site-packages/attic/repository.py", line 130, in commit
    self.compact_segments()
  File ".../.virtualenvs/attic/lib/python3.4/site-packages/attic/repository.py", line 194, in compact_segments
    assert segments[segment] == 0
AssertionError

I did the same commands (init + create) on a raspberry pi 2 (armv7, 32bit, raspbian / debian wheezy, python 3.2.3) without problems.

I used a current git repo checkout and installed everything into a virtualenv (including a fresh msgpack etc.).

I also tried to init a (unencrypted) repo and created 2 archives on x86_64, scped it to the rpi2 and then did a repo check and repo extract --dry-run -- no issues. Also tried with encrypted repo (passphrase), no problem.

The Raspberry Pi uses a different CPU. The original Pi is ARMv6 (and I think the Pi 2 is ARMv7). The DreamPlug is ARMv5.

I have a few other Python programs running on the DreamPlug with no issues.

What else can I do to help debug this problem?

@akhayyat your experiment, was that a rather large repo/archive? how big? does your dreamplug have swapspace in case it runs out of memory? was there enough disk space in your home / in your repo location? did you check if there is a permissions issue on the repo?

you could also run the unit tests (but there are some known issues, see the pull requests).

No, it was a small archive (800kB, ~90 files/dirs). Plenty of free disk space and memory. No permission issues.

Also, although it is a larger archive (2.5GB, ~24k files), the first archive works fine over NFS on x86_64, but not locally on ARM (attic list -> Data integrity error), so it's none of the obvious problems above.

@akhayyat please post output of cat /proc/cpuinfo, uname -a and cat /proc/cpu/alignment. If the latter doesn't end with "(fixup)", also have a look at the end of your kernel log after running into the attic issue - anything special there?

$ cat /proc/cpuinfo
processor       : 0
model name      : Feroceon 88FR131 rev 1 (v5l)
BogoMIPS        : 1191.11
Features        : swp half thumb fastmult edsp 
CPU implementer : 0x56
CPU architecture: 5TE
CPU variant     : 0x2
CPU part        : 0x131
CPU revision    : 1

Hardware        : Marvell Kirkwood (Flattened Device Tree)
Revision        : 0000
Serial          : 0000000000000000
$ uname -a
Linux dream 3.10-3-kirkwood #1 Debian 3.10.11-1 (2013-09-10) armv5tel GNU/Linux
$ cat /proc/cpu/alignment
User:           1443368
System:         3169
Skipped:        0
Half:           0
Word:           3169
DWord:          0
Multi:          0
User faults:    0 (ignored)

Nothing relevant in syslog, kern.log, or messages.

The alignment stats look like there are a lot of userspace alignment errors, and the policy for handling them is to IGNORE them (not FIXUP, which would correct them). I guess that means wrong data is read from memory, and that would explain a lot.

http://www.aleph1.co.uk/chapter-10-arm-structured-alignment-faq

On the Raspberry Pi 2, I see no "user" faults, no "word" faults, and the handling mode is "2 (fixup)". That might be because the ARMv7 CPU can do unaligned accesses up to word length without help from the "fixup" fault handler.

You could use echo 3 > /proc/cpu/alignment to have Linux fixup the misaligned memory accesses and provide some dmesg output (or 2, if you don't want to have it in dmesg output).

So, not sure whether this is related to the integrity error, but it might be worth investigating.
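For reference, the mode numbers written to /proc/cpu/alignment in this thread (0 for ignore, 2 for fixup, 3 for fixup plus a kernel log message, 4 for a bus error) read naturally as a bit mask. The sketch below decodes them, assuming the layout given in the kernel's ARM alignment-trap notes (bit 0 = warn, bit 1 = fixup, bit 2 = signal). The names `AlignPolicy` and `decode_alignment_mode` are made up for illustration and are not part of Attic or the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout of the value written to /proc/cpu/alignment. */
enum {
    ALIGN_WARN   = 1 << 0,  /* log a message for each unaligned access   */
    ALIGN_FIXUP  = 1 << 1,  /* emulate the access so the result is right */
    ALIGN_SIGNAL = 1 << 2   /* deliver SIGBUS to the offending process   */
};

/* Hypothetical helper type: what the kernel would do in a given mode. */
typedef struct {
    bool warn;
    bool fixup;
    bool sigbus;
} AlignPolicy;

/* Split a mode number into its individual behaviours. Mode 0 sets no
 * flags at all, which is the "ignored" policy seen on the DreamPlug. */
AlignPolicy decode_alignment_mode(int mode)
{
    assert(mode >= 0);
    AlignPolicy p;
    p.warn   = (mode & ALIGN_WARN)   != 0;
    p.fixup  = (mode & ALIGN_FIXUP)  != 0;
    p.sigbus = (mode & ALIGN_SIGNAL) != 0;
    return p;
}
```

Read this way, `echo 3` is simply warn and fixup together, and mode 0 leaves the misaligned load to return whatever the CPU produces.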

Brilliant! You're the man @ThomasWaldmann!

echo 2 > /proc/cpu/alignment

did the trick. attic list and attic info succeed now \o/

\o/

Even better (for performance) would be if there were no unaligned (word?) accesses.

This is my first time trying to debug a python program using gdb, so I may have done something wrong in the following.

$ echo 4 > /proc/cpu/alignment  # generate a bus error instead of ignore or fix
$ gdb python
(gdb) run ~/.virtualenvs/attic/bin/attic list <path-to-repository>
Starting program: ...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabi/libthread_db.so.1".

Program received signal SIGBUS, Bus error.
0xb65535e0 in hashindex_lookup (index=0xb6553590 <hashindex_lookup+32>, index@entry=0x4a4038,
    key=0xb68d0688) at attic/_hashindex.c:85
85      attic/_hashindex.c: No such file or directory.

The HashHeader length is not divisible by 4, and that makes everything after it misaligned.

Locate this line in _hashindex.c:

} __attribute__((__packed__)) HashHeader;

Above it, add this line:

int8_t  dummy1, dummy2;

Try again (create new repo).
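To see why two extra bytes help, here is a minimal sketch of the arithmetic. The field list is an assumption modeled on this discussion (an 8-byte magic, two 32-bit counters, two 8-bit sizes), so the real struct in _hashindex.c may differ. `PackedHeader`, `PaddedHeader` and `misalignment` are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header layout; packed, so the compiler adds no padding. */
typedef struct {
    char    magic[8];
    int32_t num_entries;
    int32_t num_buckets;
    int8_t  key_size;
    int8_t  value_size;
} __attribute__((__packed__)) PackedHeader;

/* The same header with the two dummy bytes suggested above. */
typedef struct {
    char    magic[8];
    int32_t num_entries;
    int32_t num_buckets;
    int8_t  key_size;
    int8_t  value_size;
    int8_t  dummy1, dummy2;
} __attribute__((__packed__)) PaddedHeader;

/* 8 + 4 + 4 + 1 + 1 = 18 bytes, and 20 bytes with the dummies. */
static_assert(sizeof(PackedHeader) == 18, "unexpected packed size");
static_assert(sizeof(PaddedHeader) == 20, "unexpected padded size");

/* How far past a 4-byte boundary the data after the header starts,
 * given that the buffer itself is 4-byte aligned. A non-zero result
 * means any 32-bit access into that data is misaligned. */
size_t misalignment(size_t header_size)
{
    return header_size % sizeof(uint32_t);
}
```

With these assumed sizes, `misalignment(sizeof(PackedHeader))` is 2 and `misalignment(sizeof(PaddedHeader))` is 0. That matches the observation that the padded header works, at the cost of changing the on-disk layout.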

That worked!
But it changes the repository format, doesn't it?

It's used for the local cache and also for the repository index.

Maybe @jborg can help telling how to fix that best.

Although this seems to be a dealbreaker only on older ARM cpus, it might give a little performance advantage elsewhere also if stuff is aligned on 32 (or even 64?) bit boundaries.

"Ignore" mode seems like a giant foot gun to me. Why would anyone prefer silent data corruption over a bus error or automatic fixup?

Anyway, here's a patch that should make hashindex's memory access 32-bit aligned without changing the data format. But since I don't have any way to test this myself, please let me know whether it works.

https://gist.github.com/jborg/1255a877d288a7a504ef

Thanks @jborg for looking into this.
The patch did not solve the problem. Here is the gdb session for an attic init command:

(gdb) run <virtual_env_path>/bin/attic init ~/tmp.attic
Starting program: <virtual_env_path>/bin/python <virtual_env_path>/bin/attic init ~/tmp.attic
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabi/libthread_db.so.1".
Initializing repository at "<home>/tmp.attic"
Encryption NOT enabled.
Use the "--encryption=passphrase|keyfile" to enable encryption.

Program received signal SIGBUS, Bus error.
0xb63b1efc in hashindex_write (path=0xb637ac00 "<home>/tmp.attic/index.tmp", index=0x4ea478)
    at attic/_hashindex.c:273
273         *((uint32_t *)(index->data + 8)) = _htole32(index->num_entries);
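The faulting line above casts a byte pointer to `uint32_t *` and stores through it, which is only safe when the address is 4-byte aligned. One common way to avoid the trap without touching the data format is to read and write the value one byte at a time. The sketch below shows that general technique; it is not the patch from the gist or the later cleanup commit, and `store_le32` and `load_le32` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 32-bit value in little-endian byte order at any address.
 * Single-byte stores have no alignment requirement, so this works
 * even when dst is not a multiple of 4. */
void store_le32(unsigned char *dst, uint32_t value)
{
    assert(dst != 0);
    dst[0] = (unsigned char)(value & 0xff);
    dst[1] = (unsigned char)((value >> 8) & 0xff);
    dst[2] = (unsigned char)((value >> 16) & 0xff);
    dst[3] = (unsigned char)((value >> 24) & 0xff);
}

/* Read back a little-endian 32-bit value from any address. */
uint32_t load_le32(const unsigned char *src)
{
    assert(src != 0);
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

Because the byte order is spelled out explicitly, helpers like these also make a separate host-to-little-endian conversion unnecessary. The trade-off is a few more instructions per access on CPUs that handle unaligned words natively.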

@jborg haha about the foot gun. I thought much the same when I saw this.

@akhayyat sorry about that. I tried to use a trick to make a minimal patch that would apply to both 0.16 and git master.
Anyway, I've now pushed a general code cleanup that should also fix this issue as a side effect.
But since I don't have access to this kind of hardware a confirmation would be nice :)

If you're using 0.16 and not git and 2b34810 doesn't apply cleanly it should be safe to use _hashindex.c from git as drop in replacement.

I'm afraid it's not over yet..
Using the current master, attic list works fine on an existing x86_64-created repository, but info and init still cause bus errors.

$ git rev-parse HEAD
2b348104f668836f9e00103681e3bc85cb49ecae

(gdb) run <virtualenv_path>/bin/attic init tmp.attic
Starting program: <virtualenv_path>/bin/python <virtualenv_path>/bin/attic init tmp.attic
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabi/libthread_db.so.1".
Initializing repository at "tmp.attic"
Encryption NOT enabled.
Use the "--encryption=passphrase|keyfile" to enable encryption.

Program received signal SIGBUS, Bus error.
0xb63b1efc in hashindex_write (path=0xb638b328 "tmp.attic/index.tmp", index=0x3ee990)
    at attic/_hashindex.c:273
warning: Source file is more recent than executable.
273         if(fclose(fd) < 0) {

attic info fails at the same line, too.

Are you sure everything is recompiled properly?
gdb warns about the source file being more recent than the executable. Also, line 273 is the exact same line number as the crash you got from my first patch.

Try "touch attic/*.pyx" to force setup.py to recompile everything.

Oops.. Sorry about that.. You're right. After properly recompiling, everything seems to work correctly :-)
Thanks @ThomasWaldmann and @jborg for the prompt response and the quick fix!

Hmm.. This fix is the 600th commit! 😎

Seems fixed to me too. I had problems running on ARM before. Now with the same dataset (and attic from git master) I can save & restore without errors, and it passes diff -r :).