cul-it/hfs2dfxml

disk image sometimes modified after hfs2dfxml.py is called

Opened this issue · 14 comments

If you run hfs2dfxml.py against certain HFS disk images to generate DFXML, it can modify the disk image:

    $ sha256sum my-hfs-disk-image.001
    c3f1cdbe750fa27eeb1ad18c08a135766455f5e29dbebcd049cca076a6f61ea5
    $ python hfs2dfxml.py my-hfs-disk-image.001 hfs2dfxml-output.xml
    $ sha256sum my-hfs-disk-image.001
    6b2f340841e0cc5e7eb8bb474feb8f60e02ff8d8535f2f2cdea1f9adbe1e7bad

Have you seen this behaviour before? I may be able to supply the offending .001 disk image, if helpful.
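For anyone who wants to reproduce the check above from Python rather than the shell, here is a minimal sketch. The helper names `sha256_of` and `unchanged_after` are hypothetical and not part of hfs2dfxml; the sketch only hashes the file before and after an arbitrary action.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_after(path, action):
    """Run `action()` and report whether the file's digest stayed the same."""
    before = sha256_of(path)
    action()
    return sha256_of(path) == before
```

Here `action` could be any callable that runs hfs2dfxml.py on the image, for example through `subprocess`.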

dd388 commented

I have not seen it before. Can you supply the .001 image so I can try it out to see what's happening?

Are you only seeing this with certain images, or does it happen every time?

Of my 4 test images, only the one is failing (M1126-0001.001 is the same as my-hfs-disk-image.001). It's not my file so I'll have to confirm that I can make it public before sharing it.

Unsure if the humount: No volume is current warning is relevant:

    $ sha256sum workfiles/*
    c3f1cdbe750fa27eeb1ad18c08a135766455f5e29dbebcd049cca076a6f61ea5  workfiles/M1126-0001.001
    19bbaf48dbebf0fe4287b28ce98432e1899be40e8fe661af5abf523444c97c11  workfiles/M22296-0001.001
    359c0917411db757665dd48a8a59185abfb78a173201acc5f5544aef0e165009  workfiles/M22717-0007.001
    82034091b3058d01e686135a1fdfe92e37ea9284bdba07c4633c099e249271c6  workfiles/uclalsc_ml_227_026.img

    $ python hfs2dfxml.py workfiles/M1126-0001.001 1.xml
    humount: No volume is current

    $ python hfs2dfxml.py workfiles/M22296-0001.001 2.xml
    humount: No volume is current

    $ python hfs2dfxml.py workfiles/M22717-0007.001 3.xml
    humount: No volume is current

    $ python hfs2dfxml.py workfiles/uclalsc_ml_227_026.img 4.xml
    humount: No volume is current

    $ sha256sum workfiles/*
    3c7d5ba875162531d8cfffc53cdc1ce418593be8be809c1088b510e491c95952  workfiles/M1126-0001.001
    19bbaf48dbebf0fe4287b28ce98432e1899be40e8fe661af5abf523444c97c11  workfiles/M22296-0001.001
    359c0917411db757665dd48a8a59185abfb78a173201acc5f5544aef0e165009  workfiles/M22717-0007.001
    82034091b3058d01e686135a1fdfe92e37ea9284bdba07c4633c099e249271c6  workfiles/uclalsc_ml_227_026.img

Note that subsequent calls change the checksum in different ways:

    $ sha256sum workfiles/M1126-0001.001
    c3f1cdbe750fa27eeb1ad18c08a135766455f5e29dbebcd049cca076a6f61ea5  workfiles/M1126-0001.001
    $ python hfs2dfxml.py workfiles/M1126-0001.001 1.xml
    humount: No volume is current
    $ sha256sum workfiles/M1126-0001.001
    2c2c41ec318b0265b228e679c56d1213d981b2274ae03063d070c4bd433c6e77  workfiles/M1126-0001.001

Unsure if it's relevant, but the M1126-0001.001 image contains a .DS_Store file.

Note also that I'm using a dev branch of a fork: https://github.com/Hwesta/hfs2dfxml/tree/patch-1

dd388 commented

Thanks for the additional information. The humount: No volume is current message isn't a problem here. The script is just checking to ensure another image isn't already mounted (or was not cleanly unmounted from a previous attempt).

Do let me know if you're able to send the disk image -- otherwise, we can go through the hfsutils calls one by one to see which one is causing the problem.

If the M1126-0001.001 disk image is made read-only, do you get an error running hfs2dfxml.py?

Have you tried running a cmp between the two files?
cmp -l original_image changed_image would help give an idea of the extent of the change.
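If cmp is not at hand, a rough Python 3 equivalent could look like the sketch below. `differing_offsets` is a hypothetical helper, not something hfs2dfxml provides. Unlike `cmp -l`, it reports 0-based offsets and decimal byte values, and it only compares the length the two files have in common.

```python
def differing_offsets(path_a, path_b, chunk_size=65536):
    """Yield (offset, byte_a, byte_b) for every position where the files differ.

    Offsets are 0-based. Only the common length of the two files is compared.
    """
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        base = 0
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if not chunk_a or not chunk_b:
                break
            if chunk_a != chunk_b:
                # Only walk byte by byte through chunks that actually differ.
                for index, (a, b) in enumerate(zip(chunk_a, chunk_b)):
                    if a != b:
                        yield base + index, a, b
            base += min(len(chunk_a), len(chunk_b))
```

Counting the results, or looking at the smallest and largest offsets, should give a quick idea of how localized the change is.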

dd388 commented

@jrwdunham -- Just circling back to this: were you able to see if you could share the disk image with me so I can test this out on my system?

@dd388 unfortunately the NYPL folks may not be available in the very near future to give me authorization to share this image. I'll let you know when I've heard from them and hopefully I can find some time to get more technical details on this issue (cf. suggestions above) in the meantime.

dd388 commented

I do have a copy of the disk image (thank you @jrwdunham!).

Preliminary testing shows just the act of mounting the image using hmount and humount is changing it, independent of my script... More soon.

dd388 commented

First test -- ran hmount/humount on the disk image, and then used unhfs to export all of the files to a directory. Then, I took a clean copy of the disk image, ran unhfs to export all of the files to a different directory.

Did comparisons of all exported files -- as far as I can tell, they're all the same. Curious...

(I also ran hexdiff on the altered disk image and the original one, but I couldn't make sense of the results, i.e., whether the altered bytes actually correspond to any files.)

dd388 commented

strace fun ahead...

Log output shows the disk image is opened with the flag of O_RDONLY, but then a few lines down is opened again as O_RDWR.

However, I see two write lines, where a small number of bytes of data is written to the file. This does not happen to a control disk image (i.e., one that isn't changed by hmount). But I will note that the control disk image is also first opened as O_RDONLY, then as O_RDWR (though it never gets written to).

I'll have to dig deeper to figure out what this data is, and why it's being written to the file. I'll also try to think about / work on a workaround/fix that sets the file as readonly before the script gets called.
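One possible shape for that read-only workaround, as a sketch only: a context manager that clears the write permission bits on the image for the duration of a call and restores them afterwards. `read_only` is a hypothetical name, nothing like it exists in the script today. This assumes a POSIX filesystem, and it will not stop writes from a process running as root, since root bypasses permission bits.

```python
import os
import stat
from contextlib import contextmanager


@contextmanager
def read_only(path):
    """Temporarily remove all write permission bits from `path`."""
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, original_mode & ~write_bits)
    try:
        yield path
    finally:
        # Restore the original permissions even if the wrapped call fails.
        os.chmod(path, original_mode)
```

The hfsutils calls would then run inside a `with read_only(image_path):` block. Running them against a throwaway copy of the image would be an alternative that does not depend on permissions at all.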

dd388 commented

I re-compiled hfsutils with --enable-debug and ran hmount through gdb, mounting the disk image in question. This part of the log seems relevant:

    VOL: "DISKIMG" not cleanly unmounted
    VOL: scavenging...
    BLOCK: WRITE vol 0x620620 block 2
    BLOCK: CACHE vol 0x620620 "DISKIMG" hit/miss ratio = 1.500
    VOL: scavenging complete

Then the usual hmount output.

So that suggests to me that something is being set so that the disk image is now seen as "cleanly unmounted." In fact, the second time I mount the disk image, it doesn't show that error.

If I take the disk image, and mount it subsequent times (after the first mount that edits it), it does not seem to change after that. But the way it changes the first time is different each time (which is what @jrwdunham reported.)

So, this doesn't necessarily suggest a tidy solution (to me, so far) but I'll keep digging.

dd388 commented

Make of this what you can: file reports there is Macintosh HFS data (mounted) on this image.

According to libmagic, the mounted part is triggered by a specific pattern: http://www.obscure.org/webmail/program/lib/magic
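To look at the same flag without file, here is a small sketch that reads the start of the HFS Master Directory Block directly. It assumes the HFS volume begins at byte 0 of the image (no partition map in front of it) and relies on the documented MDB layout: signature 0x4244 at byte 1024, followed by the creation date, the last-modification date, and the 16-bit drAtrb attributes, where bit 8 means "unmounted cleanly". If I read the magic entry correctly, "(mounted)" is printed when that bit is clear. `read_mdb_flags` is a hypothetical helper name.

```python
import struct

MDB_OFFSET = 1024             # the MDB starts at byte 1024 of an HFS volume
HFS_SIGNATURE = 0x4244        # drSigWord, ASCII "BD"
ATRB_UNMOUNTED = 0x0100       # drAtrb bit 8: volume was unmounted cleanly
ATRB_SOFTWARE_LOCK = 0x8000   # drAtrb bit 15: volume is locked by software


def read_mdb_flags(path):
    """Return selected MDB fields from an HFS image that starts at byte 0."""
    with open(path, "rb") as handle:
        handle.seek(MDB_OFFSET)
        header = handle.read(12)
    if len(header) < 12:
        raise ValueError("image is too short to contain an HFS MDB")
    # Big-endian: drSigWord (2), drCrDate (4), drLsMod (4), drAtrb (2).
    signature, created, modified, attributes = struct.unpack(">HIIH", header)
    if signature != HFS_SIGNATURE:
        raise ValueError("no HFS signature at byte 1024")
    return {
        "created_raw": created,
        "modified_raw": modified,
        "attributes": attributes,
        "cleanly_unmounted": bool(attributes & ATRB_UNMOUNTED),
        "software_locked": bool(attributes & ATRB_SOFTWARE_LOCK),
    }
```

Comparing the output for the original and the altered image should show whether the clean-unmount bit is what changed, and whether the raw modification date differs between runs.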

dd388 commented

If you set the file to readonly (in the filesystem) and run hmount, it does now report that the volume is "locked". There doesn't seem to be an issue with running hls, though -- all of the contents of the image that I'm expecting are there.