fatiando/pooch

Repeated fetching with differing Untar processors yields wrong results

dokempf opened this issue · 4 comments

Description of the problem:

The use case is to download an archive, unpack it and get access to individual files within that archive through the members parameter of pooch.Untar. When doing so, the members argument of the first call is cached for subsequent calls, which prohibits a usage pattern that looks otherwise valid. pooch.Unzip is likely to have the same issue.

Full code that generated the error

import pooch
P = pooch.create(
    path=pooch.os_cache("test"),
    base_url="https://github.com/ssciwr/afwizard-test-data/releases/download/2022-06-09/",
    registry={
        "data.tar.gz": "sha256:fae90a3cf758e2346b81fa0e3b005f2914d059ca182202a1de8f627b1ec0c160"
    }
)
files1 = P.fetch("data.tar.gz", processor=pooch.Untar(members=["testfilter1.json"]))
files2 = P.fetch("data.tar.gz", processor=pooch.Untar(members=["testfilter2.json"]))
assert files1[0] != files2[0]

Full error message

Above assertion fails.

Hi @dokempf thank you for reporting this! I can confirm that this is a bug but it's not because of caching members, which we don't do. The code that generates the file names returned by Untar and Unzip is: https://github.com/fatiando/pooch/blob/main/pooch/processors.py#L88

It generates the file names by crawling the unpacked directory on disk. This works fine if we're extracting all members or only a single one. But your use case fails since files2 now contains both members. Could you please try printing files1 and files2 to confirm?

The fix would be to modify the code linked above to take into account the members argument when producing file names for the return statement. We'd also need to add a test that covers this, perhaps here https://github.com/fatiando/pooch/blob/main/pooch/tests/test_processors.py#L144

@dokempf would you like to take a shot at this? No pressure if not, just asking in case you're interested (or in a hurry to have this fixed since I wouldn't be able to do it until August).

@leouieda Thanks for having a look! Actually, in above example, both files1 and files2 contain only the testfilter1.json file and the testfilter2.json files has never been unpacked (I verified by looking at the cache).

I can sure have a shot at the problem when we have identified the problem. For the use case where this arose, I have avoided specifying members (workaround here: https://github.com/ssciwr/afwizard/blob/main/afwizard/paths.py#L96:L109), so I am not in a hurry.

Interesting. I think it's because in this line https://github.com/fatiando/pooch/blob/main/pooch/processors.py#L82 we only check if the base unpacked directory exists before extracting and not the individual members requested. So that check needs to be improved to check for the members as well.

Yeah, that line explains the behaviour quite well, I think. Will have a look and report back.