Repeated fetching with differing Untar processors yields wrong results
dokempf opened this issue · 4 comments
Description of the problem:
The use case is to download an archive, unpack it and get access to individual files within that archive through the members
parameter of pooch.Untar
. When doing so, the members
argument of the first call is cached for subsequent calls, which prohibits a usage pattern that looks otherwise valid. pooch.Unzip
is likely to have the same issue.
Full code that generated the error
import pooch
P = pooch.create(
path=pooch.os_cache("test"),
base_url="https://github.com/ssciwr/afwizard-test-data/releases/download/2022-06-09/",
registry={
"data.tar.gz": "sha256:fae90a3cf758e2346b81fa0e3b005f2914d059ca182202a1de8f627b1ec0c160"
}
)
files1 = P.fetch("data.tar.gz", processor=pooch.Untar(members=["testfilter1.json"]))
files2 = P.fetch("data.tar.gz", processor=pooch.Untar(members=["testfilter2.json"]))
assert files1[0] != files2[0]
Full error message
Above assertion fails.
Hi @dokempf thank you for reporting this! I can confirm that this is a bug but it's not because of caching members
, which we don't do. The code that generates the file names returned by Untar
and Unzip
is: https://github.com/fatiando/pooch/blob/main/pooch/processors.py#L88
It generates the file names by crawling the unpacked directory on disk. This works fine if we're extracting all members or only a single one. But your use case fails since files2
now contains both members. Could you please try printing files1
and files2
to confirm?
The fix would be to modify the code linked above to take into account the members
argument when producing file names for the return
statement. We'd also need to add a test that covers this, perhaps here https://github.com/fatiando/pooch/blob/main/pooch/tests/test_processors.py#L144
@dokempf would you like to take a shot at this? No pressure if not, just asking in case you're interested (or in a hurry to have this fixed since I wouldn't be able to do it until August).
@leouieda Thanks for having a look! Actually, in above example, both files1
and files2
contain only the testfilter1.json
file and the testfilter2.json
files has never been unpacked (I verified by looking at the cache).
I can sure have a shot at the problem when we have identified the problem. For the use case where this arose, I have avoided specifying members
(workaround here: https://github.com/ssciwr/afwizard/blob/main/afwizard/paths.py#L96:L109), so I am not in a hurry.
Interesting. I think it's because in this line https://github.com/fatiando/pooch/blob/main/pooch/processors.py#L82 we only check if the base unpacked directory exists before extracting and not the individual members requested. So that check needs to be improved to check for the members as well.
Yeah, that line explains the behaviour quite well, I think. Will have a look and report back.