RockefellerArchiveCenter/zorya

S3ObjectFinder.list_to_download is returning a list that includes objects already in the database

Closed this issue · 5 comments

In routines.py, S3ObjectFinder().list_to_download() should return a list that does not include any bags that are currently saved to the database. However, it is returning a list that includes some bags that have been saved to the database.

In our bucket linked to development Zorya, there are 3 bags. All 3 bags have the same name as bags saved in the database, so list_to_download should return an empty array. However, it's returning an array with one of the bags.

The files_in_bucket are ['4977307ee0f2493d984484cdb30dbb2b.tar', 'f70e24901427497b8caed6c4d234e7db.tar', 'f7254e2aadc14c849bd6edde66d92307.tar']. The return of list_to_download is ['f70e24901427497b8caed6c4d234e7db.tar']. If I run Bag.objects.filter(original_bag_name__contains='f70e24901427497b8caed6c4d234e7db.tar').exists() outside of this function in the Django shell, it returns True (as it should).

I'm not sure what is happening?

@bonniegee I think this might be because you have an if/elif, when what you really want is multiple if statements. I suspect the first if is getting triggered, so the elif is never executed.

Also, looking at this after a minute, I think that line 46 should use the exists method: Bag.objects.filter(original_bag_name=join(self.src_dir, filename)).exists()

Thanks @helrond -- I'll try if/if.

I'm not sure where I should add the exists method--line 46 is a docstring and line 52 is Bag.objects.filter(original_bag_name__contains=filename).exists().

If/if is returning the same list 😕

Oops I was looking at an older version of the code.

I don't know why yet, but your for loop is only getting executed twice, when really it should be executed three times. I'm not sure if the remove is affecting the iterator. I would suggest just turning lines 49-54 into a one-liner:
[filename for filename in files_in_bucket if expected_file_name(filename) and not Bag.objects.filter(original_bag_name__contains=filename).exists()]

There's probably a more elegant way to write that condition at the end using bool or all.
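For what it's worth, removing items from a list while a for loop is iterating over that same list does make the loop skip elements, which would explain both the two iterations and the middle bag surviving. Here is a minimal sketch that reproduces it without Django or S3. The loop body is my guess at what lines 49-54 look like, and `expected_file_name`, `in_database`, and `saved_bag_names` are hypothetical stand-ins for the real filename check and the `Bag.objects.filter(...).exists()` lookup.

```python
# Stand-ins for the real bucket listing and database contents (assumptions).
files_in_bucket = [
    "4977307ee0f2493d984484cdb30dbb2b.tar",
    "f70e24901427497b8caed6c4d234e7db.tar",
    "f7254e2aadc14c849bd6edde66d92307.tar",
]
saved_bag_names = set(files_in_bucket)  # every bag is already "in the database"


def expected_file_name(filename):
    """Hypothetical filename check; the real rule may differ."""
    return filename.endswith(".tar")


def in_database(filename):
    """Hypothetical stand-in for Bag.objects.filter(...).exists()."""
    return filename in saved_bag_names


def buggy_list_to_download(files):
    """Removes from the list it is iterating over, so items get skipped."""
    files = list(files)  # copy so the caller's list is untouched
    visited = []
    for filename in files:
        visited.append(filename)
        if not expected_file_name(filename):
            files.remove(filename)
        elif in_database(filename):
            files.remove(filename)
    return files, visited


def fixed_list_to_download(files):
    """Builds a new list: keep expected names that are not yet saved."""
    return [f for f in files if expected_file_name(f) and not in_database(f)]


buggy_result, visited = buggy_list_to_download(files_in_bucket)
fixed_result = fixed_list_to_download(files_in_bucket)

print(len(visited))   # number of times the loop body ran
print(buggy_result)   # what the remove-while-iterating version returns
print(fixed_result)   # what the list comprehension returns
```

After the first remove, the second bag shifts into index 0, the iterator moves on to index 1, and the second bag is never checked. Building a new list (or iterating over a copy, e.g. `for filename in list(files_in_bucket)`) avoids that.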