aodn/python-aodncore

add `*.rm_objects_manifest` to aodn core

Closed this issue · 5 comments

Add a new manifest file extension, `*.rm_objects_manifest`.

This manifest file would contain the S3 object paths of the files to remove from S3, such as
IMOS/SRS/Surface-Waves/Wave-Wind-Altimetry-DM00/TOPEX/060N_220E/IMOS_SRS-Surface-Waves_MW_TOPEX_FV02_066N-236E-DM00.nc
The same files would also be unharvested from the DB. The full path would be required so that the pipeline doesn't need to rely on the dest_path or the physical files to work out the path of the files to remove.
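To make the proposed format concrete, here is a minimal sketch of reading such a manifest. It assumes one full S3 object path per line, ignores blank lines and drops repeated entries; the function name `parse_rm_objects_manifest` and these rules are hypothetical and not taken from aodncore.

```python
def parse_rm_objects_manifest(text):
    """Return the S3 object keys listed in a manifest, in file order.

    Assumed format (hypothetical): one full object path per line.
    Blank lines are skipped and duplicate paths are kept only once.
    """
    keys = []
    seen = set()
    for line in text.splitlines():
        key = line.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        keys.append(key)
    return keys
```

Because each line already holds the full object path, nothing here needs dest_path or the physical file.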

All pipelines could then use it by adding a simple `"allowed_extensions": [".rm_objects_manifest"]` in chef-private.

I believe the Surface Altimeter pipeline would need this functionality, as we have 50000+ files to unharvest.

Not sure we need code in aodncore to handle this (so far) one-off situation. At this stage I think we can first try to identify the full URL of each file by looking in the DB.

Not sure whether that would perform well for a large number of files, but as a one-off solution I'm thinking about something like:

SELECT file_url
FROM srs_surface_waves.srs_surface_waves_map
WHERE file_url LIKE ANY(ARRAY['%file1.nc', '%file2.nc', '%file3.nc', etc...])
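One caveat with the query above: `_` and `%` are wildcards in `LIKE`, and these file names contain many underscores, so an unescaped pattern can match more rows than intended. The sketch below builds the pattern list with those characters escaped (backslash is PostgreSQL's default `LIKE` escape character) and returns the query plus its parameters. It assumes a driver such as psycopg2 that uses `%s` placeholders and adapts a Python list to a SQL array; the function name `build_file_url_query` is hypothetical.

```python
def build_file_url_query(filenames):
    """Build the lookup query and parameters for a list of file names.

    Returns (sql, params), intended for cursor.execute(sql, params) with a
    driver that adapts a Python list to a SQL array (assumed: psycopg2).
    """
    def escape_like(value):
        # Escape the escape character first, then the LIKE wildcards.
        return (value.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))

    # A leading % matches any path prefix in front of the file name.
    patterns = ["%" + escape_like(name) for name in filenames]
    sql = (
        "SELECT file_url "
        "FROM srs_surface_waves.srs_surface_waves_map "
        "WHERE file_url LIKE ANY(%s)"
    )
    return sql, (patterns,)
```

Passing the patterns as a parameter also avoids pasting tens of thousands of literals into the SQL text.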

Identifying the URL is not the issue. The issue is how to unindex the files and remove them from S3 in a timely manner, as calling the harvester for each file would take far too much time.

This feature could also be used by other harvesters every now and then
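On the S3 side, the per-file cost can be reduced because the S3 `DeleteObjects` operation accepts up to 1000 keys per request. The following is a minimal sketch of batched deletion, not how aodncore implements it: `s3_client` is assumed to be a boto3 S3 client (or anything exposing the same `delete_objects` method), and the function name and bucket argument are hypothetical. It does not cover unindexing from the DB.

```python
def delete_in_batches(s3_client, bucket, keys, batch_size=1000):
    """Delete the given object keys using as few requests as possible.

    s3_client is assumed to be a boto3 S3 client. S3 accepts at most
    1000 keys per DeleteObjects request. Returns the number of requests sent.
    """
    if not 1 <= batch_size <= 1000:
        raise ValueError("batch_size must be between 1 and 1000")

    requests_sent = 0
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        s3_client.delete_objects(
            Bucket=bucket,
            # Quiet mode asks S3 to report only the keys that failed.
            Delete={"Objects": [{"Key": key} for key in batch], "Quiet": True},
        )
        requests_sent += 1
    return requests_sent
```

With 50000 keys this comes to 50 delete requests rather than one harvester call per file. A real implementation would also need to inspect the `Errors` entry of each response.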

Not sure if it's the best approach, but what @lbesnard is proposing might also be needed for #53, i.e. enable "manual" deletion of a list of files by a PO. I still use the old function to get rid of duplicates.

I don't think it would be a good idea to allow this sort of manifest file to be uploaded by external users.

A feature like what @lbesnard is suggesting would be good. It is something needed in the SOOP_CO2_RT workflow (at a much smaller scale though, ~500 files) to manually delete RT files upon reception of the related DM dataset. But I agree with @mhidas that such a feature should only be available to POs, not external users.

The feature has been added to the code; discussion of configuration considerations is to be continued elsewhere.