ssl-hep/ServiceX

Track file replicas separately from files


The dataset table represents the list of files connected to a particular DID. Requests are tied to their dataset record. While the list of files is guaranteed not to change, the specific replicas of these files can change at any time.

We need a way to refresh the list of replicas without introducing referential integrity problems in the transform_results table.

The best solution would be to have a separate table for replicas. The transform results table will reference the file table. When we want to refresh the list of replicas we would delete the replica records for the dataset, but leave the files in place.
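A minimal sketch of the proposed layout, using an in-memory SQLite database for illustration. The table and column names (`files`, `file_replicas`, `dataset_id`, `url`, and so on) and the helper `refresh_replicas` are assumptions made for this example, not the actual ServiceX schema. The point is that `transform_results` references `files`, so replica rows can be deleted and re-inserted without touching anything a result points to.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the real ServiceX tables.
SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL,
    file_name TEXT NOT NULL
);
CREATE TABLE file_replicas (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id),
    url TEXT NOT NULL
);
CREATE TABLE transform_results (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id),
    status TEXT NOT NULL
);
"""


def open_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def refresh_replicas(conn, dataset_id, replicas_by_file):
    """Replace all replica rows for a dataset; file rows stay in place.

    replicas_by_file maps a file name to its list of replica URLs.
    """
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM file_replicas WHERE file_id IN "
            "(SELECT id FROM files WHERE dataset_id = ?)",
            (dataset_id,),
        )
        for file_name, urls in replicas_by_file.items():
            row = conn.execute(
                "SELECT id FROM files WHERE dataset_id = ? AND file_name = ?",
                (dataset_id, file_name),
            ).fetchone()
            if row is None:
                continue  # unknown file: skip rather than create it here
            conn.executemany(
                "INSERT INTO file_replicas (file_id, url) VALUES (?, ?)",
                [(row[0], url) for url in urls],
            )
```

Because the delete only touches `file_replicas`, existing `transform_results` rows keep valid foreign keys throughout the refresh.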

The submit_transform workflow will go to Rucio only when the dataset record does not exist or its list of replicas is empty.
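The check could be as small as the sketch below. The function name `needs_rucio_lookup` and its argument shapes are hypothetical; the real submit_transform code may represent the dataset record and replicas differently.

```python
from typing import Optional, Sequence


def needs_rucio_lookup(dataset: Optional[dict], replicas: Sequence[str]) -> bool:
    """Return True when the workflow should ask Rucio for replicas.

    dataset is the stored dataset record, or None if there is none yet.
    replicas is the list of replica URLs currently stored for that dataset.
    """
    return dataset is None or len(replicas) == 0
```

With this shape, a replica refresh is triggered simply by deleting the replica rows: the next submit sees an empty list and goes back to Rucio.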

The Dataset table does not hold the list of files. It holds info on the dataset; info on files is in the Files table. While the number of files connected to a dataset rarely changes (only during lookup or if files were declared lost), the paths variable does change.
A new table for replicas would work, but it is not really necessary. One can simply update the paths variable in the existing file record.
We should probably make a list of all the changes needed on the db and apply them in one go. It would also be nice to drop all the old versions.
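For comparison, the in-place alternative from this comment might look like the sketch below. It assumes a `files` table with a text column `paths` holding comma-separated replica URLs; that storage format and the helper names are assumptions for illustration, not confirmed details of the Files table.

```python
import sqlite3


def make_files_table(conn):
    """Create a hypothetical files table with a text `paths` column."""
    conn.execute(
        "CREATE TABLE files ("
        "id INTEGER PRIMARY KEY, "
        "dataset_id INTEGER NOT NULL, "
        "paths TEXT NOT NULL)"
    )


def update_paths(conn, file_id, paths):
    """Overwrite the replica paths of one file record in place.

    Returns True if exactly one row was updated.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE files SET paths = ? WHERE id = ?",
            (",".join(paths), file_id),
        )
    return cur.rowcount == 1
```

The file's primary key never changes here, so anything referencing the file record stays valid; the trade-off is that replicas are not individually addressable rows.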