How to handle paths and references of materials/data nodes from other assays

Question

kMutagene opened this issue 2 years ago · 5 comments

This question came up while trying to create ISA-conformant software assays in the ARC structure.

Consider the following ARC structure:

root/
├── assays/
│   ├── assay1/
│   │   ├── isa.assay.xlsx
│   │   ├── dataset/
│   │   │   ├── script.sh
│   │   │   ├── output.txt
│   ├── assay2/
│   │   ├── isa.assay.xlsx
│   │   ├── dataset/
│   │   │   ├── script.sh
│   │   │   ├── output.txt

assay2 uses the output.txt from assay1. To ensure that this file is referenced correctly, there are two possibilities in the isa.assay.xlsx files:

Use relative paths from the respective dataset folders:

assay1:

Source Name	Parameter[script file]	Data File Name
input	script.sh	output.txt

assay2:

Source Name	Parameter[script file]	Data File Name
../../assay1/dataset/output.txt	script.sh	output.txt

Advantages:
- No need to rename files if some scripts/tools produce duplicate filenames. This also somewhat follows the current ARC structure design, as files such as isa.assay.xlsx are position-aware (this file describes the assay of the respective folder; the filename is not unique)
Disadvantages:
- Users must write relative paths. Unless tools like Swate become ARC-aware, I don't think this is feasible.
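
To make the dataset-relative notation concrete, here is a minimal sketch of how a tool could resolve such a reference into a path relative to the ARC root. The function name `to_arc_root_path` is hypothetical and not part of any existing ARC tool; it only assumes POSIX-style separators in the annotation.

```python
import posixpath


def to_arc_root_path(dataset_dir: str, relative_ref: str) -> str:
    """Resolve a reference written relative to a dataset folder.

    dataset_dir is the dataset folder relative to the ARC root,
    relative_ref is the value found in the isa.assay.xlsx cell.
    """
    resolved = posixpath.normpath(posixpath.join(dataset_dir, relative_ref))
    # Reject references that are absolute or climb out of the ARC root.
    if posixpath.isabs(resolved) or resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"reference leaves the ARC root: {relative_ref!r}")
    return resolved


# assay2 referencing the output of assay1, as in the table above
print(to_arc_root_path("assays/assay2/dataset", "../../assay1/dataset/output.txt"))
# -> assays/assay1/dataset/output.txt
```

The resolution itself is cheap; the hard part remains getting users to type the `../../` prefix correctly in the first place.
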
Force unique filenames as unique material identifier across the whole ARC structure:

assay1:

Source Name	Parameter[script file]	Data File Name
input	script1.sh	output1.txt

assay2:

Source Name	Parameter[script file]	Data File Name
output1.txt	script2.sh	output2.txt

Advantages:
- No need to write relative paths
Disadvantages:
- Users must make sure to use unique filenames. Unless there is an assisted way of doing this (e.g. ArcCommander amending filenames AND their usage in ISA files), this is also a lot to ask in a large ARC, but might be the lesser evil.
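
As a rough illustration of what "making sure" could mean in practice, the following sketch reports file names that occur more than once in a list of ARC paths. `find_name_collisions` is a hypothetical helper, not an ArcCommander feature, and the input list is simply the example structure from above.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def find_name_collisions(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by bare file name and keep only names used more than once."""
    by_name: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}


files = [
    "assays/assay1/dataset/script.sh",
    "assays/assay1/dataset/output.txt",
    "assays/assay2/dataset/script.sh",
    "assays/assay2/dataset/output.txt",
]
collisions = find_name_collisions(files)
for name, found in collisions.items():
    print(name, "is used by", found)
```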

Answer 1 · 2022-10-19T08:14:26.000Z

This might look like a small issue, but it gets very annoying in large ARCs. My specific use case has 3 assays, each producing 2546 files whose names are identical across the assays. I ended up prefixing them with the respective assay name to make them unique across the ARC. Without a bulk rename tool or script, this would have been the point where I most likely stopped bothering.
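
For reference, the prefixing approach can be sketched roughly as follows. This is not the script I actually used; `plan_prefix_renames` is a hypothetical name, and the sketch only computes a rename plan for paths of the form `assays/<assay>/...` without touching the file system.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def plan_prefix_renames(paths: Iterable[str]) -> Dict[str, str]:
    """Map each file under assays/<assay>/ to a name prefixed with the assay name."""
    plan: Dict[str, str] = {}
    for original in paths:
        path = PurePosixPath(original)
        parts = path.parts
        # Only handle files inside an assay folder.
        if len(parts) < 3 or parts[0] != "assays":
            continue
        assay = parts[1]
        # Skip files that already carry the prefix, so the plan can be re-run safely.
        if path.name.startswith(assay + "_"):
            continue
        plan[original] = str(path.with_name(f"{assay}_{path.name}"))
    return plan


plan = plan_prefix_renames([
    "assays/assay1/dataset/output.txt",
    "assays/assay2/dataset/output.txt",
])
for old, new in plan.items():
    print(old, "->", new)
```

Applying such a plan would also require updating every usage of the old names in the ISA files, which is the part that really needs tool support.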

Answer 2 · 2022-10-19T09:01:12.000Z

The issue title might be a bit misleading. By "material identifier" I understand an ID that identifies some sort of IRL material (sample, chemical) that ideally is assigned a PID and is thereby persistent and unique.
I'd argue that we cannot, and would not want to, enforce unique names for individual files, such as measurement data files or derived data files (outputs of scripts). These files are uniquely identified by path + file name (absolute path). Together with the Git URL, commit hash, etc., this would also lead to a kind of PID.
The same is already true for the isa*.xlsx files.
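
To illustrate the idea of a path-based identifier, here is a small sketch. The `repo@commit:path` string format, the class name `FileIdentifier`, and the example URL and commit are all made up for illustration; nothing here is defined by the ARC or ISA specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileIdentifier:
    """Hypothetical PID-like identifier for a file inside a versioned ARC."""

    repo_url: str  # Git URL of the ARC
    commit: str    # commit hash pinning the version
    path: str      # path + file name relative to the ARC root

    def as_string(self) -> str:
        return f"{self.repo_url}@{self.commit}:{self.path}"


first = FileIdentifier("https://example.org/my-arc.git", "0123abc", "assays/assay1/dataset/output.txt")
second = FileIdentifier("https://example.org/my-arc.git", "0123abc", "assays/assay2/dataset/output.txt")
print(first.as_string())
print(second.as_string())
```

Both files are called output.txt, yet the two identifiers differ because the path is part of the identity.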

Answer 3 · 2022-10-19T09:02:41.000Z

> Users must write relative paths.

I don't see that as a disadvantage, but rather making things very clear and traceable.

Answer 4 · 2022-10-19T09:17:48.000Z

> The issue title might be a bit misleading. By "material identifier" I understand an ID that identifies some sort of IRL material (sample, chemical) that ideally is assigned a PID and is thereby persistent and unique.

True, data nodes in the ISA schema exist on the same level as material nodes but are not materials themselves. I updated the title.

> I don't see that as a disadvantage, but rather making things very clear and traceable.

From an academic standpoint I agree, but from a user perspective this is extremely cumbersome: you cannot do this with the current toolchain, and IMO we cannot expect people to reliably write relative paths (which is a tooling issue and not directly related to the specs, though). Nevertheless, if relative paths are preferred, this should be reflected in the specs for Data nodes.

> I'd argue that we cannot, and would not want to, enforce unique names for individual files, such as measurement data files or derived data files (outputs of scripts).

Yeah, this might be the kicker here. If a tool produces files such as checkpoints that are named the same for multiple assays, enforcing a rename is not what you want.

Answer 5 · 2023-04-13T08:23:48.000Z

We've (@kMutagene, @Freymaurer, @omaus) had some discussion about this topic. Besides our resulting image about the specification itself, I'll also paste our thoughts about how to enforce it.

What do data paths look like?

Paths relative to the ARC root path would be the most generally applicable solution.

This would allow annotating files that are not stored in the dedicated ARC folders, as well as references across containers (assays referencing files in studies and such).
At first glance (see example below), this might seem a little bloated. But the big advantage of this approach is that no exceptions to the rule are needed. This is an especially important property for getting the ARC ready for automatic computations.

Example

Following the example given by @kMutagene:

Use paths relative to the ARC root folder:

assay1:

Source Name	Parameter[script file]	Data File Name
input	assays/assay1/dataset/script.sh	assays/assay1/dataset/output.txt

assay2:

Source Name	Parameter[script file]	Data File Name
assays/assay1/dataset/output.txt	assays/assay2/dataset/script.sh	assays/assay2/dataset/output.txt
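
One consequence of root-relative paths is that cross-assay references become easy to detect mechanically. The sketch below reads the assay2 table from above as tab-separated text and lists the cells that point outside the assay's own folder. `external_references` is a hypothetical helper, and the sketch assumes every cell in the table holds a path.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import List, Tuple

ASSAY2_TABLE = (
    "Source Name\tParameter[script file]\tData File Name\n"
    "assays/assay1/dataset/output.txt\tassays/assay2/dataset/script.sh\tassays/assay2/dataset/output.txt\n"
)


def external_references(table_text: str, own_folder: str) -> List[Tuple[str, str]]:
    """Return (column, path) pairs whose path is not located below own_folder."""
    own = PurePosixPath(own_folder)
    found = []
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        for column, value in row.items():
            # A path belongs to the assay if the assay folder is one of its parents.
            if own not in PurePosixPath(value).parents:
                found.append((column, value))
    return found


print(external_references(ASSAY2_TABLE, "assays/assay2"))
```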

Where does data path information come from?

Excel

Theoretically possible to receive absolute paths

Swate Electron / Swate in Arcitect

Receive absolute path using electron logic
Infer relative path from this information
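
The inference step could look roughly like this. It is a Python sketch of the idea only, not Swate code; `infer_arc_relative_path` is a hypothetical name, and POSIX-style absolute paths are assumed (Windows paths would need `PureWindowsPath` instead).

```python
from pathlib import PurePosixPath


def infer_arc_relative_path(absolute_path: str, arc_root: str) -> str:
    """Turn an absolute file path into a path relative to the ARC root.

    Raises ValueError if the file is not located inside the ARC.
    """
    return PurePosixPath(absolute_path).relative_to(PurePosixPath(arc_root)).as_posix()


print(infer_arc_relative_path(
    "/home/user/my-arc/assays/assay1/dataset/output.txt",
    "/home/user/my-arc",
))
# -> assays/assay1/dataset/output.txt
```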

Swate Browser

No way to automate, but the user can be supported via an input mask:
- Extend file picker with path prefix
- More complicated: split into two fields, "Container name" (assay/study) and "subprefix"
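
A sketch of how the split input mask could assemble the final path from its fields. The function name `build_data_path`, the field names, and the mapping of container kinds to the top-level folders `assays` and `studies` are assumptions for illustration.

```python
import posixpath

# Assumed mapping from container kind to top-level ARC folder.
CONTAINER_FOLDERS = {"assay": "assays", "study": "studies"}


def build_data_path(container_kind: str, container_name: str, subprefix: str, file_name: str) -> str:
    """Combine the input mask fields into a path relative to the ARC root."""
    folder = CONTAINER_FOLDERS[container_kind]  # KeyError for unknown kinds
    return posixpath.join(folder, container_name, subprefix, file_name)


print(build_data_path("assay", "assay2", "dataset", "output.txt"))
# -> assays/assay2/dataset/output.txt
```

With this split, the user only picks the container and the file, and the prefix is never typed by hand.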

Swate Excel

See Excel or Swate Browser

How to fix path annotations after the fact?

Implement logic in arcIO.NET. This would allow usage in different end-point applications like ArcCommander, Arcitect, and programmatic access.

Split the logic into two parts:
- Create a list of possible file path annotation fixes
- Implement these fixes
This way, each application could implement its own user consent mechanisms.
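
The two-part split could be sketched as follows. This is Python for illustration only and does not reflect the arcIO.NET API; `PathFix`, `propose_fixes`, and `apply_fixes` are hypothetical names, and the matching rule (a bare file name that matches exactly one known file) is just one possible heuristic.

```python
import posixpath
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PathFix:
    column: str
    old: str
    new: str


def propose_fixes(row: Dict[str, str], known_files: List[str]) -> List[PathFix]:
    """Part 1: list possible fixes without changing anything."""
    fixes = []
    for column, value in row.items():
        if value in known_files:
            continue  # already a valid root-relative path
        matches = [f for f in known_files if posixpath.basename(f) == value]
        if len(matches) == 1:  # ambiguous or unknown names are left alone
            fixes.append(PathFix(column, value, matches[0]))
    return fixes


def apply_fixes(row: Dict[str, str], fixes: List[PathFix],
                consent: Callable[[PathFix], bool]) -> Dict[str, str]:
    """Part 2: apply only the fixes the application (or user) consents to."""
    fixed = dict(row)
    for fix in fixes:
        if fixed.get(fix.column) == fix.old and consent(fix):
            fixed[fix.column] = fix.new
    return fixed


known = ["assays/assay1/dataset/script1.sh", "assays/assay1/dataset/output1.txt"]
row = {"Source Name": "input", "Parameter[script file]": "script1.sh", "Data File Name": "output1.txt"}
fixes = propose_fixes(row, known)
# A command-line tool might prompt here; this example accepts only data file fixes.
result = apply_fixes(row, fixes, consent=lambda fix: fix.column == "Data File Name")
print(result)
```

Because the consent decision is passed in as a callback, a command-line tool, a GUI, and a script can all reuse the same two functions.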