How to handle paths and references of materials/data nodes from other assays
kMutagene opened this issue · 5 comments
This question came up when trying to create ISA-conformant software assays in the ARC structure.
Consider the following arc structure:
root/
├── assays/
│ ├── assay1/
│ │ ├── isa.assay.xlsx
│ │ ├── dataset/
│ │ │ ├── script.sh
│ │ │ ├── output.txt
│ ├── assay2/
│ │ ├── isa.assay.xlsx
│ │ ├── dataset/
│ │ │ ├── script.sh
│ │ │ ├── output.txt
assay2 uses the output.txt from assay1. To ensure that this name is correct, there are 2 possibilities in the isa.assay.xlsx files:
- Use relative paths from the respective dataset folders:
assay1:
Source Name | Parameter[script file] | Data File Name |
---|---|---|
input | script.sh | output.txt |
assay2:
Source Name | Parameter[script file] | Data File Name |
---|---|---|
../../assay1/dataset/output.txt | script.sh | output.txt |
Advantages:
- No need to rename files if some scripts/tools produce duplicate filenames. This also somewhat follows the current ARC structure design, as files such as isa.assay.xlsx are position-aware (this file describes the assay of the respective folder; the filename is not unique).
Disadvantages:
- Users must write relative paths. Unless tools like Swate become ARC-aware, I don't think this is feasible.
- Force unique filenames as unique material identifier across the whole ARC structure:
assay1:
Source Name | Parameter[script file] | Data File Name |
---|---|---|
input | script1.sh | output1.txt |
assay2:
Source Name | Parameter[script file] | Data File Name |
---|---|---|
output1.txt | script2.sh | output2.txt |
Advantages:
- No need to write relative paths
Disadvantages:
- Users must make sure to use unique filenames. Unless there is an assisted way of doing this (e.g. ArcCommander amending filenames AND their usage in ISA files) this is also much to ask in a large ARC, but might be the lesser evil.
This might look like a small issue but gets very annoying in large ARCs. My specific use-case has 3 assays producing 2546 files with the same name each. I ended up prefixing them with the respective assay name to make them unique across the ARC. Without a bulk rename tool or script, this would have been the point where I most likely stopped bothering.
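The prefixing workaround described above can be sketched as a small script. This is only an illustration, not the script that was actually used: the function name `prefix_with_assay_name`, the `<assay>_<filename>` naming scheme, and the assumption that files sit directly in `assays/<assay>/dataset/` are all hypothetical. It also does not touch the ISA files, so annotations would still need to be updated separately.

```python
from pathlib import Path


def prefix_with_assay_name(arc_root: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Plan (and optionally perform) renames of assays/<assay>/dataset/* files.

    Each file gets the name of its assay folder as a prefix, which makes
    filenames unique across assays. Returns the (old, new) pairs.
    """
    renames = []
    for dataset_dir in sorted((arc_root / "assays").glob("*/dataset")):
        assay_name = dataset_dir.parent.name
        # Materialise the listing first so renaming does not disturb iteration.
        files = sorted(p for p in dataset_dir.iterdir() if p.is_file())
        for old in files:
            if old.name.startswith(f"{assay_name}_"):
                continue  # already prefixed, leave untouched
            new = old.with_name(f"{assay_name}_{old.name}")
            renames.append((old, new))
            if not dry_run:
                old.rename(new)
    return renames
```

Running it with `dry_run=True` first only lists the planned renames, which makes it possible to review them before anything is changed on disk.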
- The issue title might be a bit misleading. Under "material identifier" I understand an ID that identifies some sort of IRL material (sample, chemical) that ideally is assigned a PID and by that persistent and unique.
- I'd argue that we cannot / would not want to enforce unique names for individual file names, such as measurement data files or derived data files (outputs of scripts). These files are uniquely identified by path+file name (absolute path). Together with the git url, commit hash, etc. these would also lead to a kind of PID.
Same is already true for the isa*.xlsx files.
Users must write relative paths.
I don't see that as a disadvantage, but rather making things very clear and traceable.
The issue title might be a bit misleading. Under "material identifier" I understand an ID that identifies some sort of IRL material (sample, chemical) that ideally is assigned a PID and by that persistent and unique.
True, data nodes in the ISA schema exist on the same level as material nodes, but are not materials themselves. I updated the title.
I don't see that as a disadvantage, but rather making things very clear and traceable.
From an academic standpoint I agree, but from a user perspective this is extremely cumbersome, as you cannot do this with the current toolchain and IMO cannot expect people to reliably write relative paths (which is then a tooling issue and not directly related to the specs though). Nevertheless, if it is preferred to use relative paths, this should be reflected in the specs for Data nodes.
I'd argue that we cannot / would not want to enforce unique names for individual file names, such as measurement data files or derived data files (outputs of scripts).
Yeah, this might be the kicker here. If a tool produces files such as checkpoints, which are named the same for multiple assays, enforcing a rename is not what you want.
We've (@kMutagene, @Freymaurer, @omaus) had some discussion about this topic. Besides our resulting image about the specification itself, I'll also paste our thoughts about how to enforce it.
What do data paths look like?
Paths relative to the ARC root path would be the most generally applicable solution.
This would allow for annotating files that are not stored in the dedicated ARC folders, and also for cross-container references (assays referencing files in studies and such).
At first glance (see example below), this might seem a little bloated. But the big advantage of this approach is that no exception to the rule would be needed. This is an especially important property for getting the ARC ready for automatic computations.
Example
Following the example given by @kMutagene:
Use relative paths from the ARC root folder:
assay1:
Source Name | Parameter[script file] | Data File Name |
---|---|---|
input | assays/assay1/dataset/script.sh | assays/assay1/dataset/output.txt |
assay2:
Source Name | Parameter[script file] | Data File Name |
---|---|---|
assays/assay1/dataset/output.txt | assays/assay2/dataset/script.sh | assays/assay2/dataset/output.txt |
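To illustrate how the two notations relate, here is a minimal sketch that converts a path written relative to an assay's dataset folder (the first option from the original post) into a path relative to the ARC root, as in the tables above. The function name `to_arc_root_relative` is hypothetical, and the sketch assumes the `assays/<assay>/dataset/` layout from the example. It must only be given values that are file references; a plain source name such as `input` would be treated as a file name too.

```python
import posixpath


def to_arc_root_relative(assay_name: str, annotated_path: str) -> str:
    """Resolve a dataset-relative path against assays/<assay_name>/dataset.

    Returns a normalised, forward-slash path relative to the ARC root.
    Raises ValueError if the path is absolute or points outside the ARC.
    """
    dataset_dir = posixpath.join("assays", assay_name, "dataset")
    resolved = posixpath.normpath(posixpath.join(dataset_dir, annotated_path))
    if posixpath.isabs(resolved) or resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"{annotated_path!r} does not resolve inside the ARC")
    return resolved


# The cross-assay reference from assay2 in the first option:
print(to_arc_root_relative("assay2", "../../assay1/dataset/output.txt"))
# prints: assays/assay1/dataset/output.txt
```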
Where does data path information come from?
Excel
- Theoretically possible to receive absolute paths
Swate Electron / Swate in Arcitect
- Receive absolute path using electron logic
- Infer relative path from this information
Swate Browser
- No way to automate, but
- user can be supported via input mask
- Extend file picker with path prefix
- More complicated: split into two with "Container name" (assay/study) name and "subprefix"
Swate Excel
- See Excel or Swate Browser
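The "infer relative path" step for the Electron case could look roughly like the following. This is a language-agnostic illustration written in Python, not Swate's actual implementation; the function name `infer_arc_relative_path` is hypothetical. It assumes the absolute path of the ARC root and of the picked file are both known, and it does not resolve symlinks or `..` segments.

```python
from pathlib import PurePath, PurePosixPath


def infer_arc_relative_path(arc_root: PurePath, picked_file: PurePath) -> str:
    """Turn an absolute file path into a forward-slash path relative to the ARC root."""
    try:
        return picked_file.relative_to(arc_root).as_posix()
    except ValueError:
        # relative_to raises ValueError when picked_file is not below arc_root.
        raise ValueError(f"{picked_file} is not inside the ARC at {arc_root}") from None


root = PurePosixPath("/home/user/arc")
picked = PurePosixPath("/home/user/arc/assays/assay1/dataset/output.txt")
print(infer_arc_relative_path(root, picked))
# prints: assays/assay1/dataset/output.txt
```

Files picked from outside the ARC raise an error here; an application could instead offer to copy the file into the ARC first.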
How to fix path annotations after the fact?
Implement logic in arcIO.NET. This would allow usage in different end-point applications like ArcCommander, Arcitect, and programmatic access.
Split logic in two parts:
- Create list of possible file path annotation fixes
- Implement these fixes
This way each application could implement their own user consent mechanisms.
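The two-part split can be sketched as follows. This is not arcIO.NET code and does not reflect its API: the `PathFix` type, both function names, the in-memory representation of annotations as a dict of assay name to values, and the matching heuristic (prefer a file in the assay's own dataset folder, otherwise accept a filename that is unique in the ARC) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PathFix:
    assay: str
    old_value: str
    new_value: str


def propose_fixes(annotations: dict[str, list[str]], known_files: Iterable[str]) -> list[PathFix]:
    """Part 1: list possible fixes without changing anything.

    known_files are ARC-root-relative paths of files that exist in the ARC.
    """
    known = set(known_files)
    by_name: dict[str, list[str]] = {}
    for path in known:
        by_name.setdefault(path.rsplit("/", 1)[-1], []).append(path)

    fixes = []
    for assay, values in annotations.items():
        for value in values:
            if value in known:
                continue  # already a valid root-relative path
            local = f"assays/{assay}/dataset/{value}"
            candidates = [local] if local in known else by_name.get(value, [])
            if len(candidates) == 1:  # skip ambiguous or unknown names
                fixes.append(PathFix(assay, value, candidates[0]))
    return fixes


def apply_fixes(
    annotations: dict[str, list[str]],
    fixes: list[PathFix],
    consent: Callable[[PathFix], bool],
) -> dict[str, list[str]]:
    """Part 2: apply only the fixes the application-specific consent callback accepts."""
    accepted = {(f.assay, f.old_value): f.new_value for f in fixes if consent(f)}
    return {
        assay: [accepted.get((assay, value), value) for value in values]
        for assay, values in annotations.items()
    }
```

A command-line tool could pass a `consent` callback that prompts on the terminal, while a GUI could show the proposed list in a dialog; the proposal step stays the same in both cases.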