Step target example broken
gaow opened this issue · 29 comments
Separated from a private correspondence: the RNA-seq DE example workflow is now broken. To reproduce:
cd sos-docs/src/examples
sos run RNASeqDE.ipynb align -j 2 --samples SRR1039508
Error message:
ERROR: Multiple steps hg19_reference_1, hg19_reference_2, hg19_reference_3 to generate target sos_step("hg19_reference")
This used to work, though. I guess it may be similar to the issue in #981, introduced by recent changes to the target dependency code.
In this example, you have
[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]
and then
depends: sos_step('hg19_reference')
so are you depending on a subworkflow?
Yes. I wonder if this is okay, because I believe it used to work.
I never knew it worked, so I would say it was undefined behavior that we need to formalize now. It is like triggering a subworkflow from a dependency. I mean, if you can do
[sub_1]
[sub_2]
[step]
depends: sos_step('sub')
instead of
[sub_1]
[sub_2]
[step]
sos_run('sub')
why would we not allow a full workflow here?
[A_1]
[A_2]
[B_1]
[step]
depends: sos_step('A+B')
Sorry, I'm a bit confused. Is this a proposal? It seems reasonable, if not difficult ... right?
I do not know how difficult it is. I am just saying that the behavior was undefined before and triggers an error now. If we are to support it, we are essentially allowing the triggering of an entire workflow, not just a `sos_step`; and if we are triggering a workflow, `sos_step('A+B')` makes sense, but then the name `sos_step` might be inappropriate.
I see. I'm okay with the name `sos_step` though, if we decide to support it.
But do we want to support it? It seems a natural extension and feature; but if it is not a robust feature and could cause problems, would we rather not have it? Currently we do not have an immediate need for it -- in the RNA-seq DE example I wanted to demonstrate the feature. Though I could also use file targets, the logic seems more intuitive with this dependency on the mini-workflow. What do you think?
In our manuscript, I specifically said something like: the trunk of the workflow can be outcome-based while process-oriented subworkflows are triggered along the way ... So conceptually speaking this is what we want to do, but we do not have an explicit syntax for it.
Indeed ... I guess we should formalize it. I think you've made the behavior clear, but do we need new syntax like `workflow_target()`, since we have `file_target()`? Or `sos_workflow()`, similar to `sos_variable()`?
Yes, `sos_workflow()` is clearer than `sos_step(workflow)`, but `sos_step` has caused enough trouble and I do not know if I want to stab into `sos_workflow` now.
`a` does not match `a_1` etc. now. The loophole was for `a_0` as far as I remember, but now `a` would not even match `a_0`.
In your case
[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]
[default]
depends: sos_step('hg19_reference')
would not work now, but
[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]
[default]
depends: sos_step('hg19_reference_1'), sos_step('hg19_reference_2'), sos_step('hg19_reference_3')
would work. I am wondering if we can allow something like
[hg19_reference_1 (download)]
[hg19_reference_2 (decompress hg19.fa)]
[hg19_reference_3 (gene annotations)]
[default]
depends: sos_step('hg19_reference_*')
It is conceptually not a workflow, but multiple `sos_step` targets specified using a wildcard character.
I agree with not messing too much with the code at this moment, until there is a real need. For mini-workflows, one can always do
[C]
sos_run('A+B')
[default]
depends: sos_step('C')
right? Then it should be fair to claim that we handle mini-workflows. But I think the `_*` case is one particular exception that is worth special handling, because it is what people might often do -- the most common type of process-oriented subworkflow is one with numeric ordering. So I think it would be good to support `sos_step('hg19_reference')` directly and, under the hood, expand it to match all steps.
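To sketch what I mean by expanding it under the hood (a hypothetical Python illustration, not actual SoS code; `expand_step_target` is a made-up name):

# Hypothetical sketch of the proposed expansion, not actual SoS code:
# expand a bare name or a wildcard pattern into the numerically
# indexed steps that belong to it.
import fnmatch
import re

def expand_step_target(name, declared_steps):
    """Expand 'hg19_reference' or 'hg19_reference_*' to matching steps."""
    pattern = name if any(c in name for c in '*?[') else name + '_*'
    matched = [s for s in declared_steps
               if fnmatch.fnmatch(s, pattern)
               and re.fullmatch(r'\d+', s.rsplit('_', 1)[-1])]
    # keep numeric ordering so the steps behave like a mini-workflow
    return sorted(matched, key=lambda s: int(s.rsplit('_', 1)[-1]))

steps = ['hg19_reference_1', 'hg19_reference_2', 'hg19_reference_3', 'default']
print(expand_step_target('hg19_reference', steps))
# -> ['hg19_reference_1', 'hg19_reference_2', 'hg19_reference_3']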
There is a minor difference here: we claim the "connectedness" of workflow steps, so in `sos_run('a')` we know that `a_2` would inherit output from `a_1`. With `sos_step('a_1')`, `sos_step('a_2')` (or `sos_step('a')` as you proposed), the steps are not connected and are executed as separate steps.
So in the end there might be a difference between
[C]
sos_run('A')
[default]
depends: sos_step('C')
and
[default]
depends: sos_step('A')
although the latter might just work in the majority of cases.
Okay, but I guess if `sos_step('hg19_reference_*')` does not assume connectedness, there might be more confusion than benefit, because my `hg19_reference` workflow does assume that the steps involved follow numeric ordering ...
OK, I can ensure order, so `sos_step('a')` would essentially be `sos_workflow('a')`, which actually makes the latter unnecessary, because `sos_workflow('a+b')` is rarely needed and users can use the extra-step trick if needed.
The problem with the implementation is that we use separate DAGs for `sos_run('a')`, but in this case the steps would be added to the master DAG, which assumes a single numerically ordered forward-style trunk. Adding multiple numerically ordered mini-workflows would complicate the DAG building (though it is not necessarily difficult).
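To make the DAG concern concrete, here is a toy sketch with networkx (purely illustrative, not SoS internals): the mini-workflow enters the master DAG as a pre-wired chain rather than as a separate DAG.

# Toy illustration with networkx, not SoS internals: a numerically ordered
# mini-workflow is inserted into the master DAG as a pre-wired chain.
import networkx as nx

dag = nx.DiGraph()
dag.add_node('default')

mini = ['hg19_reference_1', 'hg19_reference_2', 'hg19_reference_3']
nx.add_path(dag, mini)               # _1 -> _2 -> _3, preserving order
dag.add_edge(mini[-1], 'default')    # default depends on the last step

print(list(nx.topological_sort(dag)))
# ['hg19_reference_1', 'hg19_reference_2', 'hg19_reference_3', 'default']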
Hmm, now I guess my nested dependency has made things more complicated:
$ sos run RNASeqDE.ipynb align --samples SRR1039508
ERROR: Defined output fail to produce expected output: /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/hg19.2bit /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/twoBitToFa generated, /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/genomeParameters.txt expected.
But `sos run RNASeqDE.ipynb star` now works as expected, thanks to the latest patch. I'll run that step for now because I know it can take a few hours, and will test the other steps after we sort this out.
Unfortunately, `sos_step('a')` does not work because it fails to connect the input/output. Here is an MWE:
[hg_1]
output: "1.txt"
run:
touch 1.txt
[hg_2]
output: "2.txt"
run: expand = True
echo {_input[0]} > 2.txt
[default]
depends: sos_step('hg')
and it throws `IndexError: list index out of range`.
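For what it's worth, a workaround sketch, assuming the failure is that `hg_2` receives an empty `_input` because the steps are not connected (so `_input[0]` is out of range): declaring the input of `hg_2` explicitly routes the dependency through the file target instead.

[hg_1]
output: "1.txt"
run:
  touch 1.txt

[hg_2]
# explicit input, so the step no longer relies on an inherited _input
input: "1.txt"
output: "2.txt"
run: expand = True
  echo {_input[0]} > 2.txt

[default]
depends: sos_step('hg')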
Okay, now I get this error:
$ sos run RNASeqDE.ipynb star
INFO: download (index=0) is ignored due to saved signature
INFO: output: /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/hg19.2bit, /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/twoBitToFa
INFO: decompress hg19.fa (index=0) is ignored due to saved signature
INFO: output: /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa
INFO: gene annotations (index=0) is ignored due to saved signature
INFO: output: /home/gaow/GIT/LargeFiles/RNAseqDE/hg19/Homo_sapiens.GRCh38.91.gtf.gz
INFO: Target unavailable: sos_step("hg19_reference")
ERROR: Completed target sos_step("hg19_reference") is being re-executed. Please report this bug to SoS developers.
Trying to figure out how to trigger this bug from the test case
[a_1]
output: "a_1"
sh:
echo whatever > a_1
[a_2]
output: "a_2"
sh: expand=True
cp {_input} {_output}
[default]
depends: sos_step('a')
It was caused by the `_` inside `hg19_reference`. Fixed now.
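For the record, the pitfall was along these lines (a simplified reconstruction, not the actual patch): splitting a step name on the first underscore mangles workflow names that themselves contain underscores, so the numeric index must be taken from the last one.

# Simplified reconstruction of the underscore pitfall, not the actual patch.
name = 'hg19_reference_1'

wf, idx = name.split('_', 1)    # wrong: splits at the first '_'
print(wf, idx)                  # hg19 reference_1

wf, idx = name.rsplit('_', 1)   # right: the index follows the last '_'
print(wf, idx)                  # hg19_reference 1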
At this point I was able to test and verify the two commands separately:
sos run RNASeqDE.ipynb star
sos run RNASeqDE.ipynb align
but `sos run RNASeqDE.ipynb align` by itself still fails. See this thread for details: #983 (comment)
I see
$ sos run RNASeqDE.ipynb star
INFO: download (index=0) is ignored due to saved signature
INFO: output: /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.2bit, /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/twoBitToFa
INFO: decompress hg19.fa (index=0) is ignored due to saved signature
INFO: output: /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa
INFO: gene annotations (index=0) is ignored due to saved signature
INFO: output: /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/Homo_sapiens.GRCh38.91.gtf.gz
STAR --runMode genomeGenerate \
--genomeDir /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19 \
--genomeFastaFiles /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa \
--sjdbGTFtagExonParentTranscript /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/Homo_sapiens.GRCh38.91.gtf.gz \
--runThreadN 4
Jun 16 14:13:27 ..... started STAR run
Jun 16 14:13:27 ... starting to generate Genome files
EXITING because of INPUT ERROR: could not open genomeFastaFile: /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa
Jun 16 14:13:27 ...... FATAL ERROR, exiting
ERROR: [star]: Executing script in docker returns an error (exitcode=104).
The script has been saved to /Users/bpeng1/sos/sos-docs/src/examples/.sos/docker_run_76367.sh. To reproduce the error please run:
docker run --rm -v /private/tmp:/private/tmp -v /Users/bpeng1/sos/sos-docs/src/examples/.sos/docker_run_76367.sh:/var/lib/sos/docker_run_76367.sh -t -P -w=/private/tmp -u 1985961928:895809667 gaow/debian-ngs /bin/bash -ev /var/lib/sos/docker_run_76367.sh
I suppose this is because the LargeFiles data is outside of the current working directory, right? Note that we do not mount home by default now.
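A plain docker illustration of the problem (not SoS-specific syntax; the path is taken from the log above and is machine-specific): a file outside the mounted directories simply does not exist inside the container, and an extra -v mount makes it visible.

# Illustration only: hg19.fa exists on the host but is invisible in the
# container unless its directory is mounted; adjust the path to your machine.
docker run --rm \
    -v /private/tmp:/private/tmp \
    -v /Users/bpeng1/GIT/LargeFiles:/Users/bpeng1/GIT/LargeFiles \
    gaow/debian-ngs \
    ls /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa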
@BoPeng yes, and I have already updated the notebook to reflect the change and added a note. Please pull the latest version.
Currently the `star` pipeline by itself works, though you are welcome to test it. The issue is that we'd want that step to be triggered by `align`, so people only have to type one command. It is designed that way, but it is not working now.
On my machine, I got the error with 2G of RAM, then also with 11G, and I am increasing it to 20G. The error message is clear enough, though.
ERROR: [star]: Script killed by docker, probably because of lack of RAM (available RAM=11.2GB, exitcode=137). The script has been saved to /Users/bpeng1/sos/sos-docs/src/examples/.sos/docker_run_82344.sh. To reproduce the error please run:
Yes, I have made a note that it requires 30G+ of memory. My computer has 64G. This example is meant to be reproduced, so I didn't make it an external task. The other RNA-seq example is an external task with a 96G memory requirement and is not meant to be reproduced.
Come on, I only ordered 32G of RAM for my new Mac and thought that it would be enough even for virtual machines... Now I cannot run the star step.
Well, I guess that is why many people want to use tophat instead. But there is a sparse mode in STAR that will halve the memory usage; I believe I have implemented its interface in this notebook too. I assume users can reserve an interactive cluster node for the full version, though? I will make a note.
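The sparse mode I mean is STAR's --genomeSAsparseD option; if I remember correctly, a value of 2 samples the suffix array at every second position, roughly halving the index's RAM footprint at some alignment-speed cost. Something like the following, reusing the paths from your log:

# --genomeSAsparseD 2 builds a sparse suffix array that needs roughly half
# the RAM of the default (at some speed cost); paths as in the log above.
STAR --runMode genomeGenerate \
    --genomeDir /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19 \
    --genomeFastaFiles /Users/bpeng1/GIT/LargeFiles/RNAseqDE/hg19/hg19.fa \
    --genomeSAsparseD 2 \
    --runThreadN 4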
BTW there should be no need to use a VM for this one, because I think most of the tools are dockerized. Maybe it fits if you don't use a VM? My desktop is a $2600 Linux machine with 40 threads and 64G of memory ...
$ sos run RNASeqDE.ipynb align --samples SRR1039508
should be fixed.