For which libraries should output be generated?
marcelm opened this issue · 2 comments
The pipeline currently generates BAM and BigWig outputs for all possible pools and their constituent replicates, even when these are not listed in groups.tsv.
For example, the groups.tsv file for the test dataset lists these normalization pairs:
H3K4m3_SL_CTR pooled IN_SL_CTR
H3K4m3_SL_CTR 2 IN_SL_CTR
H3K4m3_2i_CTR 1 IN_2i_CTR
H3K4m3_2i_CTR 2 IN_2i_CTR
From this, the pipeline produces four corresponding .scaled.bw output files:
H3K4m3_SL_CTR_pooled.scaled.bw
H3K4m3_SL_CTR_rep2.scaled.bw
H3K4m3_2i_CTR_rep1.scaled.bw
H3K4m3_2i_CTR_rep2.scaled.bw
This makes sense to me.
However, many more unscaled BigWig files are generated. The number is expected to double, because unscaled files are created for both treatments and controls, while scaled files are created only for the treatments. But there are more than that. Looking only at the treatments, these are extra:
H3K4m3_2i_CTR_pooled.unscaled.bw
H3K4m3_SL_CTR_rep1.unscaled.bw
Looking at the Snakefile, it turns out that the pipeline generates files for all possible pools and all the replicates that make up those pools. These are the responsible lines: https://github.com/NBISweden/minute/blob/cb53c4ab4ec9e9a3bb58f9cb3eafb080c6eb24d9/Snakefile#L64-L69.
Which behavior is the desired one?
I would prefer if only the explicitly desired files are generated because it is more efficient and saves space. Changing the above to only generate the desired files reduces the number of steps for the test dataset from 171 to 129.
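To make the proposal concrete, here is a minimal sketch of deriving the output list from groups.tsv alone. It is not the pipeline's actual code. It assumes that groups.tsv has whitespace-separated columns (treatment, replicate, control), that each control is taken with the same replicate label as its treatment, and that file names follow the pattern seen above. The names `parse_groups`, `library_label` and `requested_bigwigs` are hypothetical.

```python
from typing import NamedTuple


class Group(NamedTuple):
    treatment: str
    replicate: str  # a replicate id such as "1", or the literal "pooled"
    control: str


def parse_groups(text):
    """Parse groups.tsv content (assumed columns: treatment, replicate, control)."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        treatment, replicate, control = line.split()[:3]
        groups.append(Group(treatment, replicate, control))
    return groups


def library_label(name, replicate):
    """Build e.g. 'H3K4m3_SL_CTR_rep2' or 'H3K4m3_SL_CTR_pooled'."""
    suffix = "pooled" if replicate == "pooled" else f"rep{replicate}"
    return f"{name}_{suffix}"


def requested_bigwigs(groups):
    """Return only the BigWig files implied by the rows of groups.tsv."""
    targets = set()
    for g in groups:
        treatment = library_label(g.treatment, g.replicate)
        # Assumption: the control uses the same replicate label as the treatment.
        control = library_label(g.control, g.replicate)
        targets.add(f"{treatment}.scaled.bw")
        targets.add(f"{treatment}.unscaled.bw")
        targets.add(f"{control}.unscaled.bw")
    return sorted(targets)


GROUPS_TSV = """\
H3K4m3_SL_CTR pooled IN_SL_CTR
H3K4m3_SL_CTR 2 IN_SL_CTR
H3K4m3_2i_CTR 1 IN_2i_CTR
H3K4m3_2i_CTR 2 IN_2i_CTR
"""

targets = requested_bigwigs(parse_groups(GROUPS_TSV))
```

With this approach, H3K4m3_2i_CTR_pooled.unscaled.bw and H3K4m3_SL_CTR_rep1.unscaled.bw would no longer be requested, because no row in groups.tsv refers to them.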
I did know it behaved like this, and I have been thinking about it. There can be situations in which we want to generate only a subset of these, for example if there are libraries corresponding to different experiments in the same pool. In that case you are totally right: only the corresponding unscaled .bw files should be generated.
On the other hand, in general I'd say we want to generate all the unscaled .bw files, and we also want to generate all the relevant scaled .bw files. I also think it is easy to make a mistake when defining groups.tsv and forget to specify a few of the scaled samples. So maybe it would be interesting to generate only the files listed in groups.tsv, but cross-check this and emit a warning if there are libraries in libraries.tsv that are not used in groups.tsv.
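The cross-check could look roughly like the sketch below. It assumes libraries are identified by (sample, replicate) pairs and that a "pooled" row in groups.tsv counts as using every replicate of that sample. The function names and the EXTRA_LIB entry are made up for illustration and are not part of the pipeline or the test dataset.

```python
import warnings


def unused_libraries(libraries, groups):
    """Return (sample, replicate) pairs from libraries.tsv not used in groups.tsv.

    libraries: iterable of (sample, replicate)
    groups: iterable of (treatment, replicate, control)
    """
    used = set()    # explicitly referenced (sample, replicate) pairs
    pooled = set()  # samples referenced as a pool (assumed to cover all replicates)
    for treatment, replicate, control in groups:
        for sample in (treatment, control):
            if replicate == "pooled":
                pooled.add(sample)
            else:
                used.add((sample, str(replicate)))
    return sorted(
        (sample, str(replicate))
        for sample, replicate in libraries
        if sample not in pooled and (sample, str(replicate)) not in used
    )


def warn_about_unused(libraries, groups):
    """Emit one warning per library that groups.tsv never refers to."""
    unused = unused_libraries(libraries, groups)
    for sample, replicate in unused:
        warnings.warn(
            f"Library {sample} replicate {replicate} is listed in "
            "libraries.tsv but not used in groups.tsv"
        )
    return unused


GROUPS = [
    ("H3K4m3_SL_CTR", "pooled", "IN_SL_CTR"),
    ("H3K4m3_SL_CTR", "2", "IN_SL_CTR"),
    ("H3K4m3_2i_CTR", "1", "IN_2i_CTR"),
    ("H3K4m3_2i_CTR", "2", "IN_2i_CTR"),
]

# Hypothetical library list; EXTRA_LIB is invented to trigger the warning.
LIBRARIES = [
    ("H3K4m3_SL_CTR", "1"), ("H3K4m3_SL_CTR", "2"),
    ("H3K4m3_2i_CTR", "1"), ("H3K4m3_2i_CTR", "2"),
    ("IN_SL_CTR", "1"), ("IN_SL_CTR", "2"),
    ("IN_2i_CTR", "1"), ("IN_2i_CTR", "2"),
    ("EXTRA_LIB", "1"),
]

unused = unused_libraries(LIBRARIES, GROUPS)
```

A warning rather than an error keeps the case of several experiments sharing one pool working, while still catching a forgotten row in groups.tsv.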