elsasserlab/minute

For which libraries should output be generated?

marcelm opened this issue · 2 comments

The pipeline currently generates BAM and BigWig outputs for all possible pools and their constituent replicates, even when these are not listed in groups.tsv.

For example, the groups.tsv file for the test dataset lists these normalization pairs:

H3K4m3_SL_CTR  pooled  IN_SL_CTR
H3K4m3_SL_CTR  2       IN_SL_CTR
H3K4m3_2i_CTR  1       IN_2i_CTR
H3K4m3_2i_CTR  2       IN_2i_CTR

From this, the pipeline produces four corresponding .scaled.bw output files:

  • H3K4m3_SL_CTR_pooled.scaled.bw
  • H3K4m3_SL_CTR_rep2.scaled.bw
  • H3K4m3_2i_CTR_rep1.scaled.bw
  • H3K4m3_2i_CTR_rep2.scaled.bw

This makes sense to me.

However, many more unscaled BigWig files are generated. Part of that is expected: the number doubles because scaled BigWig files are created only for the treatment samples, not for the controls. But there are more files than that accounts for. Looking only at the treatments, these are extra:

  • H3K4m3_2i_CTR_pooled.unscaled.bw
  • H3K4m3_SL_CTR_rep1.unscaled.bw

Looking at the Snakefile, it turns out that the pipeline generates files for all possible pools and all the replicates that make up those pools. These are the responsible lines: https://github.com/NBISweden/minute/blob/cb53c4ab4ec9e9a3bb58f9cb3eafb080c6eb24d9/Snakefile#L64-L69.

Which behavior is the desired one?

I would prefer if only the explicitly desired files are generated because it is more efficient and saves space. Changing the above to only generate the desired files reduces the number of steps for the test dataset from 171 to 129.
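To make the proposal concrete, here is a minimal sketch of deriving the BigWig targets only from the rows of groups.tsv. It is not the actual Snakefile code: the function names `parse_groups` and `requested_bigwigs` are hypothetical, the file-name pattern is inferred from the output names listed above, and it assumes that the replicate column applies to the control as well as to the treatment.

```python
from typing import List, Tuple

# The four normalization pairs from the test dataset's groups.tsv (see above).
GROUPS_TSV = """\
H3K4m3_SL_CTR\tpooled\tIN_SL_CTR
H3K4m3_SL_CTR\t2\tIN_SL_CTR
H3K4m3_2i_CTR\t1\tIN_2i_CTR
H3K4m3_2i_CTR\t2\tIN_2i_CTR
"""

Row = Tuple[str, str, str]  # (treatment, replicate, control)


def parse_groups(text: str) -> List[Row]:
    """Parse whitespace-separated rows: treatment, replicate, control."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        treatment, replicate, control = line.split()
        rows.append((treatment, replicate, control))
    return rows


def suffix(replicate: str) -> str:
    """Map the replicate column to the file-name suffix (assumed pattern)."""
    return "pooled" if replicate == "pooled" else f"rep{replicate}"


def requested_bigwigs(rows: List[Row]) -> Tuple[List[str], List[str]]:
    """Return (scaled, unscaled) targets for the listed rows only."""
    scaled: List[str] = []
    unscaled: List[str] = []
    for treatment, replicate, control in rows:
        s = suffix(replicate)
        # Scaled files exist only for treatments.
        scaled.append(f"{treatment}_{s}.scaled.bw")
        # Unscaled files exist for the treatment and its control
        # (assumption: the control uses the same replicate/pool).
        for name in (treatment, control):
            path = f"{name}_{s}.unscaled.bw"
            if path not in unscaled:
                unscaled.append(path)
    return scaled, unscaled


scaled, unscaled = requested_bigwigs(parse_groups(GROUPS_TSV))
print("\n".join(scaled + unscaled))
```

With this approach, H3K4m3_2i_CTR_pooled.unscaled.bw and H3K4m3_SL_CTR_rep1.unscaled.bw are never requested, because no row in groups.tsv asks for them.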

I did know it behaved like this, and I have been thinking about it. There can be situations in which we want to generate only a subset of these. For example, if there are libraries corresponding to different experiments in the same pool. And in that case you are totally right, only the corresponding unscaled .bw files should be generated.

On the other hand, in general I'd say we want to generate all the unscaled .bw files and all the relevant scaled .bw files. It is also easy to make a mistake when defining groups.tsv and forget to specify a few of the scaled samples. So maybe the pipeline should generate only the files listed in groups.tsv, but cross-check this against libraries.tsv and emit some kind of warning if there are libraries in libraries.tsv that are not used in groups.tsv.
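The cross-check could look roughly like the sketch below. It is only an illustration under stated assumptions: the library list is made up (including the extra H3K27m3_SL_CTR library, which is hypothetical), each library is reduced to a (sample name, replicate) pair, a "pooled" row is assumed to use every replicate of both the treatment and the control, and the function names are not taken from the pipeline.

```python
import warnings
from typing import Iterable, List, Tuple

Library = Tuple[str, str]  # (sample name, replicate)
Group = Tuple[str, str, str]  # (treatment, replicate, control)


def unused_libraries(
    libraries: Iterable[Library], groups: Iterable[Group]
) -> List[Library]:
    """Return libraries that no row of groups.tsv refers to."""
    used = set()  # individual (name, replicate) pairs
    pooled = set()  # names for which a pooled row uses all replicates
    for treatment, replicate, control in groups:
        for name in (treatment, control):
            if replicate == "pooled":
                pooled.add(name)
            else:
                used.add((name, str(replicate)))
    return [
        (name, str(rep))
        for name, rep in libraries
        if name not in pooled and (name, str(rep)) not in used
    ]


def warn_about_unused(libraries, groups) -> None:
    """Emit one warning per library that groups.tsv never uses."""
    for name, rep in unused_libraries(libraries, groups):
        warnings.warn(
            f"Library {name} replicate {rep} is in libraries.tsv "
            "but not used in groups.tsv"
        )


# Rows of groups.tsv as listed in the issue.
groups = [
    ("H3K4m3_SL_CTR", "pooled", "IN_SL_CTR"),
    ("H3K4m3_SL_CTR", "2", "IN_SL_CTR"),
    ("H3K4m3_2i_CTR", "1", "IN_2i_CTR"),
    ("H3K4m3_2i_CTR", "2", "IN_2i_CTR"),
]

# Hypothetical library list; the last entry is deliberately unused.
libraries = [
    ("H3K4m3_SL_CTR", "1"), ("H3K4m3_SL_CTR", "2"),
    ("H3K4m3_2i_CTR", "1"), ("H3K4m3_2i_CTR", "2"),
    ("IN_SL_CTR", "1"), ("IN_SL_CTR", "2"),
    ("IN_2i_CTR", "1"), ("IN_2i_CTR", "2"),
    ("H3K27m3_SL_CTR", "1"),
]

warn_about_unused(libraries, groups)
```

A warning rather than an error keeps the case where several experiments share a pool valid, while still flagging a forgotten row.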

I'm cleaning up some redundant issues: this one has also been fixed in #138.