elsasserlab/minute

stats_summary.txt has duplicated input lines

cnluzon opened this issue · 0 comments

Input results get duplicated when the same Input is used for different IPs (which is usually the case). It seems that, for a given barcode, one gets as many lines of Input stats as the number of times this Input appears in groups.tsv.

I need to look at it in more detail, but I think that is the cause (or at least related to it).
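A minimal sketch of the suspected mechanism, not minute's actual code: if stats rows are emitted once per groups.tsv row, an Input shared by several IPs shows up once per IP. The column order (IP name, replicate, Input name, scaling group, reference) is inferred from the rows in this issue, the H3K4m3 row is a hypothetical stand-in for an existing test-data entry, and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical groups.tsv excerpt. The first row stands in for an entry
# already present in the test data; the other two come from this issue.
GROUPS_TSV = (
    "H3K4m3_SL_CTR\t1\tIN_SL_CTR\tgroup1\tmini\n"
    "H3K27m3_SL_CTR\t1\tIN_SL_CTR\tgroup3\tmini\n"
    "H3K27m3_2i_CTR\t1\tIN_2i_CTR\tgroup3\tmini\n"
)


def inputs_per_row(text):
    """Return one (input name, replicate) pair per groups.tsv row."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return [(row[2], row[1]) for row in rows if row]


def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    return list(dict.fromkeys(items))


per_row = inputs_per_row(GROUPS_TSV)
print(per_row)                   # IN_SL_CTR replicate 1 is listed twice
print(unique_in_order(per_row))  # each Input library listed once
```

If this is indeed the cause, collecting the Input libraries into a de-duplicated list before writing their stats would be one way to avoid the repeated lines.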

How to reproduce it with the current test data (minute-testdata-0.9):

Duplicate testdata-H3K4m3_R{1,2}.fastq.gz and give the copies a different name: testdata-H3K27m3_R{1,2}.fastq.gz

Then append these lines to the end of libraries.tsv:

H3K27m3_SL_CTR	1	CATGCTTA	testdata-H3K27m3
H3K27m3_SL_CTR	2	GCACATCT	testdata-H3K27m3
H3K27m3_2i_CTR	1	GGTCCAGA	testdata-H3K27m3
H3K27m3_2i_CTR	2	GTATAACA	testdata-H3K27m3

And these to the end of groups.tsv:

H3K27m3_SL_CTR	pooled	IN_SL_CTR	group3	mini
H3K27m3_SL_CTR	1	IN_SL_CTR	group3	mini
H3K27m3_2i_CTR	1	IN_2i_CTR	group3	mini
H3K27m3_2i_CTR	2	IN_2i_CTR	group3	mini

You'll get duplicated entries in the stats_summary.txt for the input.
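To confirm the duplication after a run, a small check like the following may help. It is only a sketch: it assumes nothing about the layout of stats_summary.txt beyond one record per line, and the function names are hypothetical.

```python
from collections import Counter


def find_duplicate_lines(lines):
    """Map each non-empty line that occurs more than once to its count."""
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


def report_duplicates(path):
    """Print every repeated line of the file at `path` with its count."""
    with open(path) as handle:
        for line, n in find_duplicate_lines(handle).items():
            print(f"{n}x\t{line}")
```

Calling `report_duplicates("stats_summary.txt")` with the path adjusted to wherever the pipeline writes the file should list the repeated Input rows, if any.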

As a side note, the Input rows end up interleaved with the others; I think ordering the rows per FASTQ file would be nicer.