elsasserlab/minute

Library size estimation not showing up in stats

cnluzon opened this issue · 2 comments

I thought that Picard only produced an empty library size estimate when the sample is very small, but I have always seen an NA value in the final stats, so it is either a) a problem in our parsing or b) a problem in Picard.

It seems that Picard does not estimate library size when duplicates are marked as single-end. Even though our data is paired-end, we mark duplicates using only the first mate, and I assume Je does the same thing as Picard, so that field is always left empty.
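To rule out the parsing side, here is a minimal sketch of reading the `ESTIMATED_LIBRARY_SIZE` column from a Picard-style duplication metrics file and mapping an empty field to `None`. The function names and the sample text are hypothetical, and the sample uses a reduced set of columns for brevity; this is not how minute actually parses the file.

```python
def read_duplication_metrics(text):
    """Return the first metrics row as a dict of column name -> raw string."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            # Pad in case trailing empty fields were stripped by some tool.
            values += [""] * (len(header) - len(values))
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")


def estimated_library_size(metrics):
    """Return the estimate as an int, or None if the field is empty."""
    raw = metrics.get("ESTIMATED_LIBRARY_SIZE", "").strip()
    return int(raw) if raw else None


# Made-up fragment: the last field is empty, as seen with single-end marking.
SAMPLE_EMPTY = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tUNPAIRED_READS_EXAMINED\tREAD_PAIRS_EXAMINED\t"
    "PERCENT_DUPLICATION\tESTIMATED_LIBRARY_SIZE\n"
    "lib1\t1000\t0\t0.25\t\n"
)

# Made-up fragment with a value present, for comparison.
SAMPLE_FILLED = SAMPLE_EMPTY.replace("0.25\t\n", "0.25\t5000\n")

print(estimated_library_size(read_duplication_metrics(SAMPLE_EMPTY)))
print(estimated_library_size(read_duplication_metrics(SAMPLE_FILLED)))
```

If the field is genuinely empty in the metrics file, the NA comes from the tool and not from the parser.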

In that case I would remove the field from the summary stats, since it is always going to be an NA value. However, this is a number we are usually interested in (@simonelsasser can you confirm this?).

I am not quite sure why they do not estimate this for single-end data. From what they say here, this seems like a probability model that should also work with single-end data, but maybe there is something I'm missing. I haven't been able to find an explicit answer. One option is to solve the equation ourselves (I have a partial translation of the Picard code to Python for this; Picard has an MIT license, so that would be legitimate to do anyway), but then I'd need to be sure that it is correct to do so from our duplicate marks. In our case, we only choose to mark the duplicates this way; the data is still paired-end, so any experimental procedure that influences these probabilities would still work the same.
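For reference, this is a rough sketch of what solving the equation ourselves could look like. It assumes the Lander-Waterman relation C/X = 1 - exp(-N/X), where N is the number of reads examined, C the number of unique reads and X the library size, and finds X by bisection in the same spirit as Picard. The function names are hypothetical, and this is not a verified port of the Picard code. Whether it is valid to feed it counts from first-mate-only marking is exactly the open question.

```python
import math


def _f(x, c, n):
    # Zero when x satisfies c / x = 1 - exp(-n / x).
    return c / x - 1.0 + math.exp(-n / x)


def estimate_library_size(reads_examined, unique_reads):
    """Estimate library size; return None if it cannot be estimated."""
    duplicates = reads_examined - unique_reads
    if reads_examined <= 0 or unique_reads <= 0 or duplicates <= 0:
        return None

    # Search for X as a multiple of the number of unique reads.
    lo, hi = 1.0, 100.0
    if _f(lo * unique_reads, unique_reads, reads_examined) < 0:
        raise ValueError("invalid counts for library size estimation")

    # Widen the upper bound until the function changes sign.
    while _f(hi * unique_reads, unique_reads, reads_examined) > 0:
        hi *= 10.0

    # Plain bisection on the multiplier.
    for _ in range(40):
        mid = (lo + hi) / 2.0
        u = _f(mid * unique_reads, unique_reads, reads_examined)
        if u == 0:
            lo = hi = mid
            break
        if u > 0:
            lo = mid
        else:
            hi = mid

    return int(unique_reads * (lo + hi) / 2.0)


# Illustrative numbers only: 500,000 reads examined, 393,469 of them unique.
print(estimate_library_size(500_000, 393_469))
```

With single-end marking, the inputs would presumably be the unpaired reads examined and that number minus the unpaired read duplicates, instead of the read-pair counts.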

Another, maybe easier, possibility is to add other library complexity estimators, such as preseq, as complementary information.

This was fixed in PR #160