deeptools/py2bit

Bases do not sum to one when sequence has no N's

Closed this issue · 5 comments

Hi,

I'm running extraction of GC contents from 10bp bins and got curious when the percentages where not in [0.1, 0.2, 0.3, ...]. I get that N bases could cause this, but it seems that it happens even when the sequence only has ACGTs.

An example using HG38 reference:

# 10 bases - All ACGT
tb.sequence("chr1", 792530, 792540)  
>> 'CAACCTTCAG'

# Counts sum to 10
tb.bases("chr1", 792530, 792540, False)
>> {'A': 3, 'C': 4, 'T': 2, 'G': 1} 

# Would expect 3/10=0.3, 4/10=0.4, etc.
tb.bases("chr1", 792530, 792540) 
>> {'A': 0.25, 'C': 0.3333333333333333, 'T': 0.16666666666666666, 'G': 0.08333333333333333}

# Where did the remaining 0.1666 go?
sum(tb.bases("chr1", 792530, 792540).values()) 
>> 0.8333333333333333

What am I missing?

Edit: Is it dividing by 12 instead of 10? And if so, why?

Best,
Ludvig

I guess it happens in *twobitBasesWorker where you divide the counts with len but len is defined as:

len = end - start + (start % 4)

Oh, good catch, that's...I'm not sure why I did that. Let me look into it.

Now I see why I did that. Yeah, that that variable is used for the length of the relevant sequence is very obviously wrong. The modulus 4 thing is due to sequence being stored in 4-base chunks in the files, but you're absolutely correct that that should NOT be used when computing the per-base fractions.

I guess, it's mostly problematic with small bins like this, but great that it gets fixed! :-)

Yup, the maximum error is 3 bases, so it should be a small error for larger bins. For a 1-base bin, though... I'll try to get the CI fixed today so I can properly get the testing working again.