Wrong dtypes used for seqnames in `GenomicRanges._sanitize_seqnames` and should use max value not len of unique values as `num_uniq`
Closed this issue · 2 comments
DiDeoxy commented
Correct code:
def _sanitize_seqnames(self, seqnames, seqinfo):
if self._reverse_seqindex is None:
self._build_reverse_seqindex(seqinfo)
if not isinstance(seqnames, np.ndarray):
seqnames = np.asarray([self._reverse_seqindex[x] for x in seqnames])
num_uniq = np.max(seqnames)
if num_uniq < 2**8:
seqnames = seqnames.astype(np.uint8)
elif num_uniq < 2**16:
seqnames = seqnames.astype(np.uint16)
elif num_uniq < 2**32:
seqnames = seqnames.astype(np.uint32)
return seqnames
Max value should be used since this is the value of concern, e.g. you could have 4 unique values of [1, 2, 3, 257] 257 would be wrapped around 0 in this case. Max value avoids this.
You need to use uint types as int types are singed so their max value is half of your if statements. You could reduce the exponents by one but you will never have negative values for indices so uint types make more sense.
jkanche commented
Thank you, fixing this in the attached PR and this issue will close automatically once the PR is merged.