BiocPy/GenomicRanges

Wrong dtypes used for seqnames in `GenomicRanges._sanitize_seqnames` and should use max value not len of unique values as `num_uniq`

Closed this issue · 2 comments

Correct code:

    def _sanitize_seqnames(self, seqnames, seqinfo):
        if self._reverse_seqindex is None:
            self._build_reverse_seqindex(seqinfo)

        if not isinstance(seqnames, np.ndarray):
            seqnames = np.asarray([self._reverse_seqindex[x] for x in seqnames])

            num_uniq = np.max(seqnames)
            if num_uniq < 2**8:
                seqnames = seqnames.astype(np.uint8)
            elif num_uniq < 2**16:
                seqnames = seqnames.astype(np.uint16)
            elif num_uniq < 2**32:
                seqnames = seqnames.astype(np.uint32)

        return seqnames

Max value should be used since this is the value of concern, e.g. you could have 4 unique values of [1, 2, 3, 257] 257 would be wrapped around 0 in this case. Max value avoids this.

You need to use uint types as int types are singed so their max value is half of your if statements. You could reduce the exponents by one but you will never have negative values for indices so uint types make more sense.

Thank you, fixing this in the attached PR and this issue will close automatically once the PR is merged.

@jkanche you didn't change the np.int# types to np.uint# types. Using np.int causes wrapping around zero between 127 and 256 and 32,767 and 65,536 seqnames (with guaranteed wrapping after 2,147,483,647 seqnames instead of the intended 4,294,967,296)