CAST-genomics/haptools

exclude haplotypes if they don't fit within a `--region` perfectly

Closed this issue · 0 comments

Current behavior

Users can specify a --region for the transform or simphenotype command. This will subset to just the haplotypes within a provided region.

But what if a haplotype doesn't exactly fit within the region? Currently, it still gets included, but its variants are truncated so that it contains only those variants that fit within the region.

# exclude variants outside the desired region
hap_region = hap_id
if region:
hap_region = hap_id + ":" + region_positions
# fetch region
# we already know that each line will start with a V, so we don't
# need to check that
for line in haps_file.fetch(hap_region):

This behavior is undesirable for a number of reasons. Not least b/c it's probably kinda unexpected.

Desired behavior

If the haplotype doesn't exactly fit within the region, we should just exclude it here:

for line in haps_file.fetch(region):

We might already do this, actually. It just depends on what the default behavior for pysam.TabixFile.fetch() is. If it's similar to the tabix command, then it should return anything that overlaps, regardless of whether it fits perfectly.