This Python script queries an ENSEMBL database and infers mammalian diversity in the region around each listed site.
The script takes three files:
- An alleles file giving the chromosome, position, and both expected alleles.
- A bed file with information on repeats in the region. This file is not actually used in this step, but its last three columns are:
  - Number of bases in a window (in this case 104 bases) that are covered by simple repeats.
  - Number of bases in a window (in this case 104 bases) that are unmasked by the Heng Li 35mer mask.
  - Is the target base unmasked by the Heng Li 35mer mask? (0 or 1)
- Sites from the "ape" file, as defined here: https://dx.doi.org/10.17617/3.5h
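As a rough illustration of the bed file columns described above, the sketch below reads one line into named fields. It assumes the file is whitespace-delimited with the three repeat/mask columns at the end, and that the bed end coordinate is the 1-based position used in the alleles file (the standard BED convention). The names `RepeatRecord`, `parse_rpt_map_line`, and all field names are hypothetical and are not taken from query_ensembl_for_sites.py.

```python
from typing import NamedTuple

WINDOW_SIZE = 104  # window length quoted in the column descriptions


class RepeatRecord(NamedTuple):
    """One line of the repeat bed file (field names are illustrative)."""
    chrom: str
    start: int                # 0-based start (BED convention)
    end: int                  # BED end, i.e. 1-based position of the target base
    simple_repeat_bases: int  # bases in the window covered by simple repeats
    unmasked_bases: int       # bases in the window unmasked by the 35mer mask
    target_flag: int          # 0 or 1 flag for the target base


def parse_rpt_map_line(line: str) -> RepeatRecord:
    """Split a whitespace-delimited line and convert the numeric columns."""
    fields = line.split()
    chrom = fields[0]
    start, end = int(fields[1]), int(fields[2])
    # The last three columns are the ones described in the list above.
    repeat_bases, unmasked_bases, flag = (int(x) for x in fields[-3:])
    return RepeatRecord(chrom, start, end, repeat_bases, unmasked_bases, flag)


# Sample line copied from the snps.rpt_map listing in this document.
record = parse_rpt_map_line("1 752674 752675 0 54 1")
print(record.end, record.unmasked_bases / WINDOW_SIZE)
```

Dividing by the window size gives the fraction of the window that is unmasked, which may be easier to compare across sites than the raw count.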
$ head snps.alleles
1 752675 T C
1 812425 G A
1 812751 T C
1 813034 A G
1 814609 A T
1 821477 A G
1 821947 G A
1 822775 G A
1 826240 C A
1 834198 T C
$ head snps.rpt_map
1 752674 752675 0 54 1
1 812424 812425 0 33 1
1 812750 812751 0 12 1
1 813033 813034 0 12 1
1 814608 814609 0 2 1
1 821476 821477 0 26 1
1 821946 821947 0 35 1
1 822774 822775 0 33 1
1 834197 834198 0 104 1
1 834359 834360 0 67 1
$ head snps.ape
1 812425 G G G A A A G R G
1 812751 T C C C C C T T C
1 813034 A G G G G G R G A
1 814609 A A A A T T A W W
1 821477 A G G A A N A A G
1 821947 G G G G G N R G G
1 822775 G G G G G G A A G
1 826240 C T T T T T C M M
1 834198 T T T N T T C C T
1 835831 G G G N G G G G A
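The ape listing above contains IUPAC ambiguity codes (R, W, M) as well as N. A minimal sketch of how such a line could be expanded into sets of bases is shown below. It assumes whitespace-delimited input, treats N and any unrecognised code as "no information", and makes no assumption about which genome each column represents (that is defined at the link above). The names `IUPAC_BASES`, `parse_ape_line`, and `observed_bases` are hypothetical and not part of the script.

```python
# Standard IUPAC nucleotide codes; three-base codes are omitted for brevity.
IUPAC_BASES = {
    "A": frozenset("A"), "C": frozenset("C"),
    "G": frozenset("G"), "T": frozenset("T"),
    "R": frozenset("AG"), "Y": frozenset("CT"),
    "S": frozenset("CG"), "W": frozenset("AT"),
    "K": frozenset("GT"), "M": frozenset("AC"),
    "N": frozenset(),  # assumption: N is treated as missing data
}


def parse_ape_line(line):
    """Return (chrom, pos, calls), where calls is a list of base sets."""
    fields = line.split()
    chrom, pos = fields[0], int(fields[1])
    # Unknown codes fall back to an empty set rather than raising an error.
    calls = [IUPAC_BASES.get(code.upper(), frozenset()) for code in fields[2:]]
    return chrom, pos, calls


def observed_bases(calls):
    """Union of all bases seen across the columns of one site."""
    seen = set()
    for call in calls:
        seen |= call
    return seen


# Sample line copied from the snps.ape listing above.
chrom, pos, calls = parse_ape_line("1 812425 G G G A A A G R G")
print(chrom, pos, sorted(observed_bases(calls)))
```

The union can then be compared against the two expected alleles for the same position in the alleles file.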
python3.5 query_ensembl_for_sites.py 25 52 \
snps.alleles snps.rpt_map snps.ape \
> snps.report.txt
The output file (snps.report.txt) contains a lot of debugging information. To get a table of capture sites, grep for lines that include REPORT.
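If it is more convenient to do this filtering in Python than with grep, something like the sketch below should work. It only assumes that the relevant lines contain the substring REPORT, as stated above; the layout of those lines is not assumed. The function name `report_lines` is hypothetical, and the demo lines are made-up placeholders, not real script output.

```python
def report_lines(lines):
    """Yield lines containing the substring REPORT, similar to `grep REPORT`."""
    for line in lines:
        if "REPORT" in line:
            yield line.rstrip("\n")


# Placeholder input for demonstration only (NOT real output from the script).
demo = [
    "placeholder debugging line\n",
    "REPORT placeholder table row\n",
]
for row in report_lines(demo):
    print(row)

# With the real output, pass an open file handle instead:
#   with open("snps.report.txt") as handle:
#       for row in report_lines(handle):
#           print(row)
```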