AncestryHMM_pipeline

Undergraduate research by Adrian Salguero and Liam McKay, under the guidance of Russell Corbett-Detig, PhD (UCSC Biomolecular Engineering)

What This Does

  • Takes a VCF (Variant Call Format) file of SNP data (NGS read pileup data, genotype data, ...)
  • Converts it into a panel file (like example.panel) for input into russcd/Ancestry_HMM

Example input:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA06984 NA06985 NA06986 NA06989 NA06994 NA07000 NA07037 NA07048 NA07051 NA07346 NA07347 NA07357 NA10847 NA10851 NA11829 NA11830 NA11831 NA11832 NA11840 NA11843 NA11881 NA11893 NA11918 NA11919 NA11920 NA11930 NA11992 NA11994 NA11995 NA12003 NA12004 NA12005 NA12006 NA12043 NA12044 NA12045 NA12058 NA12144 NA12154 NA12155 NA12156 NA12234 NA12249 NA12272 NA12273 NA12275 NA12282 NA12283 NA12286 NA12287 NA12340 NA12341 NA12342 NA12347 NA12348 NA12383 NA12400 NA12413 NA12414 NA12489 NA12546 NA12716 NA12717 NA12718 NA12748 NA12749 NA12750 NA12751 NA12760 NA12761 NA12762 NA12763 NA12775 NA12776 NA12812 NA12814 NA12815 NA12828 NA12829 NA12830 NA12842 NA12843 NA12872 NA12873 NA12874 NA12878 NA12889 NA12890 NA12891 NA12892
1 1105366 . T C . PASS AA=T;AC=4;AN=114;DP=3251 GT:DP ./.:0 ./.:0 0/0:107 ./.:0 ./.:0 0/0:25 0/0:30 0/0:31 0/0:57 0/0:69 0/0:53 0/0:225 ./.:0 0/0:6 0/0:79 ./.:0 0/0:110 0/0:79 ./.:0 ./.:0 ./.:0 0/0:43 1/0:54 0/0:7 0/0:89 0/0:87 0/0:98 0/0:83 ./.:0 0/0:62 0/0:1 0/0:4 ./.:0 0/0:97 ./.:0 0/0:115 ./.:0 0/0:77 0/0:8 0/0:63 ./.:0 0/0:92 ./.:0 0/0:1 0/0:1 ./.:0 ./.:0 ./.:0 ./.:0 0/0:76 ./.:0 ./.:0 ./.:0 0/0:41 0/0:35 1/0:135 ./.:0 1/0:116 0/0:6 ./.:0 0/0:147 ./.:0 ./.:0 0/0:4 0/0:40 1/0:23 ./.:0 0/0:1 0/0:2 ./.:0 0/0:7 0/0:1 0/0:90 0/0:49 ./.:0 0/0:6 ./.:0 0/0:82 0/0:31 0/0:7 0/0:9 0/0:7 ./.:0 ./.:0 ./.:0 0/0:176 0/0:3 0/0:81 0/0:67 0/0:156
1 1105411 . G A . PASS AA=G;AC=1;AN=106;DP=2676 GT:DP ./.:0 ./.:0 0/0:92 ./.:0 ./.:0 0/0:23 0/0:17 0/0:37 1/0:61 0/0:60 0/0:47 0/0:126 ./.:0 0/0:5 0/0:79 ./.:1 0/0:87 0/0:76 ./.:0 ./.:0 ./.:0 0/0:26 0/0:50 0/0:3 0/0:92 0/0:79 0/0:93 0/0:73 ./.:0 0/0:43 0/0:1 0/0:2 ./.:0 0/0:53 ./.:0 0/0:81 ./.:0 0/0:67 0/0:5 0/0:58 ./.:0 0/0:59 ./.:0 ./.:0 ./.:0 ./.:0 ./.:0 ./.:0 ./.:0 0/0:58 ./.:0 ./.:0 ./.:0 0/0:34 0/0:20 0/0:101 ./.:0 0/0:107 0/0:7 ./.:0 0/0:121 ./.:0 0/0:1 0/0:1 0/0:31 0/0:28 ./.:0 ./.:0 0/0:2 ./.:0 0/0:8 ./.:0 0/0:59 0/0:49 ./.:0 0/0:6 ./.:0 0/0:51 0/0:29 ./.:4 0/0:7 0/0:3 ./.:0 ./.:0 ./.:0 0/0:168 0/0:4 0/0:84 0/0:47 0/0:150


Source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/exon/snps/


Converted to:

1 20916748 1 0 0 4 0.20916748000000002 4 0 0 0 6 1 0 0 1 0 1 0 4 3 0 0 0 0 1 0 3 0 0 0 1 44 3 11 21 53 0 0 39 0 26 29 0 0 0 3 1 2 36 0 19 27 0 2 0 0 0 5 0 29 0
1 37098064 1 1 8 0 0.16181316 344 5 2 1 76 49 1 2 14 0 24 2 87 57 8 12 9 121 109 35 3 53 13 8 8 504 4 215 94 291
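
The seventh column of the two example rows is consistent with the base-pair distance to the previous locus multiplied by a recombination_rate of 1e-8: 20916748 times 1e-8 for the first row, and (37098064 - 20916748) times 1e-8 for the second. A minimal sketch of that calculation follows. It assumes a single chromosome and that the first locus is measured from position 0; the function name genetic_distances is hypothetical and is not taken from the pipeline.

```python
def genetic_distances(positions, recombination_rate=1e-8):
    """Return the genetic distance from each locus to the previous one.

    Assumption: the first locus is measured from position 0, which matches
    the first example row above. Positions must be sorted and on one chromosome.
    """
    distances = []
    previous = 0
    for pos in positions:
        # base pairs between loci, scaled by recombinations per base pair
        distances.append((pos - previous) * recombination_rate)
        previous = pos
    return distances


# Positions taken from the two converted example rows.
distances = genetic_distances([20916748, 37098064])
```

Floating-point multiplication explains the trailing digits in values such as 0.20916748000000002.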

Dependencies

Quickstart:

  • Create a file called config.ini to hold the runtime parameters (required)
  • Set these parameters in config.ini:

    [DEFAULT]
    allelefreq cutoff = 0.5
    min locus distance = 10
    recombination_rate = 1e-8
    minChrom = 1
    filename = vcfDownloadTestData/CEU.exon.2010_03.genotypes.vcf
    refPopulationNames = NA121,NA122,NA123
    samplePopulationNames = NA124,NA125,NA127,NA128,NA10,NA11,NA120
    refPopulationColumnIndices =
    samplePopulationColumnIndices =
  • Then run from the command line: python createAncestryHMM-Input.py
  • Enter a name for the output file at the prompt
  • This creates a TSV file with 136 lines
  • To view it in the terminal, type:
    cat <filename>
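
For reference, a file in this layout can be read with Python's standard configparser module. This is only a sketch of how the parameters could be loaded. The variable names are illustrative, and this is not necessarily how createAncestryHMM-Input.py reads config.ini.

```python
import configparser

# Example settings copied from the Quickstart above.
CONFIG_TEXT = """
[DEFAULT]
allelefreq cutoff = 0.5
min locus distance = 10
recombination_rate = 1e-8
minChrom = 1
filename = vcfDownloadTestData/CEU.exon.2010_03.genotypes.vcf
refPopulationNames = NA121,NA122,NA123
samplePopulationNames = NA124,NA125,NA127,NA128,NA10,NA11,NA120
refPopulationColumnIndices =
samplePopulationColumnIndices =
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)  # for a real file: parser.read("config.ini")
settings = parser["DEFAULT"]

# Keys may contain spaces; configparser matches them case-insensitively.
allele_freq_cutoff = settings.getfloat("allelefreq cutoff")
min_locus_distance = settings.getint("min locus distance")
recombination_rate = settings.getfloat("recombination_rate")
min_chrom = settings.getint("minChrom")

# Comma-separated name lists become Python lists.
ref_names = [n for n in settings["refPopulationNames"].split(",") if n]
sample_names = [n for n in settings["samplePopulationNames"].split(",") if n]

# A key left blank after '=' is read as an empty string.
ref_indices_raw = settings["refPopulationColumnIndices"]
```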

Documentation

To use this program, edit config.ini to describe your input file, which must be in VCF format.

[allelefreq cutoff]
- (Float) cutoff value for reference panel allele frequency calculation

[min locus distance]
- (Integer) minimum distance, in base pairs, between consecutive loci
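
One way to picture a minimum-distance filter is the sketch below. It assumes positions are sorted, lie on one chromosome, and that a locus is kept when it is at least the minimum distance past the last kept locus. Whether the pipeline uses "at least" or "more than" is not stated here, and the function name thin_loci is hypothetical.

```python
def thin_loci(positions, min_distance):
    """Keep sorted positions that lie at least min_distance bp after the last kept one."""
    kept = []
    for pos in positions:
        if not kept or pos - kept[-1] >= min_distance:
            kept.append(pos)
    return kept


# 1105366 and 1105411 come from the example VCF; the other positions are made up.
kept = thin_loci([1105366, 1105370, 1105411, 1105415, 1105500], min_distance=10)
```

With these inputs, the two made-up positions that sit 4 bp after a kept locus are dropped.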

[number of reference panels]
- (Integer) number of reference panel columns in the VCF.

[recombination_rate]
- (Float) estimated recombination rate, used to compute the recombination probability for the Ancestry_HMM input.
- Expressed as the average number of recombinations per base pair.

[minChrom]
- (Integer) the minimum number of chromosomes that must be present in the reference panel alleles for a locus to pass the threshold
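
The sketch below shows one plausible reading of this threshold: count the called alleles in a panel's genotype fields and require at least minChrom of them. The helper names are hypothetical, and whether the pipeline applies the check per panel or across all panels is an assumption.

```python
def count_alleles(genotype_fields):
    """Count reference (0) and alternate (1) alleles in VCF sample fields like '0/0:107'."""
    ref = alt = 0
    for field in genotype_fields:
        genotype = field.split(":")[0]  # GT is the first sub-field
        for allele in genotype.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1
            # '.' (missing) and any other allele codes are not counted
    return ref, alt


def passes_min_chrom(ref, alt, min_chrom):
    """True when the number of called chromosomes meets the threshold (assumed rule)."""
    return ref + alt >= min_chrom


# Sample fields in the GT:DP format used by the example VCF.
ref_count, alt_count = count_alleles(["./.:0", "0/0:107", "1/0:54", "0/0:25"])
site_ok = passes_min_chrom(ref_count, alt_count, min_chrom=1)
```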

[filename]
- (String, no quotes) Path to the VCF file on the local machine

At least one reference panel and one sample panel must be specified, each in one of two ways: by name or by column index.

[refPopulationNames]
- (At least 2 strings, no quotes, comma-separated) Names of reference panels in the VCF file.
- Reference panel columns should be named like guanaco0, guanaco1, guanaco2, etc.
- Example argument: guanaco,vicugna

[samplePopulationNames]
- (At least 2 strings, no quotes, comma-separated) Names of sample panels to be run in Ancestry_HMM.
- Sample panel columns should be named like llama1, llama2, llama3, etc.
- Example argument: llama,alpaca
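
The naming examples suggest that each configured name acts as a prefix of the VCF sample column names (guanaco matching guanaco0, guanaco1, and so on). The sketch below illustrates that reading. Prefix matching, 0-based column numbering, and the function name columns_matching are all assumptions, not confirmed behavior of the pipeline.

```python
def columns_matching(header_fields, prefixes):
    """Return {prefix: [column indices]} for sample columns whose name starts with prefix.

    Only columns after the nine fixed VCF columns (#CHROM ... FORMAT) are checked.
    Indices are 0-based positions in the header line.
    """
    matches = {prefix: [] for prefix in prefixes}
    for index, name in enumerate(header_fields[9:], start=9):
        for prefix in prefixes:
            if name.startswith(prefix):
                matches[prefix].append(index)
    return matches


# A made-up header using the naming scheme from the documentation.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "guanaco0", "guanaco1", "vicugna0", "llama1", "llama2", "alpaca1"]
ref_columns = columns_matching(header, ["guanaco", "vicugna"])
sample_columns = columns_matching(header, ["llama", "alpaca"])
```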

[refPopulationColumnIndices]
- Column indices of the reference panels in the VCF file, one range per panel.
- [Syntax] beginning,end;beginning,end;...
- Example argument: 46,49;50,58;59,64

[samplePopulationColumnIndices]
- Column indices of the sample panels to be run in Ancestry_HMM, one range per panel.
- [Syntax] beginning,end;beginning,end;...
- Example argument: 69;70,82;83,98
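
The beginning,end;beginning,end;... syntax can be parsed as sketched below. The function name parse_column_ranges is hypothetical, a lone number such as 69 is assumed to mean a single-column range, and whether the end index is inclusive in the pipeline is not stated here.

```python
def parse_column_ranges(spec):
    """Parse 'beginning,end;beginning,end;...' into a list of (beginning, end) tuples.

    A single number such as '69' is treated as a one-column range (69, 69).
    """
    ranges = []
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate an empty setting or a trailing ';'
        bounds = [int(value) for value in part.split(",")]
        if len(bounds) == 1:
            bounds.append(bounds[0])
        if len(bounds) != 2:
            raise ValueError(f"expected 'beginning,end' but got {part!r}")
        ranges.append((bounds[0], bounds[1]))
    return ranges


# The two example arguments from the documentation above.
ref_ranges = parse_column_ranges("46,49;50,58;59,64")
sample_ranges = parse_column_ranges("69;70,82;83,98")
```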