Nextflow pipeline (DSL2) that runs fstatistics with AdmixTools, taking different input types (VCF, PLINK, EIGENSTRAT).
- Consider this tree to set popA - popD for qpDstat:
- Consider this tree to set popA - popC for qp3Pop:
-v 0.1.0
- Supports compressed VCF file as input.
- Supports PLINK files as input.
- Supports EIGENSTRAT files as input.
- Option to run qpDstat for D or f4.
- Option to run qp3Pop for inbreed or outgroup.
- Makes all possible combination of treest for testing.
- Scalability and reproducibility via a Nextflow-based framework.
- Ubuntu 20.04.4 LTS
- macOS
Requirement | Version | Required Commands * |
---|---|---|
bcftools | 1.17 | reheader,view |
Nextflow | 23.04.2.5870 | nextflow |
plink | 1.19 ) | plink |
Poseidon | v1.4.0.2 | trident |
AdmixTools | v651 and v980 | qp3Pop and qpDstat |
* These commands must be accessible from your $PATH
(i.e. you should be able to invoke them from your command line).
Download nf-haplotype-selection from Github repository:
git clone https://github.com/jbv2/run_admixtools.git
To test run_admixtools
execution using test data, run:
nextflow run main.nf -profile <test_qpDstat,test_qp3Pop>
This pipeline can use nf-core Institutional profiles. For DAG EVA people, please use:
nextflow run main.nf -profile eva,archgen,<test_qpDstat,test_qp3Pop>
To run run_admixtools
go to the pipeline directory and execute:
nextflow run main.nf \
--inputVCF "<path to VCF input>" \ # Needs --input_type "vcf" --samples and --half_call
--samples <Path to file with new sample names> \ # txt file with id famid_id needed when input is VCF.
--half_call "<h>" \ # Plink option when VCF to PLINK on how to treat how to deal with '0/.'. See https://www.cog-genomics.org/plink/1.9/input
--inputbed "<path to PLINK bed>" \ # Needs input_type = "plink"
--inputgeno "<Path to EIGENSTRAT geno>" \ # Needs input_type = "eigenstrat"
--input_type <'vcf','plink','eigenstrat'> \ # Select your input type
--popA "<Vector for population(s) in A position>" \ # Vector. Example: "CEU YRI"
--popB "<Vector for population(s) in B position>" \ # Vector. Example: "CEU YRI"
--popC "<Vector for population(s) in C position>" \ # Vector. Example: "CEU YRI"
--popD "<Vector for population(s) in D position>" \ # Vector. Example: "CEU YRI". Not needed in qp3Pop.
--run_qpDstat <'true','false'> \ # Can not be used with --run_qp3Pop. Needs --f4mode
--f4mode <"YES","NO"> \ # qpDstat option
--run_qp3Pop <'true','false'> \ # Can not be used with --run_qpDstat. Needs --inbreed.
--inbreed <"YES","NO"> \ # qp3Pop option
--outdir <path to results> \ # Outdir
-
If you have one input type, you don't need to provide the others.
-
If you select PLINK as input,
.fam
and.bim
must be in the same dir as the.bed
file. -
If you select EIGENSTRAT as input,
.ind
and.snp
must be in the same dir as the.geno
file. -
If you select VCF as input, provide:
- --samples "Two columns text file with the structure: ID FAMID_ID". See
test/reference/new_sample_names.txt
for deatils. - --half_call: Plink option when VCF to PLINK on how to treat how to deal with '0/.'. Default: haploid.
- 'haploid'/'h': Treat half-calls as haploid/homozygous (the PLINK 1 file format does not distinguish between the two). This maximizes similarity between the VCF and BCF2 parsers.
- 'missing'/'m': Treat half-calls as missing.
- 'reference'/'r': Treat the missing part as reference.
- --samples "Two columns text file with the structure: ID FAMID_ID". See
You can also copy/rename test/my_project.config
, replace with your information and run it with:
nextflow run main.nf -c </PATH/TO/YOUR/my_projecy.config> -profile my_project_f4
It can be as easy as you want!
- If you have a VCF to start:
- A
VCF
file compressed with a.vcf.gz
extension. AVCF
file mainly contains meta-information lines, a header and data lines with information about each position. The header names the eigth mandatory columnsCHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
.
For more information about the VCF format, please go to the next link: Variant Call Format
Example line(s):
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3
21 9411773 rs867796868 C T . PASS AC=0;AN=182 GT 0|0 0|0 0|0 0|0
21 10979896 rs117219976 A G . PASS AC=1;AN=182 GT 0|1 0|0 0|0
22 17176028 rs2845365 T G . PASS AC=58;AN=182 GT 1|0 0|0 0|1
- A two columns file, separated by space with old_name new_name.
sample1 FAM1_sample1
sample2 FAM2_sample2
sample3 FAM3_sample3
- If you have bfile (PLINK) to start:
- A binary
.bed
. - A
.bim
file.
Example line(s):
1 rs74512038 0 778597 T C
1 rs553642122 0 790021 T C
1 rs4951859 0 794299 G C
...
- A
.fam
file.
Example line(s):
FAM1 sample1 0 0 0 -9
FAM2 sample2 0 0 0 -9
FAM3 sample3 0 0 0 -9
...
- If you have EIGENSTRAT to start:
- A binary
.geno
. - A
.snp
file.
Example line(s):
rs74512038 1 0 778597 T C
rs553642122 1 0 790021 T C
rs4951859 1 0 794299 G C
...
- A
.ind
file.
Example line(s):
sample1 U FAM1
sample2 U FAM2
sample3 U FAM3
...
For qp3Pop
- A
.OUT.log
file.
Example line(s):
qp3Pop: parameter file: test.parfile
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
genotypename: test.geno
snpname: test.snp
indivname: test.ind
popfilename: test.ftest.txt
inbreed: YES
## qp3Pop version: 651
inbreed set YES
test.snp: genetic distance set from physical distance
*** warning: first snp . is number. perhaps you are using .map format
number of triples 3
nplist: 3
snps: 59920
number of blocks for block jackknife: 9
Source 1 Source 2 Target f_3 std. err Z SNPs
result: CHB PEL Maya 0.001641 0.011348 0.145 53604
- A
.ftest.txt
file with combinations.
Example line(s):
CHB PEL Maya
CHB PEL Mixe
CHB PEL Nahua
- A
.parfile
file.
Example line(s):
genotypename: test.geno
snpname: test.snp
indivname: test.ind
popfilename: test.ftest.txt
inbreed: YES
For qpDstat
- A
.OUT.log
file.
Example line(s):
qpDstat: parameter file: test.parfile
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
genotypename: test.geno
snpname: test.snp
indivname: test.ind
popfilename: test.ftest.txt
printsd: YES
f4mode: YES
## qpDstat version: 980
inbreed set NO
test.snp: genetic distance set from physical distance
*** warning: first snp . is number. perhaps you are using .map format
number of quadruples 6
0 CHB 1
1 PEL 4
2 Maya 4
3 Mixe 5
4 Nahua 13
jackknife block size: 0.050
snps: 59920 indivs: 27
number of blocks for block jackknife: 9
nrows, ncols: 27 59920
result: CHB PEL Maya Mixe -0.000913 0.003262 -0.280 3756 3811 59905
result: CHB PEL Maya Nahua 0.002345 0.002936 0.799 3875 3734 59905
result: CHB PEL Mixe Maya 0.000913 0.003262 0.280 3811 3756 59905
- A
.result.tsv
with header and separated by TAB.
PopA PopB PopC PopD f4 stderr Zscore
CHB PEL Maya Mixe -0.000913 0.003262 -0.280
CHB PEL Maya Nahua 0.002345 0.002936 0.799
CHB PEL Mixe Maya 0.000913 0.003262 0.280
CHB PEL Mixe Nahua 0.003258 0.001824 1.787
CHB PEL Nahua Maya -0.002345 0.002936 -0.799
CHB PEL Nahua Mixe -0.003258 0.001824 -1.787
- A
.ftest.txt
file with combinations.
Example line(s):
CHB PEL Maya Mixe
CHB PEL Maya Nahua
CHB PEL Mixe Maya
CHB PEL Mixe Nahua
CHB PEL Nahua Maya
CHB PEL Nahua Mixe
- A
.parfile
file.
Example line(s):
genotypename: test.geno
snpname: test.snp
indivname: test.ind
popfilename: test.ftest.txt
printsd: YES
f4mode: YES
Under the hood run_admixtools
uses some coding tools, please include the following ciations in your work:
- Narasimhan, V., Danecek, P., Scally, A., Xue, Y., Tyler-Smith, C., & Durbin, R. (2016). BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics, 32(11), 1749-1751.
- Patterson, Nick, et al. "Ancient admixture in human history." Genetics 192.3 (2012): 1065-1093.
- Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
- https://github.com/poseidon-framework/poseidon-hs
If you have questions, requests, or bugs to report, please email judith.vballesteros@gmail.com
Judith Ballesteros-Villascán judith.vballesteros@gmail.com