Written by Erick C. Castelli, erick.castelli@unesp.br
Current version: 0.8.2
Phasex is software written in C++ to automate and compare multiple haplotyping runs using Shapeit4 and/or Beagle 4.1.
It has been used mainly for haplotyping HLA and KIR alleles in different studies, but it can also be used for other genes. It is suitable for datasets with thousands of samples but a limited number of variants (e.g., 5000 samples, 3000 variants). Phasex automates parallel runs and compares the results, fixing the haplotypes whose concordance rate meets a threshold before the subsequent runs.
First, you need Boost to compile phasex. On Ubuntu, we recommend using "sudo apt-get install libboost-all-dev". On macOS, we recommend using Homebrew: "brew install boost".
Second, we use cmake to generate the makefile, so be sure that you have cmake installed.
Download this git repository (using "git clone https://github.com/erickcastelli/phasex"), and follow these instructions:
- enter folder /build
- type "cmake ../source/"
- if everything worked, type "make"
- the binary will be placed in the folder /build
Finally, you need a working copy of Shapeit4 (https://odelaneau.github.io/shapeit4/) and Beagle 4.1 (https://faculty.washington.edu/browning/beagle/b4_1.html)
Please cite Souza et al. HLA-C genetic diversity and evolutionary insights in two samples from Brazil and Benin. HLA. 2020 Oct;96(4):468-486. doi: 10.1111/tan.13996
PHASEX uses Shapeit4 to phase bi-allelic variants (considering the PS field) and Beagle 4.1 to phase multiallelic variants, using the scaffold inferred by Shapeit4.
Assuming that you have moved the binary to /usr/local/bin or another directory in the PATH, follow these instructions:
phasex
You will see the main functions
phase-ps: to phase variants considering Phase Sets (PS)
recreate: to recreate the final VCF in case you have edited the final results
hp-ps: to recode GATK ReadBackedPhasing format to PS format
phasex hp-ps
You will see the options for the hp-ps function. This method converts a VCF phased with GATK ReadBackedPhasing to the PS format, making it compatible with phasex. If you use WhatsHap, this step is not necessary, provided that you used "--tag=PS" when running WhatsHap.
- Please note that there is a Perl script in /support to parallelize GATK ReadBackedPhasing runs and speed up the process.
To use this function, you must type:
phasex hp-ps vcf=THE_VCF_FILE output=THE_NEW_VCF_FILE
Do not use spaces before or after "=". The --quiet flag prevents the program from printing any comments.
phasex phase-ps
You will see all the options for this function. This is a typical phasex run:
phasex phase-ps vcf=VCF_FILE_IN_PS_FORMAT iterations=10 replicates=20 shapeit=SHAPEIT4_BINARY beagle=BEAGLE4_JAR
This configuration tells PHASEX to run 20 parallel haplotyping runs (replicates), fix the concordant haplotypes using the threshold value (95% of the runs), and repeat these steps 10 times (iterations). The output folder will be placed next to the input VCF unless modified with "output=".
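When many runs are launched from a script, it can help to assemble the command programmatically so that every option stays a single "key=value" token. The sketch below is only an illustration: the helper name build_phase_ps_command and all file paths are placeholders, not part of phasex, and the only options it uses are the ones shown in the typical run above plus "output=".

```python
from typing import Dict, List, Optional


def build_phase_ps_command(
    vcf: str,
    shapeit: str,
    beagle: str,
    iterations: int = 10,
    replicates: int = 20,
    extra: Optional[Dict[str, str]] = None,
    phasex_binary: str = "phasex",
) -> List[str]:
    """Assemble the argument list for a `phasex phase-ps` run (hypothetical helper)."""
    options = {
        "vcf": vcf,
        "iterations": str(iterations),
        "replicates": str(replicates),
        "shapeit": shapeit,
        "beagle": beagle,
    }
    options.update(extra or {})
    # Each option becomes one "key=value" token, with no spaces around "=".
    return [phasex_binary, "phase-ps"] + [f"{key}={value}" for key, value in options.items()]


command = build_phase_ps_command(
    vcf="cohort.ps.vcf",                   # placeholder path
    shapeit="/opt/shapeit4/bin/shapeit4",  # placeholder path
    beagle="/opt/beagle/beagle.jar",       # placeholder path
    extra={"output": "phasex_out"},        # placeholder output folder
)
print(" ".join(command))
```

The resulting list can be passed directly to subprocess.run(command, check=True) once phasex is in the PATH.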
The threshold for fixing a haplotype is 95% (the default), i.e., a haplotype is fixed as true if 19 runs (20 replicates * 0.95) indicate the same haplotype for a sample. You can modify this using "threshold=".
Phasex uses half of the number of cores of the system unless modified by "threads=".
Phasex will perform 10 iterations, i.e., 10 steps of 20 parallel runs and haplotype comparison. You can modify this using "iterations=" and "replicates=".
After the final iteration (in this case, the 10th iteration), phasex will output the final haplotypes, considering only samples in which the same haplotype was inferred in at least 70% of the replicates in the final run. You can modify this using "select=".
Option scheme is passed to Shapeit4. By default, this scheme is 15b,1p,1b,1p,1b,1p,1b,1p,1b,1p,1b,1p,15m. You can modify this using, for example, scheme="10b,1p,1b,1p,1b,1p,1b,1p,10m".
Option shapeit_others is used to indicate other shapeit4 parameters.
Option map is used to indicate a genetic map for Shapeit4 (not mandatory). Please download these maps at the Shapeit4 website.
Flag --quiet forces PHASEX not to output any progress or comment.
Flag --biallelic forces PHASEX to deal only with biallelic variants, using only Shapeit4.
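The arithmetic behind the "threshold=" and "select=" options can be checked with a short calculation. This is a sketch under assumptions: it supposes that a proportion is met when the number of concordant runs is at least replicates * proportion, rounded up. The README only gives the 19-of-20 example, so the exact rounding and comparison used inside phasex may differ. The function name is hypothetical.

```python
import math
from fractions import Fraction


def min_concordant_runs(replicates: int, proportion: str) -> int:
    """Smallest number of runs that reaches `proportion` of `replicates`.

    Assumption: "at least" semantics with rounding up; not confirmed
    against the phasex source code.
    """
    # Fraction parses the decimal string exactly, avoiding float rounding.
    return math.ceil(replicates * Fraction(proportion))


print(min_concordant_runs(20, "0.95"))  # default fixing threshold -> 19
print(min_concordant_runs(20, "0.70"))  # default select threshold -> 14
```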
The output structure is as follows:
phasex.log: Records all the parameters and some quality-control information
results.vcf: This is the final PHASED VCF file. Only the samples passing the select threshold are included in this file (by default: 70% of the runs presenting the same haplotype in the final run).
results.freq: The haplotypes, their global count, and frequency
sample_list.txt: The list of samples that passed the SELECT threshold.
/shapeit : the shapeit results for each iteration, and the final results in "results.txt"
/shapeit/results.txt: the final results when using shapeit4. This file presents the following format:
Sample h1 h2 Freq(1) Info(1) Freq(2) Info(2) Freq(n) Info(n) Status
- Sample: the sample id
- h1: first haplotype
- h2: second haplotype
- Freq(n): proportion of parallel runs indicating this pair of haplotypes in iteration N
- Info(n): "-" if under the threshold, "def" if fixed for the next iteration
- Status: "-" if this haplotype pair is under the SELECT threshold, "pass" if it is above the SELECT threshold. Only the samples with "pass" are included in the final VCF.
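The column layout above can be read with a few lines of Python. This is a minimal sketch, assuming the file is whitespace-delimited, that Status is always the last column, and that the Freq/Info columns alternate as in the header. The example rows, sample names, and haplotype labels are invented for illustration and do not come from a real phasex run.

```python
from typing import Dict, List


def parse_results(text: str) -> List[Dict[str, object]]:
    """Parse results.txt-style rows (assumed whitespace-delimited)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 4 or fields[0] == "Sample":
            continue
        per_iteration = fields[3:-1]  # Freq(1) Info(1) Freq(2) Info(2) ...
        rows.append({
            "sample": fields[0],
            "h1": fields[1],
            "h2": fields[2],
            "freqs": per_iteration[0::2],
            "infos": per_iteration[1::2],
            "status": fields[-1],
        })
    return rows


# Made-up example content following the documented column order.
example = """Sample h1 h2 Freq(1) Info(1) Freq(2) Info(2) Status
S1 h0001 h0002 0.90 - 1.00 def pass
S2 h0001 h0003 0.55 - 0.60 - -
"""

rows = parse_results(example)
passing = [row["sample"] for row in rows if row["status"] == "pass"]
print(passing)
```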
/beagle : the beagle results for each iteration, and the final results in "results.txt"
/beagle/results.txt: the final results when using Beagle. Same format as for Shapeit.
/source : the files used for the haplotyping procedure.
The file /shapeit/results.txt contains all the haplotypes detected for each sample when phasing only bi-allelic variants, and /beagle/results.txt contains them when phasing multi-allelic variants. You can edit the relevant file to exclude a sample by replacing "pass" under Status with "-", or to force the inclusion of a sample by replacing "-" with "pass". If you have edited this file, you should run the following command to recreate the final VCF file:
phasex recreate input=PATH_TO_THE_OUTPUT_PHASEX_FOLDER
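The manual edit described above can also be scripted. The following sketch changes only the Status field of one row and leaves every other character of the file untouched. It assumes whitespace-delimited rows with Status as the last column; the function name set_status and the example row are hypothetical.

```python
def set_status(text: str, sample: str, h1: str, h2: str, status: str) -> str:
    """Return `text` with the Status of the matching row set to `status`."""
    if status not in ("pass", "-"):
        raise ValueError('status must be "pass" or "-"')
    edited = []
    for line in text.splitlines(keepends=True):
        fields = line.split()
        if len(fields) >= 4 and fields[:3] == [sample, h1, h2]:
            body = line.rstrip()                      # row without trailing newline
            trailing = line[len(body):]               # the newline, if any
            head = body[: len(body) - len(fields[-1])]  # everything before Status
            line = head + status + trailing
        edited.append(line)
    return "".join(edited)


# Made-up row: force the inclusion of sample S1.
original = "S1\th1a\th1b\t0.60\t-\t-\n"
print(set_status(original, "S1", "h1a", "h1b", "pass"), end="")
```

After writing the edited text back to results.txt, the "phasex recreate" command shown above rebuilds the final VCF.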
- Castelli et al. Immunogenetics of resistance to SARS-CoV-2 infection in discordant couples. MedRxiv (doi 10.1101/2021.04.21.21255872)
- Naslavsky et al. Whole-genome sequencing of 1,171 elderly admixed individuals from the largest Latin American metropolis (São Paulo, Brazil). BioRxiv (doi 10.1101/2020.09.15.298026)
- Sonon et al. Peripheral spectrum neurological disorder after arbovirus infection is associated with HLA-F variants among Northeastern Brazilians. Infect Genet Evol. 2021 Apr 8;92:104855. doi: 10.1016/j.meegid.2021.104855
- Sonon et al. Human leukocyte antigen (HLA)-F and -G gene polymorphisms and haplotypes are associated with malaria susceptibility in the Beninese Toffin children. Infect Genet Evol. 2021 Mar 27;92:104828. doi: 10.1016/j.meegid.2021.104828
- Weiss et al. KIR2DL4 genetic diversity in a Brazilian population sample: implications for transcription regulation and protein diversity in samples with different ancestry backgrounds. Immunogenetics. 2021 Jun;73(3):227-241. doi: 10.1007/s00251-021-01206-9
- Souza et al. HLA-C genetic diversity and evolutionary insights in two samples from Brazil and Benin. HLA. 2020 Oct;96(4):468-486. doi: 10.1111/tan.13996
- Ramos et al. A large familial cluster and sporadic cases of frontal fibrosing alopecia in Brazil reinforce known human leucocyte antigen (HLA) associations and indicate new HLA susceptibility haplotypes. J Eur Acad Dermatol Venereol. 2020 Oct;34(10):2409-2413. doi: 10.1111/jdv.16629
- and many others...