Python scripts to convert phased genotype data from SHAPEIT2 format (.haps/.sample) to plain text PLINK format (.ped/.map) while keeping the phasing information intact.
- The first step is to convert from SHAPEIT2 format (.haps/.sample) to IMPUTE2 format (.haps/.legend/.sample).
- The second step is to convert from IMPUTE2 format (.haps/.legend/.sample) to plain text PLINK format (.ped/.map). This step uses the SHAPEIT2 .sample file from the first step and the cut command in UNIX/Linux/Mac OS X.
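To make the first step more concrete, here is a minimal sketch of what the SHAPEIT2 to IMPUTE2 conversion amounts to. It is not the code of convert_shapeit2_to_impute2.py. It assumes the usual SHAPEIT2 .haps layout (chromosome, SNP ID, position, allele 0, allele 1, then two 0/1 columns per individual) and an IMPUTE2 .legend header of "id position a0 a1". The function names are hypothetical.

```python
def split_shapeit2_line(line):
    """Split one SHAPEIT2 .haps line into an IMPUTE2 legend row and haps row."""
    fields = line.split()
    # Assumed layout: chromosome, SNP ID, position, allele 0, allele 1,
    # followed by two 0/1 haplotype columns per individual.
    chrom, snp_id, position, a0, a1 = fields[:5]
    haplotypes = fields[5:]
    if len(haplotypes) % 2 != 0:
        raise ValueError("expected two haplotype columns per individual")
    legend_row = " ".join([snp_id, position, a0, a1])
    haps_row = " ".join(haplotypes)
    return legend_row, haps_row


def convert_lines(shapeit2_lines):
    """Return (legend_lines, haps_lines) for an iterable of SHAPEIT2 .haps lines."""
    legend = ["id position a0 a1"]  # assumed header; adjust if your tools differ
    haps = []
    for line in shapeit2_lines:
        if not line.strip():
            continue  # skip blank lines
        legend_row, haps_row = split_shapeit2_line(line)
        legend.append(legend_row)
        haps.append(haps_row)
    return legend, haps
```

The haplotype columns are copied through unchanged, so the phase of each individual is untouched by this step.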
The script to perform the first step, convert_shapeit2_to_impute2.py, is based on a script from PyPedia. The second script, convert_impute2_to_PEDMAP.py, was written by me because the ones from PyPedia (and other places) start using swap space on large datasets and therefore become excruciatingly slow; my script avoids swap and performs the conversion in memory for any number of individuals.
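For illustration only, the second step can be pictured as a transposition: IMPUTE2 .haps files have one row per SNP, while a .ped file has one row per individual. The sketch below is not the logic of convert_impute2_to_PEDMAP.py. It assumes that the 0/1 codes are translated to the a0/a1 alleles from the .legend file and that the two alleles of each genotype are written in haplotype order, which is how the phase would be preserved. The function names and the placeholder values for parents, sex and phenotype are assumptions.

```python
def haps_to_ped_genotypes(haps_rows, legend_alleles):
    """Return one flat list of allele letters per individual.

    haps_rows: one list of "0"/"1" strings per SNP (two entries per individual).
    legend_alleles: one (a0, a1) tuple per SNP, in the same order.
    """
    n_individuals = len(haps_rows[0]) // 2 if haps_rows else 0
    genotypes = [[] for _ in range(n_individuals)]
    for row, alleles in zip(haps_rows, legend_alleles):
        for ind in range(n_individuals):
            # First haplotype allele, then second: the order encodes the phase.
            genotypes[ind].append(alleles[int(row[2 * ind])])
            genotypes[ind].append(alleles[int(row[2 * ind + 1])])
    return genotypes


def ped_line(family_id, individual_id, genotype):
    """Format one .ped row with unknown parents, sex and phenotype."""
    # Assumed placeholders: 0 for father, mother and sex, -9 for phenotype.
    return " ".join([family_id, individual_id, "0", "0", "0", "-9"] + genotype)
```

Everything is held in ordinary Python lists here, so this sketch makes no claim about the memory behaviour of the real script.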
Data has to be separated by chromosome. Then run a for ((i = 1; i <= 22; i++)); do ...; done loop and perform the following steps inside the loop:
python convert_shapeit2_to_impute2.py chr${i}.haps chr${i}.sample temp${i}.haps temp${i}.legend temp${i}.sample
python convert_impute2_to_PEDMAP.py temp${i}.haps temp${i}.legend chr${i}.sample chr${i} ${i}
rm temp${i}.*
You can comment out the third step (the rm command) if you would like to keep the output in IMPUTE2 format as well.
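If you prefer to drive the per-chromosome loop from Python instead of the shell, a rough equivalent could look like the sketch below. The helper names and the keep_impute2 flag are hypothetical. Unlike rm temp${i}.*, the cleanup here removes only the three temporary files named in the first command.

```python
import subprocess


def commands_for_chromosome(i, keep_impute2=False):
    """Build the per-chromosome commands as argument lists."""
    temp = f"temp{i}"
    cmds = [
        ["python", "convert_shapeit2_to_impute2.py",
         f"chr{i}.haps", f"chr{i}.sample",
         f"{temp}.haps", f"{temp}.legend", f"{temp}.sample"],
        ["python", "convert_impute2_to_PEDMAP.py",
         f"{temp}.haps", f"{temp}.legend", f"chr{i}.sample", f"chr{i}", str(i)],
    ]
    if not keep_impute2:
        # Explicit file names, because no shell is involved to expand temp${i}.*
        cmds.append(["rm", f"{temp}.haps", f"{temp}.legend", f"{temp}.sample"])
    return cmds


def run_all(chromosomes=range(1, 23), keep_impute2=False):
    """Run every command for every chromosome, stopping at the first failure."""
    for i in chromosomes:
        for cmd in commands_for_chromosome(i, keep_impute2):
            subprocess.run(cmd, check=True)
```

Calling run_all() from the directory that holds the scripts and the chr*.haps/chr*.sample files processes chromosomes 1 to 22; run_all(keep_impute2=True) skips the cleanup so the IMPUTE2 files stay on disk.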
You might need to edit convert_impute2_to_PEDMAP.py to make it consistent with your phased SHAPEIT2 .sample file in terms of the number of columns.
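Before editing the script, it can help to check how many columns your .sample file actually has. The helper below is a small sketch with a hypothetical name. It assumes the usual SHAPEIT2 .sample layout of two header lines followed by one whitespace-separated row per individual.

```python
def sample_column_count(sample_lines):
    """Return the column count of a SHAPEIT2 .sample file, checking every row."""
    rows = [line.split() for line in sample_lines if line.strip()]
    if len(rows) < 2:
        raise ValueError("expected two header lines")
    n_columns = len(rows[0])
    for row_number, row in enumerate(rows, start=1):
        if len(row) != n_columns:
            raise ValueError(
                f"row {row_number} has {len(row)} columns, expected {n_columns}"
            )
    return n_columns
```

For example, with open("chr1.sample") as handle: print(sample_column_count(handle)) prints the number of columns for chromosome 1, which you can then compare with what convert_impute2_to_PEDMAP.py expects.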