liftOver is a commandline tool for Linux and Mac OSX, which is used to bring genomic data to the same reference build. It is provided by the University of California, Santa Cruz (UCSC) as part of their Genome Browser.
liftOver has three use cases:
- Converting the genome position from one genome assembly to another
- Converting a dbSNP rs ID from one build to another
- Converting both the genome position, and dbSNP rs ID over different versions.
The liftOver does not work natively with genotype data stored in
the commonly used plink data formats. The Abecasis Lab at the
University of Michigan has developed a rudimentary wrapper in the
form of a python script to work with plink data stored in the PED
and
MAP
formats.
This repository provides a more polished commandline tool for this task: liftOverPlink
To use this tool, you will need the liftOver binary along with the
appropriate chain files
which tell liftOver how to convert
between different genome builds. These can be found at the
UCSC Genome Bioinformatics Downloads Page, by following the
"liftOver" links for your organism of choice.
liftOverPlink is modified from the liftMap.py script provided by the Abecasis Lab to:
- Take a user specified chain file.
- To look for liftOver in
$PATH
or a user specified location, rather than the original developer's specific download location. - Provide usage documentation and help messages.
- To clean up the left over
BED
files to avoid confusion.
The core functionality and algorithms remain identical to liftMap.py.
usage: liftOverPlink.py [-h] -m MAPFILE [-p PEDFILE] [-d DATFILE] -o PREFIX -c
CHAINFILE [-e LIFTOVEREXECUTABLE]
liftOverPlink.py converts genotype data stored in plink's PED+MAP format from
one genome build to another, using liftOver.
optional arguments:
-h, --help show this help message and exit
-m MAPFILE, --map MAPFILE
The plink MAP file to `liftOver`.
-p PEDFILE, --ped PEDFILE
Optionally remove "unlifted SNPs" from the plink PED
file after running `liftOver`.
-d DATFILE, --dat DATFILE
Optionally remove "unlifted SNPs" from a data file
containing a list of SNPs (e.g. for --exclude or
--include in `plink`)
-o PREFIX, --out PREFIX
The prefix to give to the output files.
-c CHAINFILE, --chain CHAINFILE
The location of the chain file to provide to
`liftOver`.
-e LIFTOVEREXECUTABLE, --bin LIFTOVEREXECUTABLE
The location of the `liftOver` executable.
liftOverPlink is simply a wrapper around liftOver;
it works by converting the the plink MAP
files to the BED
format
liftOver expects (Note: this is completely unrelated to plink's
BED
format!!). liftOver then updates the information in this BED
file using the information in the provided chain file
, and then
liftOverPlink converts this BED
file back to a
MAP
file.
During the conversion process, some SNPs may fail to be converted
between builds. There are a number of reasons a SNP may fail to be
"lifted", which boil down to that SNP not existing in the genome build
you are converting to. These are referred to as "unlifted" SNPs and are
written out to a file with the .unlifted
extension, with the prefix
specified for the output files by the --out
argument.
liftOverPlink can also optionally remove unlifted
SNPs from the PED
file (this is highly recommended), and can update
any other file containing a list of rs IDs (passed in to the --dat
argument).
If the liftOver executable is not in your $PATH
then you can
optionally specify a location for liftOverPlink to
call it from using the --bin
argument.
In the case of lifting from build hg19 to hg38, some of the SNPs that
get lifted have "bad" Chromosomes, like "22_KI270879v1_alt" for example.
rmBadLifts is designed to handle this case: it removes
these SNPs from the MAP
file, and creates a file (--log
) containing
a list of these rsIDs:
Usage
usage: rmBadLifts.py [-h] -m MAPFILE -o OUTFILE -l LOGFILE
rmBadLifts.py goes through a plink MAP file, identifies and removes SNPs not
on chromosomes 1 to 22
optional arguments:
-h, --help show this help message and exit
-m MAPFILE, --map MAPFILE
MAP file to process.
-o OUTFILE, --out OUTFILE
New MAP file to output.
-l LOGFILE, --log LOGFILE
File to output the bad rsIDs to.
Workflow
The workflow when handling these SNPs is complicated:
- Create a "lifted"
MAP
file. - Remove bad lifted SNPs from that
MAP
file. - Generate a good
PED
file by excluding "unlifted", and bad "lifted" SNPs. - Combine the fixed
PED
file, with the fixedMAP
file.
On Linux this would look something like:
python liftOverPlink.py --map genotypes --out lifted --chain hg19ToHg38.over.chain.gz
python rmBadLifts.py --map lifted.map --out good_lifted.map --log bad_lifted.dat
cut -f 2 bad_lifted.dat > to_exclude.dat
cut -f 4 lifted.unlifted | sed "/^#/d" >> to_exclude.dat
# Note: this will clobber the lifted MAP file generated by `liftOverPlink`:
plink --file genotypes --recode --out lifted --exclude bad_lifted.dat
plink --ped lifted.ped --map good_lifted.map --recode --out final