/liftOverPlink

A wrapper for liftOver for converting plink genotype data between different genome reference builds

Primary LanguagePythonGNU General Public License v2.0GPL-2.0

liftOverPlink

liftOver is a commandline tool for Linux and Mac OSX, which is used to bring genomic data to the same reference build. It is provided by the University of California, Santa Cruz (UCSC) as part of their Genome Browser.

liftOver has three use cases:

  1. Converting the genome position from one genome assembly to another
  2. Converting a dbSNP rs ID from one build to another
  3. Converting both the genome position, and dbSNP rs ID over different versions.

The liftOver does not work natively with genotype data stored in the commonly used plink data formats. The Abecasis Lab at the University of Michigan has developed a rudimentary wrapper in the form of a python script to work with plink data stored in the PED and MAP formats.

This repository provides a more polished commandline tool for this task: liftOverPlink

To use this tool, you will need the liftOver binary along with the appropriate chain files which tell liftOver how to convert between different genome builds. These can be found at the UCSC Genome Bioinformatics Downloads Page, by following the "liftOver" links for your organism of choice.


Change Log

liftOverPlink is modified from the liftMap.py script provided by the Abecasis Lab to:

  • Take a user specified chain file.
  • To look for liftOver in $PATH or a user specified location, rather than the original developer's specific download location.
  • Provide usage documentation and help messages.
  • To clean up the left over BED files to avoid confusion.

The core functionality and algorithms remain identical to liftMap.py.


Usage

usage: liftOverPlink.py [-h] -m MAPFILE [-p PEDFILE] [-d DATFILE] -o PREFIX -c
                        CHAINFILE [-e LIFTOVEREXECUTABLE]

liftOverPlink.py converts genotype data stored in plink's PED+MAP format from
one genome build to another, using liftOver.

optional arguments:
  -h, --help            show this help message and exit
  -m MAPFILE, --map MAPFILE
                        The plink MAP file to `liftOver`.
  -p PEDFILE, --ped PEDFILE
                        Optionally remove "unlifted SNPs" from the plink PED
                        file after running `liftOver`.
  -d DATFILE, --dat DATFILE
                        Optionally remove "unlifted SNPs" from a data file
                        containing a list of SNPs (e.g. for --exclude or
                        --include in `plink`)
  -o PREFIX, --out PREFIX
                        The prefix to give to the output files.
  -c CHAINFILE, --chain CHAINFILE
                        The location of the chain file to provide to
                        `liftOver`.
  -e LIFTOVEREXECUTABLE, --bin LIFTOVEREXECUTABLE
                        The location of the `liftOver` executable.

Details

liftOverPlink is simply a wrapper around liftOver; it works by converting the the plink MAP files to the BED format liftOver expects (Note: this is completely unrelated to plink's BED format!!). liftOver then updates the information in this BED file using the information in the provided chain file, and then liftOverPlink converts this BED file back to a MAP file.

During the conversion process, some SNPs may fail to be converted between builds. There are a number of reasons a SNP may fail to be "lifted", which boil down to that SNP not existing in the genome build you are converting to. These are referred to as "unlifted" SNPs and are written out to a file with the .unlifted extension, with the prefix specified for the output files by the --out argument.

liftOverPlink can also optionally remove unlifted SNPs from the PED file (this is highly recommended), and can update any other file containing a list of rs IDs (passed in to the --dat argument).

If the liftOver executable is not in your $PATH then you can optionally specify a location for liftOverPlink to call it from using the --bin argument.


Addtional Tools

Removing bad lifted SNPs

In the case of lifting from build hg19 to hg38, some of the SNPs that get lifted have "bad" Chromosomes, like "22_KI270879v1_alt" for example. rmBadLifts is designed to handle this case: it removes these SNPs from the MAP file, and creates a file (--log) containing a list of these rsIDs:

Usage

usage: rmBadLifts.py [-h] -m MAPFILE -o OUTFILE -l LOGFILE

rmBadLifts.py goes through a plink MAP file, identifies and removes SNPs not
on chromosomes 1 to 22

optional arguments:
  -h, --help            show this help message and exit
  -m MAPFILE, --map MAPFILE
                        MAP file to process.
  -o OUTFILE, --out OUTFILE
                        New MAP file to output.
  -l LOGFILE, --log LOGFILE
                        File to output the bad rsIDs to.

Workflow

The workflow when handling these SNPs is complicated:

  1. Create a "lifted" MAP file.
  2. Remove bad lifted SNPs from that MAP file.
  3. Generate a good PED file by excluding "unlifted", and bad "lifted" SNPs.
  4. Combine the fixed PED file, with the fixed MAP file.

On Linux this would look something like:

python liftOverPlink.py --map genotypes --out lifted --chain hg19ToHg38.over.chain.gz
python rmBadLifts.py --map lifted.map --out good_lifted.map --log bad_lifted.dat
cut -f 2 bad_lifted.dat > to_exclude.dat
cut -f 4 lifted.unlifted | sed "/^#/d" >> to_exclude.dat 
# Note: this will clobber the lifted MAP file generated by `liftOverPlink`:
plink --file genotypes --recode --out lifted --exclude bad_lifted.dat 
plink --ped lifted.ped --map good_lifted.map --recode --out final