MIPSTR

Calling short tandem repeat genotypes using molecular inversion probes.

Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0).

MIPSTR v2 code package

The paper using this code is now out in Genome Research, doi:10.1101/gr.231753.117.

Originally developed (v1) by Keisha Carlson and Peter Sudmant (described in Carlson et al. 2015 Genome Research).

Updated by Maximilian Press and Ashley Hall (v2).

This code will not be supported, other than to deal with obvious bugs. Direct inquiries to queitsch@uw.edu. The code is provided in the interest of openness and reproducibility. It is not intended to be extended beyond the application of the authors (Press et al. 2018, ms in revision), though users are welcome to do so should they desire.

dependencies (in path):

  • Linux OS
  • python 2.7
  • NumPy 1.6.1
  • R >=3
  • bwa (with MEM)
  • samtools
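Before a long run, it can help to confirm that the external tools are actually on the PATH. The following is a minimal sketch of such a check, not part of the MIPSTR package. It is a standalone helper written for Python 3 (separate from the Python 2.7 pipeline code), and the executable names `bwa`, `samtools`, and `R` are assumptions about how these tools are installed on your system.

```python
import shutil

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("bwa", "samtools", "R")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that the executables can be found; it does not check versions (for example, that the installed bwa supports MEM).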

DIRS:

  • code/ : contains code.
  • full_mip_design/ : contains data (output from mipgen or generated by user) to make synthetic references, perform genotype calling, etc.
  • references/ : initially empty; filled with synthetic reference fastas as they are made.

FILES:

  • MIPSTR_runner.sh: runs the pipeline steps in order, which should generate STR genotypes given MIP designs from mipgen and sequencing data in fastq format.

Fastq data for testing: files too large for github, but can be downloaded here

  • Cvi-NewMIP_S4_L001_R1_001.fastq.gz: forward 250bp reads from a MIPSTR library (a pilot experiment)
  • Cvi-NewMIP_S4_L001_R2_001.fastq.gz: reverse 50bp reads from a MIPSTR library (a pilot experiment)

Place these files in the main MIPSTR directory, which is where the runner script expects them to be.

notes on running the code:

This code is provided as an example of how the pipeline can be run on a single sample. To run multiple samples in any practical situation, you will likely need to run parallel jobs for various parts of the pipeline. Mapping especially takes a long time (several hours per sample on a single thread), so running samples sequentially could take a very long time. I have packaged the code so that, with minimal modifications, multiple jobs can be run at the same time. I ran this analysis on a large Linux cluster with multiple cores available, so it was straightforward to run many samples in parallel. However, I did not want to assume that everyone has this system architecture, and I wanted to keep this example simple; hence the present code.
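For readers without a cluster scheduler, one way to run several per-sample jobs at once on a single multi-core machine is sketched below. This is an illustration only, not how the authors ran the analysis: the helper `run_commands_in_parallel` is hypothetical, it is written for Python 3, and the commands in the usage section are placeholders that must be replaced with the real per-sample steps taken from the runner script.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_commands_in_parallel(commands, max_workers=4):
    """Run each command (a list of argv strings) as its own process.

    Threads suffice here because each worker only waits on an external
    process. Exit codes are returned in the same order as `commands`.
    """
    def run_one(argv):
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Placeholder commands standing in for the real per-sample steps.
    samples = ["sampleA", "sampleB"]
    commands = [
        [sys.executable, "-c", "print('processing %s')" % sample]
        for sample in samples
    ]
    print(run_commands_in_parallel(commands, max_workers=2))
```

Set `max_workers` no higher than the number of cores you can spare, since each mapping job is itself long-running.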

The initial steps of creating and indexing references are an up-front cost for a given set of MIP probes, so they do not have to be repeated when running more libraries generated the same way through the pipeline.

To run the pipeline on the provided one-sample dataset, run the following in a bash shell:

    $ sh MIPSTR_runner.sh

This takes about 8 hours on my machine. The code will detect and handle any fastq.gz R1 and R2 file pairs in the main directory, and try to find reads targeted to the STRs indicated in the full_mip_design/ directory. Output will appear in a text file in the main directory when the run finishes. The pipeline is currently set up to output one file per input library, but could be recoded fairly easily to collate everything into one data file (see the commented-out code in the R genotype caller script that does this).
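To preview which libraries would be picked up before starting a long run, the R1/R2 pairing can be approximated as in the sketch below. This is not the runner script's actual detection logic. It assumes Illumina-style file names like the test files (for example `Cvi-NewMIP_S4_L001_R1_001.fastq.gz`), and the function `find_fastq_pairs` is a hypothetical helper written for Python 3.

```python
import os
import re

# Assumed naming scheme: <prefix>_R1_<digits>.fastq.gz / <prefix>_R2_<digits>.fastq.gz
PAIR_PATTERN = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])_(?P<suffix>\d+)\.fastq\.gz$"
)


def find_fastq_pairs(directory="."):
    """Return sorted (R1, R2) file name pairs found in `directory`.

    Files with no matching mate are ignored.
    """
    found = {}
    for name in os.listdir(directory):
        match = PAIR_PATTERN.match(name)
        if match:
            key = (match.group("prefix"), match.group("suffix"))
            found.setdefault(key, {})[match.group("read")] = name
    pairs = []
    for key in sorted(found):
        reads = found[key]
        if "1" in reads and "2" in reads:
            pairs.append((reads["1"], reads["2"]))
    return pairs


if __name__ == "__main__":
    for r1, r2 in find_fastq_pairs("."):
        print(r1, r2)
```

Run from the main MIPSTR directory, this prints one line per complete pair; an R1 file without its R2 mate (or vice versa) is not listed.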