AlphaThal

Pipeline for retrieving reads from the Alpha Thalassemia region

Long Read Pipeline Notes, January 22, 2020:

Scripts in this directory can be used to run the long read pipeline
to retrieve publicly available long reads from the SRA that match a
specific target. Below are instructions for preparing and running the
pipeline on the biowulf cluster at NIH.

INSTALLING SOFTWARE

1. Clone this repository from GitHub:

git clone https://github.com/nhansen/AlphaThal.git

Rename the resulting directory to the name and path of your "top"
directory (see later notes on setting the LONGREADTOPDIR environment
variable).
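
For example, if you want your top directory to be /data/$USER/AlphaThal
(a hypothetical path; use whatever location you prefer), you could move
the cloned directory there:

mv AlphaThal /data/$USER/AlphaThal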

2. Do whatever is necessary to get the sratoolkit running for you on
biowulf (see https://hpc.nih.gov/apps/sratoolkit.html).
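
On biowulf, this typically amounts to loading the sratoolkit module and
running its one-time configuration; the exact module name and steps may
change, so check the HPC page above. A likely sequence is:

module load sratoolkit
vdb-config --interactive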

PREPARING TARGET FILES

The first part of the pipeline requires some files that tell it which
regions of the genome you would like reads from. Essentially, you want
to create BED and FASTA files for unique flanking regions that surround
your (possibly repetitive) targeted parts of the reference, ensuring
that the length of each targeted region (allowing for variation) from
start to end is roughly the length of the long reads you are targeting.
The files you will need should be placed in the "prep" subdirectory of
your top directory, and are:

ref_and_nonref.flank.fasta - a FASTA file of left and right flanking
regions drawn from the reference (e.g., hg19 or GRCh38). The FASTA
entry names should be of the form "<chr>_<start>_<end>", where these
values reflect the 1-based coordinates of the region in the entry.
This FASTA file should also have a FASTA index with ".fai" appended to
its name.
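
For example, an entry for a flanking region might be named
"chr16_215000_216000" (coordinates invented here purely for
illustration). Assuming samtools is available on your system, the index
can be created with:

samtools faidx ref_and_nonref.flank.fasta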

all_baits.bed - a BED file of the regions between the flanking regions
in the previous file, with 0-based coordinates. After the three columns
for chromosome, start, and end, add two more tab-delimited columns: the
fourth column should be a string describing the targeted sequence
(e.g., "LTR5B"), and the fifth column should be a string uniquely
identifying that particular targeted region. So the fourth column can
contain non-unique values, but the fifth column must have values that
are unique for each row of the file.
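
A line of all_baits.bed might therefore look like the following
(tab-separated; the coordinates and names are invented purely for
illustration):

chr16	216000	222000	LTR5B	LTR5B_chr16_216000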

RUNNING THE PIPELINE

1. Set the environment variable LONGREADTOPDIR to the path of your top
directory, e.g.:

export LONGREADTOPDIR=/data/nhansen/HERV_K_catalog/AlphaThal

2. The pipeline can be run either with reads downloaded from the SRA or
with long reads contained in FASTA files in a single directory you
specify. To see which samples in the SRA have long reads available,
along with their coverage, type

$LONGREADTOPDIR/scripts/sh.samplecoverage

3. To launch biowulf jobs to cull reads from a particular SRA biosample,
run the following command:

$LONGREADTOPDIR/scripts/run_longread_pipeline.pl --flankseqs $LONGREADTOPDIR/prep/ref_and_nonref.flank.fasta --baitregions $LONGREADTOPDIR/prep/all_baits.bed --scripts $LONGREADTOPDIR/scripts <BIOSAMPLE> <SAMPLE DIRECTORY>

where <BIOSAMPLE> is the biosample name for your sample shown in the output
of sh.samplecoverage (e.g., SAMN04325239) and <SAMPLE DIRECTORY> is the full
path to a subdirectory of your "refgenotype" directory, where results for
this sample will be written.
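
For example, to process biosample SAMN04325239 and write results to a
sample subdirectory of refgenotype (here assumed, purely for
illustration, to live under the top directory):

$LONGREADTOPDIR/scripts/run_longread_pipeline.pl --flankseqs $LONGREADTOPDIR/prep/ref_and_nonref.flank.fasta --baitregions $LONGREADTOPDIR/prep/all_baits.bed --scripts $LONGREADTOPDIR/scripts SAMN04325239 $LONGREADTOPDIR/refgenotype/SAMN04325239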

4. To use FASTA files of long reads contained in a directory, use the "--localfastadir" option:

$LONGREADTOPDIR/scripts/run_longread_pipeline.pl --flankseqs $LONGREADTOPDIR/prep/ref_and_nonref.flank.fasta --baitregions $LONGREADTOPDIR/prep/all_baits.bed --scripts $LONGREADTOPDIR/scripts --localfastadir /path/to/myfastafiles <BIOSAMPLE> <SAMPLE DIRECTORY>