SPACHETE

Flow of SPACHETE:

SPACHETE_wrapper.sh, which is just a wrapper script to assign paths and then pass them into spachete_feeder.py
spachete_feeder.py then creates output directories, and calls spachete_run.py in a parallel way on each stem
spachete_run.py is basically "run.py" but with a SPORK call placed in there and some machete calls removed the actual SPORK call is to "SPORK_main.py"

The output directory of SPACHETE will have a subdirectory for each stem, and each of these subdirectories will look the same as that of MACHETE, with an additional folder called "spork_out" which has the spork run data and information

├── SPACHETE_out_CML
    ├── SRR3192409
       ├── BadFJ
       ├── BowtieIndels
       ├── BowtieIndex
       ├── DistantPEFiles
       ├── err_and_out
       ├── FarJuncIndels
       ├── FarJuncSecondary
       ├── FarJunctionAlignments
       ├── fasta
       ├── GLM_classInput
       ├── IndelsHistogram
       ├── machete_SRR3192409.err
       ├── machete_SRR3192409.out
       ├── reports
       ├── slurm_SRR3192409.err
       ├── slurm_SRR3192409.out
       ├── *spork_out*
       └── StemList.txt
    ├── SRR3192410
    ├── SRR3192411
    ├── SRR3192412
    ├── SRR3192413
    ├── SRR3192415
    ├── SRR3192416
    ├── SRR3192417
    ├── SRR3192422
    └── StemList.txt

Changes should only need to be made in the SPACHETE_wrapper.sh in order to get spachete working.

In this run_SPACHETE.sh information such as data input paths, data output paths, and index paths will likely need to be changed, but most of them will be identical to what MACHETE currently uses on 7-Bridges:

KNIFE_DIR is the KNIFE output directory (same as MACHETE), that contains the orig file (i.e. "/scratch/PI/horence/gillian/CML_UConn/circpipe_K562")
OUT_DIR is the directory location to save SPACHETE output (i.e. "/scratch/PI/horence/rob/spachete_outputs/SPACHETE_out_CML"). In this directory a subdirectory will be created for each stem in the KNIFE_DIR.
EXONS_DIR should be the same path as that used for MACHETE (i.e. "/scratch/PI/horence/gillian/HG19exons")
INDEL_INDICES is the same path as that used for MACHETE (i.e. "/scratch/PI/horence/gillian/HG19_reg_indels/toyIndelIndices/")
CIRC_REF is the same path as that used for MACHETE (i.e. "/share/PI/horence/circularRNApipeline_Cluster/index")
optionally can include a line to instruct SPACHETE to only run on certain stems. An example is STEM_INCLUDE_ONLY_LIST=("SRR3192413" "SRR3192415") which only only run these two stems. If STEM_INCLUDE_ONLY_LIST is not defined or has nothing in it, all stems will be processed (note I'm not sure if any of STEM_INCLUDE_ONLY_LIST handling are bash-specific in run_SPACHETE.sh)

MACHETE

More Accurate vs Mismatched Alignment CHimEra Tracking Engine

MACHETE is a fusion detection software that is in development.

REQUIREMENTS:

KNIFE - https://github.com/lindaszabo/KNIFE NOTE -- before running KNIFE, please make sure your fastq files are in the following format:

"UNIQUE STRING"_"read1 or 2 file notation".fq

some examples of acceptable things are pairs called ABCD_EFG_HIJ_KLM_1.fq & ABCD_EFG_HIJ_KLM_2.fq

or ABCD_EFG_HIJ_KLM_R1.fq & ABCD_EFG_HIJ_KLM_R2.fq

or ABCD_EFG_HIJ_KLM_001.fq & ABCD_EFG_HIJ_KLM_002.fq

The last field before ".fq" has to be the part that indicates whether the file contains R1 or R2 partners.

R version 3.0.2 - or a later version but with the package "data.table" installed
Bowtie2
python version 2.7.5

Before you use: Please download "PICKLES" -- the HG19 annotated exon information created for the purpose of this script

For Sherlock users, this is the directory "/scratch/PI/horence/gillian/HG19exons/" For Sherlock users, please download the REG INDELS INDICES from - /scratch/PI/horence/gillian/HG19_reg_indels/IndelIndices/

For other users, please email glhsieh@stanford.edu for this directory.

Requires paired end reads but aligns in a single end fashion.

Please open and change the script createFarJunctions_SLURM.sh

line31 - change your INSTALLDIR to the full path to the MACHETE script. Last char must be "/" line32 - change CIRCREF to the path to the reference libraries generated for KNIFE e.g. directory that contains hg19_genome, hg19_transcriptome, hg19_junctions_reg and hg19_junctions_scrambled bowtie indices. line41 - PICKLEDIR - please change this path to the path that you used to store the HG19exons directory above

Running MACHETE: First run KNIFE script completely to generate linear and scrambled junction reports and alignments.

sh createFarJunctions_SLURM.sh <1. KNIFE parent directory> <2. output directory> <3. discordant read distance> <4. ref genome> <5. #indels to use> <6. special queue>

NOTE -- ALL directory inputs must end in "/"

KNIFE parent directory - contains output from the KNIFE algorithm. This directory is the path to the directory that contains "circReads", "orig", "logs", "sampleStats"
output directory - must already be in existence.
discordant read distance -- In testing mode, using 100000 base pairs to identify paired end reads that aligned discordantly.
"HG19" is the only option currently
Currently using KNIFE convention of "8" for files with read lengths < 70 and "13" for files with read lengths > 70
-- "owners" if you want to run in owners queue, otherwise leave #6 blank

Example command: sh createFarJunctions_SLURM.sh /scratch/PI/horence/alignments/EWS_FLI_bigmem/ /scratch/PI/horence/alignments/EWS_FLI_bigmem/FarJunc/ 100000 HG19 13 owners

======================================================

DEBUGGING THE CODE once you run the MACHETE, you will notice a file in the MACHETE output directory called "MasterError.txt" Below is the workflow for the MACHETE and there are several parallel steps and several sequential steps. All of these must complete for each unique pair of files for the MACHETE to be complete.

SEQUENCE1 and SEQUENCE 5 are started in parallel

SEQUENCE 1 j1 - sort reg j2 - sort genome j3 - PE finder.sh j4 - DistantPE_counter j5 - sortPairedEnds j6 - make Junctions j6a - linkFastaFiles

From j6, three separate sequences (SEQUENCE 2, 3, 4) are called in parallel

SEQUENCE 2 j7 - LenientBadFJ.sh dependstr7 - BadFJ & BadFJver2

SEQUENCE 3 j8 - alignUnalignedFJ j9 - FJNaiveRept j15a - parse_FJ_ID_for_GLM

SEQUENCE 4 j10 - makeIndelFiles j11 - BowtieIndexFJindels J13 - BowtieAlignFJIndels j14 - FindAlignmentArtifact j19 - FJIndelsClassOutput

SEQUENCE 5 (concurrently with SEQUENCE 1) j16 - AlignUnalignedtoRegIndel-> j17 - AddIndelstoLinearGLM

j18 - regIndelsClassOutput

When SEQUENCE 3-5 are complete then the script j15b - run_GLM.sh is called

When SEQUENCE2 and j15b - run_GLM.sh are complete -- j15 - AppendNaiveRept.sh is called. This completes the MACHETE.

========================================= Output directories

DistantPEfiles - _distant_pairs.txt -- list of three columns - readID / position1 / position2. Position1 and Position2 are the windows surrounding "discordant" bases from either the genome or reg alignment files. Subdirectory : A. file chr(1,2,3,..,X,Y)_Distant_PE_frequency.txt. If, for the pair of windows, the upstream location is on chrA, then it will be binned into the chrA_Distant_PE_frequency.txt file. Three columns appear -- Window1, Window2, and the number of times that these two windows were matched. B. file sorted_chr(1,2,3,..)_Distant_PE_frequency.txt -- File A, sorted on the numerical value of the first column followed by the numerical values of the second column, to reduce parsing time later.
fasta - Subdirectory : A: _chr(1,2,3,..)FarJunctions.fa - a fusion reference fasta file is created from the corresponding sorted_chr(1,2,3,...)Distant_PE_Frequency. _FarJunctions.fa -- the concatenated list of all fasta entries from fasta//_chr*_FarJunctions.fa
BadFJ/ - A. err/out.txt- error and output files from bowtie alignment of fasta to reference dictionaries B. _BadFJto(Genome/Reg/Junc/transcriptome).sam - aligns fasta/_FarJunctions.fa to the KNIFE reference indices Genome, Reg, Junc, and transcriptome with bowtie parameters "-f --no-sq --no-unal --score-min L,0,-0.24 --n-ceil L,0,100 -p 4 --np 0 --rdg 50,50 --rfg 50,50"
BadFJ_ver2/ - A. err/out.txt- error and output files from bowtie alignment of fasta to reference dictionaries B. _FarJunctions_R1/2.fa - contains reads from fasta/_FarJunctions.fa where all N's are removed from reads and the first 40 nt of the original fasta read is fed into the R1 file and the last 40 nt of the original fasta read is fed into the R2 file. C. _BadFJto(Genome/Reg/Junc/transcriptome).sam - aligns BadFJ_ver2/_FarJunctions_R1/2.fa to the KNIFE reference indices Genome, Reg, Junc, and transcriptome as if they were paired end alignments to allow a read gap. Bowtie parameters are "--no-unal --no-mixed --no-sq -p 8 -I 0 -X 50000 -f --ff"
BowtieIndex - Subdirectory A: _FJ_Index*bt2 -- corresponding bowtie indices are created for each of the fasta files from the fasta/_chr(1,2,3...)_FarJunctions.fa
FarJunctionAlignments - Subdirectory A. unaligned__R1/2.sam - contains reads that were unaligned to the KNIFE reference indices that have been aligned to the bowtie indices in the BowtieIndex// directory
FarJuncSecondary - Subdirectory A. still_unaligned_R1/2.fq - contains reads that were unaligned to the KNIFE reference indices that did not align to the FarJunctions bowtie indices in the BowtieIndex// directory Subdirectory AlignedIndels// A. still_unalignedR1/2_indels(1-5).sam - contains reads that were unaligned to the KNIFE reference indices, did not align to the FarJunctions bowtie indices, but DID align to an the expanded FarJunctions indels bowtie indices. B. AllR1/2_indels.sam -- concatenates all still_unaligned_R1/2_indels(1-5).sam files but removes reads that aligned to multiple indel indices and keeps the one with the best alignment score and removes reads that aligned to the indel indices but did not overlap the junction by the user specified number of nt.
FarJuncIndels

< "IF script doesn't have "INDELS" in glmReports dir means that indels were omitted " >"'

salzmanlab-admin/SPACHETE

SPACHETE

Changes should only need to be made in the SPACHETE_wrapper.sh in order to get spachete working.

MACHETE