weng-lab/TEMP2

no junction supporting reads

Closed this issue · 1 comments

Hello,

  • I've noticed that when checking the results from runs of TEMP2 with various samples, the number of supporting reads for the 3' and 5' junctions (col. 14 & 15 of test.insertion.bed) are always reported as 0. Do you know why this might be occurring? It seems unlikely to me that none of the predicted insertions from any of the runs I've done would have junction-supporting reads.
  • When I run the original TEMP on the same samples, I get many non-zero values for the junction supporting reads (col. 10 & 12 of .insertion.refined.bp.summary). Based on the manuals, I assume that these are roughly the same metrics in both methods, but I may be wrong.
  • Also, could you confirm that the values from col. 14&15 of the test.insertion.bed are "split-read" evidence? I am interpreting these values as the number of reads that are split at the junction point, with part mapping to the TE and the other part mapping to the reference location. Just want to make sure this is correct.

Thanks for the help! TEMP and TEMP2 are very useful tools and I just want to make sure I am running and interpreting the results correctly.

-Preston

  • Also, I've added the conda environments and commands I used to run TEMP & TEMP2 on an example dataset so that you could replicate this issue if needed

TEMP conda env

name: temp
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - perl-bioperl-run=1.006900
  - bwa=0.7.4
  - bedtools=2.17.0
  - samtools=0.1.19
  - ucsc-fatotwobit=366
  - ucsc-twobittofa=366
  - sra-tools

TEMP2 conda env

name: temp2
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - bwa=0.7.17
  - samtools=1.11
  - bedtools=2.30.0
  - bedops=2.4.39
  - trinity=2.9.1
  - R=4.0
  - ucsc-bedtobigbed=377
  - perl-bioperl-run=1.007002
  - ucsc-twobittofa=377
  - ucsc-fatotwobit=377

running TEMP and TEMP2

# install dependencies
conda env create -f temp.yml --name temp
conda activate temp

# download reads
mkdir data
cd data
prefetch SRR7851840 -o ./SRR7851840.sra
fastq-dump -O . ./SRR7851840.sra --split-e

# download reference
wget https://raw.githubusercontent.com/bergmanlab/mcclintock/master/test/sacCer2.fasta

# map reads
bwa index sacCer2.fasta
bwa mem -t 20 sacCer2.fasta SRR7851840_1.fastq SRR7851840_2.fastq > SRR7851840.sam
samtools view -@ 20 -Sb SRR7851840.sam > SRR7851840.bam
samtools sort -@ 20 SRR7851840.bam SRR7851840.sorted
samtools index SRR7851840.sorted.bam

# make ref TE bed
wget https://raw.githubusercontent.com/bergmanlab/mcclintock/master/test/reference_TE_locations.gff
cat reference_TE_locations.gff | sed 's/ID=//g' | awk -v OFS="\t" -v s=1 '{print $1,$4-1,$5,$9,".",$7}' > reference_TE_locations.bed

# get taxonomy file and consensus TE fasta
wget https://raw.githubusercontent.com/bergmanlab/mcclintock/master/test/sac_cer_te_families.tsv
wget https://raw.githubusercontent.com/bergmanlab/mcclintock/master/test/sac_cer_TE_seqs.fasta

# calculate median insert size
median_insertsize=`cut -f9 SRR7851840.sam | sort -n | awk '{if ($1 > 0) ins[reads++]=$1; } END { print ins[int(reads/2)]; }'`

# run TEMP
mkdir ../temp_out
cd ../temp_out/
time bash ~/git/mcclintock/install/tools/temp/scripts/TEMP_Insertion.sh \
  -x 30 \
  -i ../data/SRR7851840.sorted.bam \
  -s ~/git/mcclintock/install/tools/temp/scripts/ \
  -r ../data/sac_cer_TE_seqs.fasta \
  -t ../data/reference_TE_locations.bed \
  -u ../data/sac_cer_te_families.tsv \
  -m 1 \
  -f $median_insertsize \
  -c 20 \
  -o .

# run TEMP2
conda env create -f temp2.yml --name temp2
conda activate temp2
mkdir ../temp2_out/
cd ../temp2_out/
time bash ~/git/mcclintock/install/tools/temp2/TEMP2 insertion \
  -l ../data/SRR7851840_1.fastq \
  -r ../data/SRR7851840_2.fastq \
  -i ../data/SRR7851840.sorted.bam \
  -g ../data/sacCer2.fasta \
  -R ../data/sac_cer_TE_seqs.fasta \
  -t ../data/reference_TE_locations.bed \
  -c 20 -f $median_insertsize \
  -o . \
  -M 2 \
  -m 5 \
  -U 0.8 \
  -N 300

Thanks for using TEMP and TEMP2!
Actually, column 14th/15th is split reads defined as those concordantly aligned reads but one end of it both align to the reference genome and transposon. This is a bit different from TEMP, as TEMP will look back if any supporting read is split (TEMP2 only considers discordantly aligned reads) after identified all the insertions. Also, the number of split reads in the 14th/15th columns of TEMP2 depends on how long the reads are, as the script needs at least 20–25 overhang in both reference genome and transposon. However, in TEMP, a read is considered as a split read once it is soft clipped (>1nt).