Refactor branch consistently failing with test data on RelocaTE2
Closed this issue ยท 6 comments
Hi Bergman lab,
I'm really excited about the recent updates you've made to the pipeline. I'm looking forward to using it with my own data, which I hope to try soon.
However, I wanted to note that I'm consistently getting errors when trying to run mcclintock on the test dataset. It will fail consistently running RelocaTE2. Oddly enough, sometimes the relocaTE2_run rule fails, while sometimes the relocaTE2_post rule fails (unfortunately after wiping my environment and reinstalling mcclintock, I don't currently have any info from runs where the relocaTE2_post rule failed).
I've seen this happen on two different system (one is a cluster, while the other is a server), both running CentOS 7.x I've followed the installation instructions in the README, including updating to the latest release of conda.
Here is the script I used to run the test analysis:
#!/usr/bin/env bash
python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 25 \
-o test_out 2>&1 | tee test.log
Here is the error message I'm getting from the main log:
[Fri Jun 19 14:07:59 2020]
Error in rule relocaTE2_run:
jobid: 24
output: /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/results/ALL.all_nonre
f_insert.gff, /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/results/ALL.all_ref_insert.gff
conda-env: /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa
RuleException:
CalledProcessError in line 579 of /datahome/oenothera/mcclintock/test_out/snakemake/3412670/Snakefile:
Command 'source /home/progs/anaconda2/bin/activate '/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa'; set -euo pipefail; /home/progs/anaconda2/envs/mcclintock/bin/python3.8 /datahome/oenothera/mcclintock/test_out/snakemake/3412670/.snakemake/scripts/tmp196igscz.relocate2_run.py' returned non-zero exit status 1.
File "/datahome/oenothera/mcclintock/test_out/snakemake/3412670/Snakefile", line 579, in __rule_relocaTE2_run
File "/home/progs/anaconda2/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run
I'm also seeing the following error message from the relocaTE2.log file:
[main] CMD: /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/bwa sampe /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/input/reference.fasta /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/bwa_aln/reference.SRR800842_1.te_repeat.flankingReads.fq.matched.fullreads.bwa.mates.sai /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/bwa_aln/reference.SRR800842_2.te_repeat.flankingReads.fq.matched.fullreads.bwa.mates.sai /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq.matched.fullreads.fq /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq.matched.fullreads.fq
[main] Real time: 3.828 sec; CPU: 3.338 sec
[W::bam_merge_core2] No @HD tag found.
/datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq
/datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq
/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/clean_pairs_memory.py --fq1 /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq --fq2 /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq --repeat /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/te_containing_fq --fq_dir /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/input/fastq --seqtk /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/seqtk
mergeing bam file: 2/2 files
mergeing fullread bam file: 1/1 files
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_3/0.te_repeat.blat.sh
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_3/1.te_repeat.blat.sh
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_4/step_4.reference.repeat.align.sh
Traceback (most recent call last):
File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 741, in <module>
main()
File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 643, in main
existingTE_RM_ALL(top_dir, reference_ins_flag)
File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 155, in existingTE_RM_ALL
if unit[9] == '+':
IndexError: list index out of range
It looks like the existingTE.bed
file is empty.
I've also attached the full log of my most recent attempt, along with the full relocaTE2 log. If I can provide any more information or if you like me to send an archive of the output directory, please let me know.
Thanks,
Dave
Also just wanted to note that I reran without relocate2 and the rest of the pipeline worked properly:
----------------------------------
MCCLINTOCK SUMMARY REPORT
----------------------------------
Command:
python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 25 \
-m trimgalore,coverage,ngs_te_mapper,relocate,temp,retroseq,popoolationte,popoolationte2,te-locate \
-o test_out_noRE2
run from directory: /datahome/oenothera/mcclintock
Started: 2020-06-19 17:44:19
Completed: 2020-06-19 19:19:37
----------------------------------
MAPPED READ INFORMATION
----------------------------------
read1 sequence length: 94
read2 sequence length: 94
read1 reads: 18547824
read2 reads: 18558410
median insert size: 302
avg genome coverage: 273.463
----------------------------------
-----------------------------------------------------
METHOD ALL REFERENCE NON-REFERENCE
-----------------------------------------------------
ngs_te_mapper 33 19 14
relocate 80 63 17
relocate2 NA NA NA
temp 366 311 55
retroseq 69 0 69
popoolationte 142 131 11
popoolationte2 184 160 24
te-locate 713 164 549
-----------------------------------------------------
Best,
Dave
- Thanks @davidecarlson for the RelocaTE2 bug report & feedback that the rest of system is work. We'll try to get an answer to why this bug is happening asap.
@davidecarlson thanks for letting us know about this issue.
- I've run RelocaTE2 myself with the test data on a clean install and I haven't been able to reproduce your error, so I may need additional files from you to determine the issue.
- Based on the error reported by Relocate2, it looks like this issue is related to the repeatmasker
.out
file that is provided to RelocaTE2 which is either malformed or empty. It is trying to check the 10th column of one of the lines in the repeatmasker.out
, but that line does not have 10 columns. - This repeatmasker
.out
file is produced by an earlier step in the mcclintock pipeline (https://github.com/bergmanlab/mcclintock/blob/refactor/Snakefile#L402-L420) which uses theprocessing.log
file to log stdout and stderr of the repeatmasker call. - My guess is an uncaught error occurred in this repeatmasking step which resulted in a malformed repeatmasker
.out
file being passed to RelocaTE2, which then caused the parsing error that you reported. - To confirm this and to help determine a fix, could you provide me with the
/datahome/oenothera/mcclintock/test_out/logs/2020-06-19_13.11.51_3412670/processing.log
file so I can see if there are any errors reported from the repeatmasking step? - also could you also provide the
/datahome/oenothera/mcclintock/test_out/method_input/sacCer2.repeatmasker.out
file (if it's not empty) so I can check it out?
Hi Preston,
Thanks for looking into this! I've attached the processing.log and repeatmasker.out files. Note that I manually changed RM file extension so that Github would let me attach it (this was done after I downloaded the file off my server).
Please let me know if I can provide anything else.
Best,
Dave
processing.log
sacCer2.repeatmasker.out.txt
- Ok I'm pretty sure I found the issue. The
sacCer2.repeatmasker.out
file you provided was actually formatted as a gff file, while RelocaTE2 requires the repeatmasker.out
file. This indicates to me that the part of the repeatmask script that is locating the repeatmasker.out
file was not specific enough and could return the wrong file. - I believe I've fixed this in the latest commit: 4b13809
- Please pull these changes when you get the chance and try and run RelocaTE2 and let me know how it goes.
Hi Preston,
That fixed the issue! Thanks! I pulled the changed and reran the test analysis:
----------------------------------
MCCLINTOCK SUMMARY REPORT
----------------------------------
Command:
python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 35 \
-o test_out
run from directory: /datahome/oenothera/mcclintock
Started: 2020-06-22 14:55:48
Completed: 2020-06-22 16:16:08
----------------------------------
MAPPED READ INFORMATION
----------------------------------
read1 sequence length: 94
read2 sequence length: 94
read1 reads: 18547821
read2 reads: 18558412
median insert size: 302
avg genome coverage: 273.445
----------------------------------
-----------------------------------------------------
METHOD ALL REFERENCE NON-REFERENCE
-----------------------------------------------------
ngs_te_mapper 33 19 14
relocate 80 63 17
relocate2 139 41 98
temp 366 311 55
retroseq 69 0 69
popoolationte 142 130 12
popoolationte2 182 158 24
te-locate 713 164 549
-----------------------------------------------------
Appreciate the help! I'm going to run this version on some of my own data in the near future. I'll let you know how it goes.
Best,
Dave