bergmanlab/mcclintock

Refactor branch consistently failing with test data on RelocaTE2

Closed this issue ยท 6 comments

Hi Bergman lab,

I'm really excited about the recent updates you've made to the pipeline. I'm looking forward to using it with my own data, which I hope to try soon.

However, I wanted to note that I'm consistently getting errors when trying to run mcclintock on the test dataset. It will fail consistently running RelocaTE2. Oddly enough, sometimes the relocaTE2_run rule fails, while sometimes the relocaTE2_post rule fails (unfortunately after wiping my environment and reinstalling mcclintock, I don't currently have any info from runs where the relocaTE2_post rule failed).

I've seen this happen on two different system (one is a cluster, while the other is a server), both running CentOS 7.x I've followed the installation instructions in the README, including updating to the latest release of conda.

Here is the script I used to run the test analysis:

#!/usr/bin/env bash

python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 25 \
-o test_out 2>&1 | tee test.log

Here is the error message I'm getting from the main log:

[Fri Jun 19 14:07:59 2020]
Error in rule relocaTE2_run:
    jobid: 24
    output: /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/results/ALL.all_nonre
f_insert.gff, /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/results/ALL.all_ref_insert.gff
    conda-env: /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa

RuleException:
CalledProcessError in line 579 of /datahome/oenothera/mcclintock/test_out/snakemake/3412670/Snakefile:
Command 'source /home/progs/anaconda2/bin/activate '/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa'; set -euo pipefail;  /home/progs/anaconda2/envs/mcclintock/bin/python3.8 /datahome/oenothera/mcclintock/test_out/snakemake/3412670/.snakemake/scripts/tmp196igscz.relocate2_run.py' returned non-zero exit status 1.
  File "/datahome/oenothera/mcclintock/test_out/snakemake/3412670/Snakefile", line 579, in __rule_relocaTE2_run
  File "/home/progs/anaconda2/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run

I'm also seeing the following error message from the relocaTE2.log file:

[main] CMD: /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/bwa sampe /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/input/reference.fasta /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/bwa_aln/reference.SRR800842_1.te_repeat.flankingReads.fq.matched.fullreads.bwa.mates.sai /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/bwa_aln/reference.SRR800842_2.te_repeat.flankingReads.fq.matched.fullreads.bwa.mates.sai /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq.matched.fullreads.fq /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq.matched.fullreads.fq
[main] Real time: 3.828 sec; CPU: 3.338 sec
[W::bam_merge_core2] No @HD tag found.
/datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq
/datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq
/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/clean_pairs_memory.py --fq1 /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_1.te_repeat.flankingReads.fq --fq2 /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/flanking_seq/SRR800842_2.te_repeat.flankingReads.fq --repeat /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/repeat/te_containing_fq --fq_dir /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/input/fastq --seqtk /datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/seqtk
mergeing bam file: 2/2 files
mergeing fullread bam file: 1/1 files
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_3/0.te_repeat.blat.sh
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_3/1.te_repeat.blat.sh
job: sh /datahome/oenothera/mcclintock/test_out/results/relocaTE2/unfiltered/shellscripts/step_4/step_4.reference.repeat.align.sh
Traceback (most recent call last):
  File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 741, in <module>
    main()
  File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 643, in main
    existingTE_RM_ALL(top_dir, reference_ins_flag)
  File "/datahome/oenothera/mcclintock/install/envs/conda/ed4c2eaa/bin/relocaTE2.py", line 155, in existingTE_RM_ALL
    if unit[9] == '+':
IndexError: list index out of range

It looks like the existingTE.bed file is empty.

I've also attached the full log of my most recent attempt, along with the full relocaTE2 log. If I can provide any more information or if you like me to send an archive of the output directory, please let me know.

Thanks,
Dave

test.log
relocaTE2.log

Also just wanted to note that I reran without relocate2 and the rest of the pipeline worked properly:

----------------------------------
MCCLINTOCK SUMMARY REPORT
----------------------------------
Command:
python3 mcclintock.py  \
        -r test/sacCer2.fasta  \
        -c test/sac_cer_TE_seqs.fasta  \
        -g test/reference_TE_locations.gff  \
        -t test/sac_cer_te_families.tsv  \
        -1 test/SRR800842_1.fastq.gz  \
        -2 test/SRR800842_2.fastq.gz  \
        -p 25  \
        -m trimgalore,coverage,ngs_te_mapper,relocate,temp,retroseq,popoolationte,popoolationte2,te-locate  \
        -o test_out_noRE2

run from directory: /datahome/oenothera/mcclintock
Started:    2020-06-19 17:44:19
Completed:  2020-06-19 19:19:37

----------------------------------
MAPPED READ INFORMATION
----------------------------------
read1 sequence length:  94
read2 sequence length:  94
read1 reads:            18547824
read2 reads:            18558410
median insert size:     302
avg genome coverage:    273.463
----------------------------------

-----------------------------------------------------
METHOD          ALL       REFERENCE    NON-REFERENCE
-----------------------------------------------------
ngs_te_mapper   33        19           14
relocate        80        63           17
relocate2       NA        NA           NA
temp            366       311          55
retroseq        69        0            69
popoolationte   142       131          11
popoolationte2  184       160          24
te-locate       713       164          549
-----------------------------------------------------

Best,
Dave

  • Thanks @davidecarlson for the RelocaTE2 bug report & feedback that the rest of system is work. We'll try to get an answer to why this bug is happening asap.

@davidecarlson thanks for letting us know about this issue.

  • I've run RelocaTE2 myself with the test data on a clean install and I haven't been able to reproduce your error, so I may need additional files from you to determine the issue.
  • Based on the error reported by Relocate2, it looks like this issue is related to the repeatmasker .out file that is provided to RelocaTE2 which is either malformed or empty. It is trying to check the 10th column of one of the lines in the repeatmasker .out , but that line does not have 10 columns.
  • This repeatmasker .out file is produced by an earlier step in the mcclintock pipeline (https://github.com/bergmanlab/mcclintock/blob/refactor/Snakefile#L402-L420) which uses the processing.log file to log stdout and stderr of the repeatmasker call.
  • My guess is an uncaught error occurred in this repeatmasking step which resulted in a malformed repeatmasker .out file being passed to RelocaTE2, which then caused the parsing error that you reported.
  • To confirm this and to help determine a fix, could you provide me with the /datahome/oenothera/mcclintock/test_out/logs/2020-06-19_13.11.51_3412670/processing.log file so I can see if there are any errors reported from the repeatmasking step?
  • also could you also provide the /datahome/oenothera/mcclintock/test_out/method_input/sacCer2.repeatmasker.out file (if it's not empty) so I can check it out?

Hi Preston,
Thanks for looking into this! I've attached the processing.log and repeatmasker.out files. Note that I manually changed RM file extension so that Github would let me attach it (this was done after I downloaded the file off my server).

Please let me know if I can provide anything else.

Best,
Dave
processing.log
sacCer2.repeatmasker.out.txt

  • Ok I'm pretty sure I found the issue. The sacCer2.repeatmasker.out file you provided was actually formatted as a gff file, while RelocaTE2 requires the repeatmasker .out file. This indicates to me that the part of the repeatmask script that is locating the repeatmasker .out file was not specific enough and could return the wrong file.
  • I believe I've fixed this in the latest commit: 4b13809
  • Please pull these changes when you get the chance and try and run RelocaTE2 and let me know how it goes.

Hi Preston,
That fixed the issue! Thanks! I pulled the changed and reran the test analysis:

----------------------------------
MCCLINTOCK SUMMARY REPORT
----------------------------------
Command:
python3 mcclintock.py  \
        -r test/sacCer2.fasta  \
        -c test/sac_cer_TE_seqs.fasta  \
        -g test/reference_TE_locations.gff  \
        -t test/sac_cer_te_families.tsv  \
        -1 test/SRR800842_1.fastq.gz  \
        -2 test/SRR800842_2.fastq.gz  \
        -p 35  \
        -o test_out

run from directory: /datahome/oenothera/mcclintock
Started:    2020-06-22 14:55:48
Completed:  2020-06-22 16:16:08

----------------------------------
MAPPED READ INFORMATION
----------------------------------
read1 sequence length:  94
read2 sequence length:  94
read1 reads:            18547821
read2 reads:            18558412
median insert size:     302
avg genome coverage:    273.445
----------------------------------

-----------------------------------------------------
METHOD          ALL       REFERENCE    NON-REFERENCE
-----------------------------------------------------
ngs_te_mapper   33        19           14
relocate        80        63           17
relocate2       139       41           98
temp            366       311          55
retroseq        69        0            69
popoolationte   142       130          12
popoolationte2  182       158          24
te-locate       713       164          549
-----------------------------------------------------

Appreciate the help! I'm going to run this version on some of my own data in the near future. I'll let you know how it goes.
Best,
Dave