RelocaTE and RelocaTE2 run infinitely or freeze
lhemmer opened this issue · 38 comments
Hello Casey and others,
I have been working with your test data as well as some of my own data in your pipeline, since I've been using the older version of McClintock for quite some time. I was excited to include RelocaTE and RelocaTE2, but sadly these are the programs giving me the most headaches.
Everything runs smoothly on the test data after installation; I haven't seen any problems in the log outputs there. However, when I run the pipeline on my own data, both with the defaults and with specific programs selected, there are problems. RelocaTE shows output in the log file but takes a very long time: I ran the default pipeline for 120 hours on our remote cluster and it timed out before RelocaTE finished, even though every other program (except RelocaTE2, which I'll explain) completed within hours of starting the job. The fastq files are larger, 7.5 GB for each of the paired ends, but even with multi-threading this seems too long.
Additionally, RelocaTE2 runs indefinitely, and there's almost no information I can gather about where the problem occurs. This is the log file I have so far after running the pipeline with just Trim Galore and RelocaTE2 for nearly 48 hours (all of this text was already there after 30 minutes of run time).
output/48_all
Job counts:
count jobs
1 index_reference_genome
1 make_consensus_fasta
1 make_reference_fasta
1 map_reads
1 median_insert_size
1 relocaTE2_post
1 relocaTE2_run
1 repeatmask
1 setup_reads
1 summary_report
10
PROCESSING making consensus fasta
PROCESSING consensus fasta created
PROCESSING making reference fasta
PROCESSING reference fasta created
PROCESSING making samtools and bwa index files for reference fasta &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/processing.log
PROCESSING samtools and bwa index files for reference fasta created
PROCESSING prepping reads for McClintock
PROCESSING running trim_galore &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/trimgalore.log
PROCESSING read setup complete
PROCESSING Running RepeatMasker for RelocaTE2 &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/processing.log
PROCESSING Repeatmasker for RelocaTE2 complete
PROCESSING mapping reads to reference &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/bwa.log
PROCESSING read mapping complete
PROCESSING calculating median insert size of reads
PROCESSING median insert size of reads calculated
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
So there is no explanation of what might be going wrong, though I did notice that the last line looks like an error:
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
This line is also present in the log output of the full McClintock pipeline on the test dataset, but not in the RelocaTE2 log output there. I'm wondering how RelocaTE2 completely stalls or runs indefinitely without any visible progress when it works on the test data. The only other clue I can offer is the error message given when the job hits the time limit, but I see similar messages whenever I stop the pipeline prematurely.
RuleException:
CalledProcessError in line 578 of /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile:
Command 'source /scratch/lhemmer/programs/miniconda3/bin/activate '/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2'; set -euo pipefail; /scratch/lhemmer/programs/miniconda3/envs/mcclintock/bin/python3.8 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/.snakemake/scripts/tmp70q2ux3x.relocate2_run.py' died with <Signals.SIGTERM: 15>.
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile", line 578, in __rule_relocaTE2_run
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run
RuleException:
CalledProcessError in line 525 of /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile:
Command 'source /scratch/lhemmer/programs/miniconda3/bin/activate '/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/272a644b'; set -euo pipefail; /scratch/lhemmer/programs/miniconda3/envs/mcclintock/bin/python3.8 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/.snakemake/scripts/tmppmwkuerv.relocate_run.py' died with <Signals.SIGTERM: 15>.
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/snakemake/3427427/Snakefile", line 525, in __rule_relocaTE_run
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Any help is appreciated, and if there are any additional outputs or files you need let me know. Thanks!
Lucas Hemmer
Thanks for the issue report. @pbasting and I will discuss and see what we can do to help resolve this.
- @lhemmer can you please post the contents of the RelocaTE- and RelocaTE2-specific log files? Based on the log provided, the RelocaTE2 log should be at
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/logs/20200720.120349.7542720/relocaTE2.log
- This should help me determine at what step the stalling is occurring. Thanks!
Of course. While the job ran for 48 hours, this was the only information written to the log after running for about an hour.
[bwa_index] Pack FASTA... 1.85 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=311996502, availableWord=33953140
[BWTIncConstructFromPacked] 10 iterations done. 56007462 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 103469750 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 145650374 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 183136566 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 216450358 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 246055718 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 272365014 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 295744758 characters processed.
[bwa_index] 83.03 seconds elapse.
[bwa_index] Update BWT... 0.69 sec
[bwa_index] Pack forward-only FASTA... 1.44 sec
[bwa_index] Construct SA from BWT and Occ... 30.35 sec
[bwt_gen] Finished constructing BWT in 88 iterations.
[main] Version: 0.6.2-r126
[main] CMD: /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/bin/bwa index /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/reference.fasta
[main] Real time: 118.454 sec; CPU: 117.373 sec
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/fastq fastq
Reference need to be indexed by bwa: /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/reference.fasta
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_2/0.fq2fa.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_2/1.fq2fa.sh
- Ok, thanks. This tells me that for RelocaTE2, the job is freezing during the RelocaTE2 execution itself.
- Also, can you tell me how much memory you are allocating for the job running McClintock? One thing I want to rule out before digging into the specifics of the script is the possibility that RelocaTE2 has run out of physical memory, forcing it to use swap, which would slow everything to a crawl like what you are seeing.
- If you think you are using sufficient memory, can you try executing these commands (if you still have them):
sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_2/0.fq2fa.sh
sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_2/1.fq2fa.sh
- These are the last steps reported in the log. RelocaTE2 uses the python multiprocessing package to call scripts that it generates, which I've noticed does a poor job of passing error messages to the main script's stderr when a job fails, and sometimes a failed multiprocessing job will just persist forever even though its task has failed. This could cause the issues you are describing.
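The failure mode described here can be sketched with a toy example (a ThreadPool is used just to keep the demo self-contained; RelocaTE2 itself uses process-based multiprocessing, and `run_all` is a hypothetical helper, not RelocaTE2 code): unless the parent explicitly calls `.get()` on each async result, a worker's CalledProcessError never reaches the parent's stderr.

```python
import subprocess
from multiprocessing.pool import ThreadPool

def run_script(cmd):
    # Raises CalledProcessError on a non-zero exit, but only inside the pool.
    subprocess.run(cmd, check=True)
    return cmd

def run_all(cmds, workers=2):
    """Run commands concurrently; return (succeeded, failed) command lists."""
    ok, failed = [], []
    with ThreadPool(workers) as pool:
        pending = [(c, pool.apply_async(run_script, (c,))) for c in cmds]
        for cmd, res in pending:
            try:
                res.get()  # without this, the failure is silently swallowed
                ok.append(cmd)
            except subprocess.CalledProcessError:
                failed.append(cmd)
    return ok, failed
```

If the parent never collects the results (or waits forever on one that died), you get exactly the symptom above: a pipeline that appears to hang with nothing in the logs.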
I have allocated 120 GB of memory for the job submission; I learned while using the earlier McClintock version not to be shy with memory. Should I execute the commands in a job submission, following the same steps as for starting the entire McClintock pipeline?
CONDA_BASE=$(conda info --base)
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda activate mcclintock
sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_2/0.fq2fa.sh
You shouldn't have to activate any conda environments to run them, as the scripts use absolute paths for everything. They are one-line scripts that call seqtk to convert your fastq files to fasta files.
I've run the two commands, but I'm unsure what I'm supposed to be looking for.
- Did these commands execute completely without error, or did they hang?
I didn't notice any errors or a hang. As far as I can tell, these commands are meant to convert the fastq files to fasta, and that was done successfully.
- Ok, so that likely isn't the step RelocaTE2 was stuck on if you could run it without issue.
- Can you run
ls -lrth /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/
and post the results?
- This will let me know which step RelocaTE2 was running last, as it seems that step_2 ran without issue.
Sure thing. Here's the output I got:
-rw-rw----+ 1 lhemmer lhemmer 1 Jul 20 15:26 step_0_te_annotation_provided
drwxrwx---+ 2 lhemmer lhemmer 4.0K Jul 20 15:26 step_2
-rw-rw----+ 1 lhemmer lhemmer 1 Jul 20 15:26 step_2_not_needed_fq_already_converted_2_fa
drwxrwx---+ 2 lhemmer lhemmer 4.0K Jul 20 15:27 step_3
- Ok, so RelocaTE2 appears to have stalled on step_3, which runs BLAT to map the reads to the consensus TE fasta file and then uses these mapping results to extract the TE-flanking reads.
- can you post the contents of the following two files:
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_3/0.te_repeat.blat.sh
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/blat.out
- can you run and post the results of:
ls -lrth /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered//repeat/flanking_seq/
- This should let me know whether the BLAT step or the relocaTE_trim.py step failed.
Sure thing. Here's the content of 0.te_repeat.blat.sh
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/bin/blat -minScore=10 -tileSize=7 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/consensus.fasta /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/fastq/48_1.fa /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/48_1.te_repeat.blatout 1>>/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/blat.out 2>>/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/blat.out
python /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/bin/relocaTE_trim.py /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/48_1.te_repeat.blatout /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/fastq/48_1.fq 10 10 2 > /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/flanking_seq/48_1.te_repeat.flankingReads.fq
There's a 1.te_repeat.blat.sh file as well with the following contents
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/bin/blat -minScore=10 -tileSize=7 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/consensus.fasta /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/fastq/48_2.fa /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/48_2.te_repeat.blatout 1>>/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/blat.out 2>>/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/blat.out
python /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/bin/relocaTE_trim.py /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/48_2.te_repeat.blatout /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/input/fastq/48_2.fq 10 10 2 > /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/flanking_seq/48_2.te_repeat.flankingReads.fq
The blat.out file was empty. The results of the ls -lrth command were also empty:
total 0
- Thanks @lhemmer, it looks like the BLAT step is where the script stalled, then.
- You should be able to re-run these steps outside of the RelocaTE2 framework and see if they complete, with the following commands:
CONDA_BASE=$(conda info --base)
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda activate /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/
sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_3/0.te_repeat.blat.sh &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/step3_1.oe
sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/shellscripts/step_3/1.te_repeat.blat.sh &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/step3_2.oe
- You'll probably want to run this as a job submission, as I'm not sure how long it will take with your data.
- If the script completes, you can check
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/step3_1.oe
and /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/step3_2.oe
to see if any errors were thrown.
- If this works, then it implies something is going wrong with how RelocaTE2 is executing these scripts.
- If it throws an error, then that will point to why RelocaTE2 couldn't progress past that step.
- If it runs successfully, the BLAT steps should produce the following files:
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/48_1.te_repeat.blatout
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/blat_output/48_2.te_repeat.blatout
- The relocaTE_trim.py step should produce the following files:
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/flanking_seq/48_1.te_repeat.flankingReads.fq
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate/results/relocaTE2/unfiltered/repeat/flanking_seq/48_2.te_repeat.flankingReads.fq
Just an update: there were 48_1.te_repeat.blatout and 48_2.te_repeat.blatout files produced in the original run, but the blat.out file was originally empty. I'm currently running the commands you told me to run, but it is taking a while. I started the run last night; the new 48_1.te_repeat.blatout file is 1.3 GB while the old one was 4.4 GB, and there is no new 48_2.te_repeat.blatout yet. It might be taking a while because I don't think it's using multiple threads.
- There isn't a 48_2.te_repeat.blatout yet because the second script won't execute until the first is finished. It's also possible the finished blatout file will be larger than 4.4 GB, since that is only where it was when your job got killed.
- BLAT is not capable of multithreading, so this step will be slow with large fastq files. RelocaTE2 tries to speed this up a bit by running the *_1.fastq BLAT call and the *_2.fastq BLAT call concurrently using the python multiprocessing package, but this step won't run any faster with more than 2 threads. This is why both blatout files existed when you originally ran McClintock.
- It's possible that a default RelocaTE2 run just takes more than 48 hours to complete with your dataset, due to this bottleneck of using BLAT as its aligner. It may not have been frozen, just taking a very long time to align the reads. If you try to run RelocaTE2 again, you could confirm this by checking that the blatout file is still growing over time.
- RelocaTE2 does have the option to use bwa or bowtie2 instead of BLAT, both of which can take advantage of multiple processors.
- You can change which aligner RelocaTE2 uses by modifying the associated config file:
${path_to_mcclintock_install}/config/relocate2/relocate2_run.py
and changing 'aligner' : "blat" to 'aligner' : "bwa". I personally haven't tested this out yet, though, so I'm not sure whether this change works.
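As suggested a few bullets up, one way to tell a stalled BLAT run from a merely slow one is to sample the blatout file's size over time. A minimal sketch (the interval and the helper names are examples, not part of the pipeline):

```python
import os
import time

def sample_sizes(path, interval=600, checks=3):
    """Record the size of `path` `checks` times, `interval` seconds apart."""
    sizes = []
    for i in range(checks):
        sizes.append(os.path.getsize(path))
        if i < checks - 1:
            time.sleep(interval)
    return sizes

def is_growing(sizes):
    # A larger final sample means BLAT is still writing output,
    # i.e. the run is slow rather than frozen.
    return sizes[-1] > sizes[0]
```

For example, `is_growing(sample_sizes("48_1.te_repeat.blatout"))` samples the file three times, ten minutes apart, and reports whether it grew.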
Hello again.
So there is a change when you substitute bwa for blat in relocate2_run.py: it produces nearly empty 48_1.te_repeat.bam and 48_2.te_repeat.bam files, and the pipeline moves on. The remaining steps throw a lot of subsequent errors, which is understandable given that the empty blat output is now an empty bam file. I'm not sure if there is a fix for that, but it appears it's not necessarily an error, just that the pipeline is handcuffed by blat. Everything besides RelocaTE and RelocaTE2 works quite well!
Thanks for all of your help!
- I've been able to run RelocaTE2 with the test data using bowtie2 by editing the config file and changing
'aligner' : "blat" to 'aligner' : "bowtie2"
. Might be worth a shot, as Bowtie2 will utilize multiple threads and it doesn't appear that the bwa option is working properly. I'd try this with a clean run, removing all the previous RelocaTE2 intermediate files from the unfiltered directory, just to ensure there aren't any clashes.
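For reference, the edit described above amounts to changing a single dictionary entry. A hypothetical sketch of what the relevant part of relocate2_run.py might look like (only the 'aligner' key and its values come from this thread; the surrounding structure is illustrative and may not match the real file):

```python
# Hypothetical sketch of the aligner setting in
# config/relocate2/relocate2_run.py; the real file's structure may differ.
PARAMS = {
    # "blat" is the default; "bowtie2" can use multiple threads.
    'aligner': "bowtie2",
}
```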
So running with bowtie2 seems to bypass some errors, but it then runs into more. At least it's much faster than blat!
This was the error from the general log output:
RELOCATE2 RelocaTE2 run complete
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
RELOCATE2 processing RelocaTE2 results
Traceback (most recent call last):
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/snakemake/1705891/.snakemake/scripts/tmpn2azzuzr.relocate2_post.py", line 122, in <module>
main()
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/snakemake/1705891/.snakemake/scripts/tmpn2azzuzr.relocate2_post.py", line 51, in main
mccutils.make_nonredundant_bed(all_insertions, sample_name, out_dir, method="relocate2")
File "/scratch/alarracu_lab/Lucas/mcclintock/scripts/mccutils.py", line 382, in make_nonredundant_bed
if (uniq_inserts[key].left_support + uniq_inserts[key].right_support) < (insert.left_support + insert.right_support):
AttributeError: 'Insertion' object has no attribute 'left_support'
[Fri Jul 24 22:39:35 2020]
Error in rule relocaTE2_post:
jobid: 1
output: /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/48_relocate2_redundant.bed, /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/48_relocate2_nonredundant.bed
conda-env: /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2
The relocaTE2 log file is around 30 MB so it ran for quite a ways! There was a listing of some of the TE insertions detected followed by this line (I'll add lines before and after to help):
Y_scaffold4 RelocaTE2 not_given 2743407 2743409 . - . ID=repeat_Y_scaffold4_2743407_2743409;Name=Copia_LTR;TSD=insufficient_data;Note=Non-reference, not found in reference;Right_junction_reads=2;Left_junction_reads=0;Right_support_reads=0;Left_support_reads=0; Y_scaffold4 2742716 2744198 Gypsy_1_DSim_I:2742716-2744198 0 +
Y_scaffold6 RelocaTE2 not_given 338195 338197 . + . ID=repeat_Y_scaffold6_338195_338197;Name=PROTOP/ZAM_I;TSD=supporting_junction;Note=Non-reference, not found in reference;Right_junction_reads=1;Left_junction_reads=0;Right_support_reads=0;Left_support_reads=2; Y_scaffold6 338145 338426 ZAM_I:338145-338426 0 -
Y_scaffold6 RelocaTE2 not_given 343953 343955 . - . ID=repeat_Y_scaffold6_343953_343955;Name=ROO_LTR;TSD=insufficient_data;Note=Non-reference, not found in reference;Right_junction_reads=0;Left_junction_reads=4;Right_support_reads=0;Left_support_reads=0; Y_scaffold6 343803 344103 STALKER4_I:343803-344103 0 -
Y_scaffold6 RelocaTE2 not_given 557277 557279 . - . ID=repeat_Y_scaffold6_557277_557279;Name=Copia_LTR/STALKER4_I;TSD=supporting_junction;Note=Non-reference, not found in reference;Right_junction_reads=2;Left_junction_reads=0;Right_support_reads=0;Left_support_reads=1; Y_scaffold6 556904 558935 STALKER4_I:556904-558935 0 +
remove by bedtool
rm: cannot remove ‘/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/repeat/fastq_split’: No such file or directory
/step_6/166.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/167.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/168.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/169.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/170.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/171.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/172.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/173.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/174.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/175.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/176.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/177.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/178.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/179.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/180.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/181.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/182.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/183.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/184.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/185.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/186.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/187.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/188.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_6/189.repeat.absence.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/shellscripts/step_7/0.repeat.characterize.sh
job: sh /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/clean_intermediate_files.sh
python2 /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/32bdc8a2/bin/relocaTE2.py -t /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered//input/consensus.fasta -g /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered//input/reference.fasta -r /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered//input/repeatmasker.out -o /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/ -s 252 --run -v 4 -c 11 --aligner bowtie2 --len_cut_match 10 --len_cut_trim 10 --mismatch 2 --mismatch_junction 2 -d /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered//input/fastq/ -1 _1 -2 _2
So it looks like it ran quite a ways before encountering a problem, maybe in some kind of post-processing. If you need more information, I'll be glad to post it.
- Nice, it looks like RelocaTE2 completed now with the switch to bowtie2. The error you posted is actually a bug in McClintock's post-processing of the RelocaTE2 results (removing redundant predictions) and should be an easy fix. I'll let you know when I make this update.
- For now, you should be able to see the raw unfiltered results of RelocaTE2 in the files:
- Non-reference:
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/repeat/results/ALL.all_nonref_insert.gff
- reference:
/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/test_48/48_relocate2/results/relocaTE2/unfiltered/repeat/results/ALL.all_ref_insert.gff
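The AttributeError quoted earlier (an Insertion record with no left_support attribute) can be avoided with a defensive comparison. A minimal sketch with hypothetical stand-in names, not the actual mccutils.py fix:

```python
class Insertion:
    # Hypothetical stand-in for the McClintock Insertion record; the real
    # class has many more fields, and some records never get support counts.
    def __init__(self, left_support=None, right_support=None):
        if left_support is not None:
            self.left_support = left_support
        if right_support is not None:
            self.right_support = right_support

def total_support(insert):
    # getattr with a default tolerates records missing either attribute,
    # which is what triggered the AttributeError in make_nonredundant_bed.
    return getattr(insert, "left_support", 0) + getattr(insert, "right_support", 0)

def keep_best(existing, candidate):
    """Keep whichever overlapping prediction has more supporting reads."""
    return candidate if total_support(candidate) > total_support(existing) else existing
```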
Awesome! We are getting somewhere. It looks like the ALL.all_nonref_insert.gff and ALL.all_ref_insert.gff files are not there, but you can find the gff files for all of the contigs, both reference and non-reference. From a quick look around, there were a lot of lines in the output GFFs resembling the one below, where no repeat family is specified. Is this commonplace? Example below:
3L_1 RelocaTE2 not_given 40375 40466 . + . ID=repeat_3L_1_40375_40466;Name=repeat_name;TSD=supporting_reads;Note=Non-reference, not found in reference;Right_junction_reads=0;Left_junction_reads=0;Right_support_reads=2;Left_support_reads=6;
- Those predictions with no family assigned in the Name= attribute are common in the intermediate files, but in the runs I've performed they are filtered out of the final ALL.all_nonref_insert.gff by RelocaTE2.
- In the test data, these always have zero junction reads on the right and left, which means RelocaTE2 would consider them low-quality candidate insertions and filter them out.
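The filtering described above can be sketched as a small attribute check on GFF lines like the examples earlier in the thread (a hedged illustration of the idea, not RelocaTE2's actual filter code):

```python
import re

def junction_reads(attributes):
    """Sum Left/Right_junction_reads counts from a GFF attribute string."""
    total = 0
    for key in ("Left_junction_reads", "Right_junction_reads"):
        m = re.search(key + r"=(\d+)", attributes)
        if m:
            total += int(m.group(1))
    return total

def keep_record(gff_line):
    # Keep only records with at least one junction read on either side;
    # zero-junction candidates are the low-quality ones described above.
    fields = gff_line.rstrip("\n").split("\t")
    return len(fields) >= 9 and junction_reads(fields[8]) > 0
```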
I'll pass these notes to @JinfengChen regarding the issues with blat. I have also been trying bwa-mem2 for some increased speed; I just need to see whether the sensitivity difference between bwa and BLAT at finding reads that contain the TE portion still gives the same results.
@lhemmer I've updated the master branch so the current scripts shouldn't produce the error described in #61 (comment) anymore. Please let me know if you are still having problems running RelocaTE2 with the aligner set to Bowtie2.
Thanks! I finally have the new version installed, and I don't think there were any errors upon installation. But now I'm getting a new error immediately when I run the pipeline on the cluster. I'm not sure what this one is about:
SETUP checking fq1: /scratch/alarracu_lab/Lucas/mcclintock_august/mcclintock/900_1.fastq
SETUP checking fq2: /scratch/alarracu_lab/Lucas/mcclintock_august/mcclintock/900_2.fastq
SETUP checking consensus fasta: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/specieslib_mcClintock_2020_mod.fasta
SETUP checking locations gff: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/dmel.chromosomes.fa.TE.mcClintock.2020.gff
SETUP checking taxonomy TSV: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/dmel.chromosomes.fa.TE.mcClintock.2020.tsv
Traceback (most recent call last):
File "/scratch/alarracu_lab/Lucas/mcclintock_august/mcclintock/mcclintock.py", line 487, in <module>
main()
File "/scratch/alarracu_lab/Lucas/mcclintock_august/mcclintock/mcclintock.py", line 35, in main
run_id = make_run_config(args, sample_name, ref_name, full_command, current_directory)
File "/scratch/alarracu_lab/Lucas/mcclintock_august/mcclintock/mcclintock.py", line 330, in make_run_config
mccutils.run_command_stdout(["git","rev-parse","HEAD"], git_commit_file)
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august/mcclintock/scripts/mccutils.py", line 120, in run_command_stdout
subprocess.check_call(cmd_list, stdout=out)
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock_august/lib/python3.8/subprocess.py", line 359, in check_call
retcode = call(*popenargs, **kwargs)
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock_august/lib/python3.8/subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock_august/lib/python3.8/subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock_august/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'git'
Oh I titled the conda environment "mcclintock_august" to keep it distinct from the previous installation. Maybe that is where I went wrong
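The FileNotFoundError in the traceback above just means the git executable isn't on PATH in the environment where mcclintock.py runs (cluster compute nodes and minimal conda environments often lack it; installing git into the environment or loading a git module would likely fix it). A hedged sketch of a guard, with a hypothetical helper name rather than the actual McClintock code:

```python
import shutil
import subprocess

def git_commit_or_unknown():
    # shutil.which avoids the FileNotFoundError raised when 'git' is
    # missing from PATH entirely.
    if shutil.which("git") is None:
        return "unknown"
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except subprocess.CalledProcessError:
        # e.g. git exists but we're not inside a git repository
        return "unknown"
```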
And I think I was getting a similar error running popoolationTE to one that someone else posted.
mkdir: cannot create directory ‘output/321_all’: File exists
Job counts:
count jobs
1 map_reads
1 median_insert_size
1 ngs_te_mapper_post
1 ngs_te_mapper_run
1 popoolationTE2_post
1 popoolationTE2_preprocessing
1 popoolationTE2_run
1 popoolationTE_post
1 popoolationTE_preprocessing
1 popoolationTE_run
1 process_temp
1 retroseq_post
1 retroseq_run
1 run_temp
1 sam_to_bam
1 setup_reads
1 summary_report
1 telocate_post
1 telocate_run
1 telocate_sam
20
PROCESSING prepping reads for McClintock
PROCESSING running trim_galore &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/logs/20200813.090957.4185990/trimgalore.log
PROCESSING read setup complete
bwa bwasw -t 9 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/method_input/genome_fasta/dmel_scaffold2_plus0310_2.masked.popoolationTE.fasta /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/method_input/fastq/321_1.fq /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/method_input/fastq/321_2.fq
POPOOLATIONTE2 setting up for PopoolationTE2
POPOOLATIONTE2 indexing reference fasta &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/logs/20200813.090957.4185990/popoolationTE2.log
POPOOLATIONTE2 mapping reads &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/logs/20200813.090957.4185990/popoolationTE2.log
[Thu Aug 13 09:57:58 2020]
Error in rule popoolationTE2_preprocessing:
jobid: 27
output: /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/popoolationTE2/unfiltered/sorted.bam
conda-env: /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/62d1374f
RuleException:
CalledProcessError in line 833 of /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/snakemake/4185990/Snakefile:
Command 'source /scratch/lhemmer/programs/miniconda3/bin/activate '/scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/62d1374f'; set -euo pipefail; /scratch/lhemmer/programs/miniconda3/envs/mcclintock/bin/python3.8 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/snakemake/4185990/.snakemake/scripts/tmp2a1r1u5e.popoolationte2_pre.py' returned non-zero exit status 1.
File "/gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/snakemake/4185990/Snakefile", line 833, in __rule_popoolationTE2_preprocessing
File "/scratch/lhemmer/programs/miniconda3/envs/mcclintock/lib/python3.8/concurrent/futures/thread.py", line 57, in run
PROCESSING mapping reads to reference &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/logs/20200813.090957.4185990/bwa.log
PROCESSING read mapping complete
Exiting because a job execution failed. Look above for error message
snakemake --use-conda --conda-prefix /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda --quiet --configfile /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/snakemake/config/config_4185990.json --cores 20 /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/method_input/fastq/321_1.fq /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/TEMP/321_temp_nonredundant.bed /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/ngs_te_mapper/321_ngs_te_mapper_nonredundant.bed /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/retroseq/321_retroseq_nonredundant.bed /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/popoolationTE/321_popoolationte_nonredundant.bed /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/popoolationTE2/321_popoolationte2_nonredundant.bed /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/te-locate/321_telocate_nonredundant.bed /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/321_all/results/summary/summary_report.txt
SETUP checking reference fasta: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/dmel_scaffold2_plus0310_2.fasta
SETUP checking fq1: /scratch/alarracu_lab/Lucas/mcclintock/321_1.fastq
SETUP checking fq2: /scratch/alarracu_lab/Lucas/mcclintock/321_2.fastq
SETUP checking consensus fasta: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/specieslib_mcClintock_2020_mod.fasta
SETUP checking locations gff: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/dmel.chromosomes.fa.TE.mcClintock.2020.gff
SETUP checking taxonomy TSV: /scratch/alarracu_lab/centromere_population_genomics/mcc_run_dir/reference/dmel.chromosomes.fa.TE.mcClintock.2020.tsv
Finally, sometimes there is a strange error where the tmp.bed file produced by popoolationTE has altered coordinates for some TEs, with the start larger than the end for some annotations. I double-checked whether the coordinates are mixed up in my own bed file and it doesn't appear to be the case. And it seems almost random which annotations end up with the swapped coordinates across different samples.
count jobs
1 map_reads
1 median_insert_size
1 ngs_te_mapper_post
1 ngs_te_mapper_run
1 popoolationTE2_post
1 popoolationTE2_preprocessing
1 popoolationTE2_run
1 popoolationTE_post
1 popoolationTE_preprocessing
1 popoolationTE_run
1 process_temp
1 retroseq_post
1 retroseq_run
1 run_temp
1 sam_to_bam
1 setup_reads
1 summary_report
1 telocate_post
1 telocate_run
1 telocate_sam
20
PROCESSING prepping reads for McClintock
PROCESSING running trim_galore &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/logs/20200813.122940.6958755/trimgalore.log
PROCESSING read setup complete
POPOOLATIONTE running PopoolationTE preprocessing steps
POPOOLATIONTE formatting read names
POPOOLATIONTE indexing popoolationTE reference fasta &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/logs/20200813.122940.6958755/popoolationTE.log
POPOOLATIONTE mapping fastq1 reads &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/logs/20200813.122940.6958755/popoolationTE.log
POPOOLATIONTE mapping fastq2 reads &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/logs/20200813.122940.6958755/popoolationTE.log
POPOOLATIONTE combining alignments &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/logs/20200813.122940.6958755/popoolationTE.log
POPOOLATIONTE sorting sam file &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/logs/20200813.122940.6958755/popoolationTE.log
POPOOLATIONTE PopoolationTE preprocessing complete
POPOOLATIONTE running PopoolationTE
POPOOLATIONTE getting read length
POPOOLATIONTE calculating median insert size
POPOOLATIONTE converting TE gff to PoPoolationTE known TE file
POPOOLATIONTE running the PoPoolationTE workflow scripts
POPOOLATIONTE identify-te-insertsites.pl
POPOOLATIONTE genomic-N-2gtf.pl
POPOOLATIONTE crosslink-te-sites.pl
POPOOLATIONTE update-teinserts-with-knowntes.pl
POPOOLATIONTE estimate-polymorphism.pl
POPOOLATIONTE filter-teinserts.pl
Error: malformed BED entry at line 6737. Start was greater than end. Exiting.
bedtools sort -i /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/results/popoolationTE//tmp.bed
POPOOLATIONTE processing PopoolationTE results
[Thu Aug 13 14:34:11 2020]
Error in rule popoolationTE_post:
jobid: 0
output: /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock/output/348_all/results/popoolationTE/348_popoolationte_nonredundant.bed
conda-env: /scratch/alarracu_lab/Lucas/mcclintock/install/envs/conda/70398645
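When `bedtools sort` rejects a file like this, it can help to pinpoint the offending entries directly before re-running. A minimal sketch (the helper name is hypothetical, not part of McClintock; point it at the tmp.bed that bedtools rejected):

```python
# Hypothetical helper (not part of McClintock): report BED entries whose
# start coordinate is greater than the end, which is what makes
# `bedtools sort` reject the file.
def find_malformed_bed(path):
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # header/track lines lack coordinate columns
            try:
                start, end = int(fields[1]), int(fields[2])
            except ValueError:
                continue  # skip non-numeric columns, e.g. browser lines
            if start > end:
                bad.append((lineno, fields[0], start, end))
    return bad
```

The returned line numbers can then be compared against the entry bedtools reports (line 6737 here) to see which TE annotations are affected.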
- @lhemmer thanks for letting me know about these issues, I'll do my best to address them.
- #61 (comment): that is because I added a step that gets the commit version of the mcclintock repository using `git rev-parse HEAD`. It looks like this step is failing for you. Perhaps you don't have git installed on the cluster you are running the pipeline on? I can probably add git to the `mcclintock` base environment so this issue won't occur again.
- #61 (comment): the name of the base environment shouldn't have any impact, so you can name it whatever you like. As for the popoolationTE2 issue, I can't tell what is going on from this error other than that it failed during popoolationTE2 preprocessing. The `popoolationTE2.log` should be more informative.
- I'll take a look at #61 (comment), but this may be an error caused by popoolationTE or by my processing of its output. To determine what is going on, can you post the raw popoolationTE output file `results/popoolationTE/unfiltered/te-poly-filtered.txt` from a run where the error occurs? That way I can check whether the coordinate error appears there too; if not, it could be a bug in my post-processing. The start and stop positions extracted from the popoolationTE output file change depending on the type of support, so it's possible there is some faulty logic in how I am processing it.
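The missing-git failure above can be guarded against by checking the PATH before shelling out. A sketch of that defensive check (`get_git_commit` and its `None` fallback are assumptions here, not McClintock's actual implementation):

```python
import shutil
import subprocess

# Sketch only: the function name and None fallback are assumptions,
# not McClintock's actual code.
def get_git_commit(repo_dir):
    """Return the repository's HEAD commit, or None if git is unavailable."""
    if shutil.which("git") is None:
        return None  # git not on PATH, e.g. not installed on the cluster
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None
```

With a guard like this, a cluster without git would record no commit hash instead of crashing the whole run with a FileNotFoundError.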
Sure thing. For the first comment, I did include git in the base environment when running, and so far it's been running smoothly, except that my sample halted popoolationTE because of the same error where the bed file start position was greater than the end. I'm pretty sure this comes from some processing done by popoolationTE; I've seen it occasionally when running past iterations of McClintock. I'll attach it here.
For this example,
Error: malformed BED entry at line 7254. Start was greater than end. Exiting.
And here is the popoolationTE2 log for the previous comment
popoolationTE2.log
- The popoolationTE2.log shows an error message `[vfprintf(stdout)] Disk quota exceeded`, which usually means that you have run out of the allotted storage on the system you are running on. You probably need to remove some files from the system to free up storage for the run to complete.
- The position error is present in the popoolationTE file you posted. To work around this, I am going to add a step that separates malformed entries into a different file, so that the pipeline can complete even if this error occurs. I'll let you know when these changes are added to the master branch.
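One quick way to confirm a storage problem is to check free space on the filesystem holding the output directory (a sketch; note that per-user cluster quotas can be stricter than raw free space, so tools like `quota -s` are more authoritative where available):

```python
import shutil

def free_gb(path="."):
    """Return free space in GB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9
```

If this reports ample free space but writes still fail with "Disk quota exceeded", a per-user or per-group quota is the likely culprit.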
Whoops, I should have caught the Disk quota exceeded line. However, the changing-coordinates error in popoolationTE confuses me. I'm not sure why it occurs, since from what I've seen the entries with the start position greater than the end happen to be reference sequences. Thanks for the help!
I updated the pipeline (3b38f98) so that it should catch any predictions that are malformed, with a start greater than the end position, and place them in a separate `results/*/*_malformed.bed` file. This should allow runs to complete even if a component method produces a malformed prediction. I haven't been able to replicate the issue with the datasets I've used so far, but I'll keep an eye out and see if I can figure out whether there is a pattern to it. Let me know if you are still having problems.
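The workaround described can be sketched roughly like this (an illustrative sketch: file names and the pass-through of non-BED lines are assumptions, not the actual McClintock code):

```python
# Illustrative sketch of the workaround: split a BED file into well-formed
# entries and malformed ones (start > end). File names are placeholders,
# not McClintock's actual paths.
def split_malformed(bed_in, bed_ok, bed_bad):
    with open(bed_in) as fin, open(bed_ok, "w") as ok, open(bed_bad, "w") as bad:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            try:
                start, end = int(fields[1]), int(fields[2])
            except (IndexError, ValueError):
                ok.write(line)  # pass header/track lines through unchanged
                continue
            (bad if start > end else ok).write(line)
```

Downstream steps such as `bedtools sort` then only ever see the well-formed file, while the malformed entries remain available for inspection.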
Hello, the new update has been working great at bypassing the malformed bed entries. I am now having issues specifically with popoolationTE2 for certain samples. In the general output I get this error:
java -jar /scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/install/tools/popoolationte2/popte2-v1.10.03.jar ppileup --bam /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered/sorted.bam --hier /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered//input.taxonomy.txt --map-qual 15 --sr-mindist 10000 --id-up-quant 0.01 --output /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered//output.ppileup.gz
POPOOLATIONTE2 running PopoolationTE2
POPOOLATIONTE2 making physical pileup file &> /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/logs/20200903.095759.4494532/popoolationTE2.log
[Thu Sep 3 10:07:27 2020]
Error in rule popoolationTE2_run:
jobid: 3
output: /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered/teinsertions.txt
conda-env: /scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/install/envs/conda/a2b02106
So when I take a look at the popoolationTE2 log, I get this error at the very end (the file is a little too big to post here):
Exception in thread "main" java.lang.IllegalStateException: Can not start writing at 2168 ;construction done until 2170
at corete.data.ppileup.PpileupBuilder.addFromTo(PpileupBuilder.java:286)
at corete.data.ppileup.PpileupBuilder.addRead(PpileupBuilder.java:185)
at corete.data.ppileup.PpileupMultipopBuilder.spool2Position(PpileupMultipopBuilder.java:118)
at corete.data.ppileup.PpileupMultipopBuilder.buildPpileup(PpileupMultipopBuilder.java:79)
at pt2.ppileup.PpileupFramework.run(PpileupFramework.java:95)
at pt2.ppileup.PpileupParser.parseCommandline(PpileupParser.java:102)
at pt2.Main.main(Main.java:39)
java -jar /scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/install/tools/popoolationte2/popte2-v1.10.03.jar ppileup --bam /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered/sorted.bam --hier /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered//input.taxonomy.txt --map-qual 15 --sr-mindist 10000 --id-up-quant 0.01 --output /gpfs/fs2/scratch/alarracu_lab/Lucas/mcclintock_august2/mcclintock/output/32_pop2/results/popoolationTE2/unfiltered//output.ppileup.gz
Once again, it's only select samples that produce this error and it only happens with popoolationTE2; the rest of the programs work fine. I'm not sure what to make of this.
- This error is occurring within popoolationTE2 when creating the ppileup file. I can't tell from the error why this is happening though and I've never had it happen with datasets I've run.
- Is it random? Or if you run popoolationTE2 again with the same dataset that failed, do you get the same error?
The same error takes place with the same sample. I double-checked to see if there was something corrupted in the fastq file, but it doesn't appear to be the case. The error is sample-specific, i.e. `Can not start writing at 2168 ;construction done until 2170` is consistent for the 32 sample listed, and there are two different values for the other samples that also trip this error. Right now only three samples out of 150 do this, so it does not happen very often; it's just curious that it happens specifically while running PopoolationTE2 when every other program finishes successfully.
- Unfortunately I don't know enough about the popoolationTE2 source code to know what could cause this error, so I can't say if something in the mcclintock pre-processing is causing it or if something could be modified to prevent it.
- To resolve this, I'd suggest trying to run popoolationTE2 with your data outside of mcclintock. If the issue persists, we may have to ask the developers if they know what is going on: https://sourceforge.net/p/popoolation-te2/discussion/general/ though I am not sure if they are still maintaining it or not.
- popoolationTE2 dependencies can be installed using this yaml file: https://github.com/bergmanlab/mcclintock/blob/master/install/envs/mcc_popoolationte2.yml
- popoolationTE2 instructions on pre-processing: https://sourceforge.net/p/popoolation-te2/wiki/WalkthroughPreparatoryWork/
- popoolationTE2 instructions on how to run: https://sourceforge.net/p/popoolation-te2/wiki/Walkthrough/
- Alternatively, if we could work out a data transfer for a sample that is causing this error, I'd be happy to look into it myself. As it's happening with more than one of your samples, it's possible this problem will show up for others as well, so I'd like to get to the bottom of it, or at least implement some workaround or patch.
- If you are OK with sharing a dataset, we can work out the details via email
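For the standalone run, the failing ppileup step can be reproduced with the same flags McClintock used in the log above; the paths here are placeholders for your own sorted.bam, taxonomy file, and jar location:

```shell
# Re-run just the ppileup step outside McClintock, reusing the flags
# from the failing command in the log; replace the paths as needed.
java -jar popte2-v1.10.03.jar ppileup \
    --bam sorted.bam \
    --hier input.taxonomy.txt \
    --map-qual 15 \
    --sr-mindist 10000 \
    --id-up-quant 0.01 \
    --output output.ppileup.gz
```

If this reproduces the same IllegalStateException, the problem is internal to popoolationTE2 rather than McClintock's preprocessing.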
- @lhemmer: I contacted the developer of popoolationTE2 who said this looks like an issue with building the physical pileup of reads in popoolationTE2 that he has not seen before. He said to reach out to him directly to sort this out. When you think it is resolved please post here and we will update McClintock to use a newer version of popoolationTE2.
- I'm going to close this issue since it was initially related to Relocate/Relocate2 and the popoolationTE2 issue is external to McClintock.