`sim` segmentation error
jennieli421 opened this issue · 7 comments
The error I got:
/var/spool/slurmd/job10576850/slurm_script: line 34: 22333 Segmentation fault python UNCALLED/scripts/uncalled sim $bwa_prefix $path_ctl_fast5s --ctl-seqsum $path_ctl_seqsum --unc-seqsum $path_unc_seqsum --unc-paf $path_unc_paf -t 16 --enrich -c 3 --sim-speed 0.25 > uncalled_out.paf 2> uncalled_err.txt
In `uncalled_err.txt`:
Loading UNCALLED PAF............
================================
Procesing run...................
Generating pattern..............
================================
Loading control PAF.............
==========
Procesing run...................
Ordering reads..................
================================
I think the problem is that I provided a txt file storing all the paths to actual fast5 files, and not a directory of fast5 files.
Below is the script I ran:
bwa_prefix="sim_data/viral_genome_ref"
ref_genome="sim_data/viral.1.1.genomic.fna.gz"
path_ctl_fast5s="sim_data/NA12878-DirectRNA_subset.files.txt"
path_ctl_seqsum="sim_data/NA12878-DirectRNA_subset_Guppy_4.2.2_sequencing_summary.txt"
path_unc_seqsum="sim_data/20191220_GM12878_seqsum.txt"
path_unc_paf="sim_data/20191220_GM12878_uncalled.paf"
python UNCALLED/scripts/uncalled sim $bwa_prefix $path_ctl_fast5s --ctl-seqsum $path_ctl_seqsum --unc-seqsum $path_unc_seqsum --unc-paf $path_unc_paf -t 16 --enrich -c 3 --sim-speed 0.25 > uncalled_out.paf 2> uncalled_err.txt
What should I do if I want to pass a txt file?
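One way to rule out the path list itself as a cause is to check that every listed file exists and is readable. The following is only a minimal sketch: `check_fast5_list` is a hypothetical helper, not part of UNCALLED, and it assumes the txt file holds one fast5 path per line.

```shell
#!/usr/bin/env bash
# check_fast5_list is a hypothetical helper, not part of UNCALLED.
# Assumption: the list file contains one fast5 path per line.
check_fast5_list() {
    local list="$1" missing=0 path
    # The "|| [ -n ... ]" clause handles a last line without a trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue      # skip blank lines
        if [ ! -r "$path" ]; then
            echo "unreadable: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    # Print the number of missing or unreadable files on stdout.
    echo "$missing"
}
```

Calling `check_fast5_list "$path_ctl_fast5s"` prints the number of unreadable entries on stdout and names each one on stderr. A count of 0 only shows that the paths resolve; it says nothing about whether the files themselves are valid.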
I tried `sim` with a different control set, this time providing the directory of actual fast5 files, but still got the same error.
/var/spool/slurmd/job10576865/slurm_script: line 34: 22175 Segmentation fault python UNCALLED/scripts/uncalled sim $bwa_prefix $path_ctl_fast5s --ctl-seqsum $path_ctl_seqsum --unc-seqsum $path_unc_seqsum --unc-paf $path_unc_paf -t 16 --enrich -c 3 --sim-speed 0.25 > uncalled_out.paf 2> uncalled_err.txt
Below is the script I ran:
bwa_prefix="sim_data/viral_genome_ref"
ref_genome="sim_data/viral.1.1.genomic.fna.gz"
path_ctl_fast5s="/athena/tilgnerlab/scratch/caf4010/3_8_23_UNCALLED/Example_Input/fast5_pass"
path_ctl_seqsum="/athena/tilgnerlab/scratch/caf4010/3_8_23_UNCALLED/Example_Input/sequencing_summary_PAG69730_bbfa25c2.txt"
path_unc_seqsum="sim_data/20191220_GM12878_seqsum.txt"
path_unc_paf="sim_data/20191220_GM12878_uncalled.paf"
python UNCALLED/scripts/uncalled sim $bwa_prefix $path_ctl_fast5s --ctl-seqsum $path_ctl_seqsum --unc-seqsum $path_unc_seqsum --unc-paf $path_unc_paf -t 16 --enrich -c 3 --sim-speed 0.25 > uncalled_out.paf 2> uncalled_err.txt
This has come up before (#42), and unfortunately I wasn't able to reproduce the error at the time. Was a "core dump" file produced for the segfault? If so, can you share it with me? They are not always produced by default, but you should be able to configure your machine to generate one.
Unfortunately this is a hard problem to debug, since the simulator will produce different results depending on how fast your computer can map reads, so it's fundamentally non-deterministic. Plus it only seems to pop up in large datasets, and running a debugger slows everything down and seems to prevent the error from occurring.
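For the core dump mentioned above, a common approach on Linux is to raise the shell's core-file size limit before launching the command. The following is a minimal sketch, not a tested recipe for this cluster: `enable_core_dumps` is a hypothetical helper name, and whether a core file actually appears also depends on the system's hard limit and core pattern, which may be controlled by the administrator.

```shell
#!/usr/bin/env bash
# enable_core_dumps is a hypothetical helper name used for illustration.
enable_core_dumps() {
    # Raise the soft core-file limit to the hard limit, which is the
    # highest value an unprivileged user is allowed to set.
    ulimit -S -c "$(ulimit -H -c)"
    echo "core file size limit: $(ulimit -S -c)" >&2
    # On Linux, this file describes where the kernel writes core files.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core pattern: $(cat /proc/sys/kernel/core_pattern)" >&2
    fi
}

enable_core_dumps
# Run the failing `uncalled sim` command after this point, in the same script,
# so that it inherits the raised limit.
```

If the reported limit is still 0, the hard limit is 0 and a core file will not be written without a change by the system administrator. Under Slurm, the call presumably needs to sit inside the batch script itself, since limits set in a login shell are not guaranteed to reach the job.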
Could you provide the control fast5 and control sequencing summary files that have run successfully in simulations before? I will test them on my end to rule out my input files as the cause.
Yes, the sequencing summary files for the two samples we've tested can be found here: https://labshare.cshl.edu/shares/schatzlab/www-data/UNCALLED/simulator_files/
And the raw signal files are here: https://www.ncbi.nlm.nih.gov/sra/SRX9270076[accn] https://www.ncbi.nlm.nih.gov/sra/SRX9568954[accn]
Thank you very much. Just to confirm, is the reference genome the `E.coli.fasta`? Also, is the sequencing summary for ctl and unc the same txt file?
For the Zymo mock microbial simulation in the paper it was actually a reference containing all bacteria, which we "depleted" to increase the yeast yield using the parameters `--deplete -C 10`. Here is the reference: zymo_bacteria.fa.gz (originally obtained from here)
And actually the control sequencing summary is different, sorry I forgot about that! Here it is:
zymo_control_sequencing_summary.txt.gz
There is a segmentation fault again after 2 hours.
`err.txt`: uncalled_err.txt
My script:
# MAIN SCRIPT
bwa_prefix="sim_data/E.coli"
ref_genome="sim_data/E.coli.fa"
path_ctl_fast5s="sim_data/20190809_zymo_control/fast5"
path_ctl_seqsum="sim_data/zymo_control_sequencing_summary.txt"
path_unc_seqsum="sim_data/20190809_zymo_seqsum.txt"
path_unc_paf="sim_data/20190809_zymo_uncalled.paf"
# python UNCALLED/scripts/uncalled index -o sim_data/E.coli sim_data/E.coli.fa
python UNCALLED/scripts/uncalled sim $bwa_prefix $path_ctl_fast5s \
--ctl-seqsum $path_ctl_seqsum \
--unc-seqsum $path_unc_seqsum \
--unc-paf $path_unc_paf \
-t 16 --enrich -c 3 --sim-speed 0.25 > uncalled_out.paf 2> uncalled_err.txt