CFSAN-Biostatistics/snp-pipeline

Error in SNP filtration step while running with the run option

parul-sharma opened this issue · 6 comments

Hello,
I tested the software on the Listeria genomes with the run option and it worked fine. I am now trying it on a set of 50 genomes, and each time the pipeline stops at the abnormal SNP filtration step with the following error:

Traceback (most recent call last):
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/bin/cfsan_snp_pipeline", line 10, in <module>
    sys.exit(main())
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/snppipeline/cfsan_snp_pipeline.py", line 645, in main
    return run_command_from_arg_list(sys.argv[1:])
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/snppipeline/cfsan_snp_pipeline.py", line 606, in run_command_from_arg_list
    return run_command_from_args(args)
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/snppipeline/cfsan_snp_pipeline.py", line 585, in run_command_from_args
    args.func(args)  # this executes the function previously associated with the subparser with set_defaults
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/snppipeline/filter_regions.py", line 200, in filter_regions
    filter_regions_across_samples(list_of_vcf_files, contig_length_dict, sorted_list_of_outgroup_samples, force_flag, edge_length, window_size_list, max_num_snps_list, ref_fasta_path, out_group_list_path)
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/snppipeline/filter_regions.py", line 277, in filter_regions_across_samples
    collect_dense_regions(vcf_reader, bad_regions_dict, contig_length_dict, edge_length, max_num_snps_list, window_size_list)
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/snppipeline/filter_regions.py", line 409, in collect_dense_regions
    for vcf_data_line in vcf_reader:
  File "/home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/vcf/parser.py", line 547, in next
    pos = int(row[1])
IndexError: list index out of range
Error detected while running cfsan_snp_pipeline filter_regions.

The command line was:
    cfsan_snp_pipeline filter_regions -n var.flt.vcf /work/cascades/parulsharma/Ralstonia_analysis//output/sampleDirectories.txt /work/cascades/parulsharma/Ralstonia_analysis//output/reference/MolK2.fna --edge_length 500 --window_size 1000 125 15 --max_snp 3 2 1 --verbose 1 --mode all

IndexError exception in function next at line 547 in file /home/parulsharma/miniconda3/envs/snp-pipeline/lib/python2.7/site-packages/vcf/parser.py
    pos = int(row[1])

I am using the latest version of snp-pipeline (2.2.0), and all other dependencies are at the versions suggested in the documentation.

@parul-sharma Thank you for reaching out. One of the VCF files is not in the expected format (it is missing the position number), which causes an exception in the pyvcf library. I can try to detect this condition and work around the problem in a future release. For now, I suggest we find and examine the malformed VCF file and then decide how to proceed. Would it be possible for you to make the VCF files available to me so I can diagnose the problem further? You can zip the files with the command below.
zip vcf.zip samples/*/var.flt.vcf
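
A quick way to locate the malformed file locally is to scan each var.flt.vcf for non-header lines that do not look like VCF records. The sketch below is only an illustration, not part of snp-pipeline: the function name find_malformed_lines is hypothetical, the glob pattern is taken from the zip command above, and the check (at least 8 tab-separated columns with an integer in the POS column) is an assumption based on the VCF format rather than on what pyvcf validates.

```python
import glob


def find_malformed_lines(vcf_path, min_columns=8):
    """Return (line_number, text) pairs for non-header lines that do not look like VCF records.

    Hypothetical helper: it only checks the column count and the POS column.
    """
    problems = []
    with open(vcf_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if text.startswith("#"):
                continue  # header and meta-information lines
            fields = text.split("\t")
            # A VCF record has at least 8 tab-separated columns,
            # and the second column (POS) is an integer.
            if len(fields) < min_columns or not fields[1].isdigit():
                problems.append((line_number, text[:80]))
    return problems


if __name__ == "__main__":
    # Same file layout as in the zip command: samples/<sample>/var.flt.vcf
    for path in sorted(glob.glob("samples/*/var.flt.vcf")):
        for line_number, text in find_malformed_lines(path):
            print("%s:%d: %s" % (path, line_number, text))
```

Run from the output directory, this prints one line per suspicious record, so every affected sample can be identified in a single pass instead of one failed pipeline run at a time.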

Hi Dr Steven,
Thanks for your quick reply.
Here are the requested vcf files.
vcf.zip

There was a Java memory allocation error. See samples/UW492/var.flt.vcf. Delete the file samples/UW492/var.flt.vcf and try rerunning the pipeline.

Thanks for finding the issue. I deleted that file and reran the pipeline. It still gave me the same error, but for a different file this time. I also tried running it on a cluster with significantly more memory, but at least 8 of my samples still end up with corrupted VCF files because of this Java memory allocation error.
Is there a way to work around this problem? I understand that this is a little outside the scope of your software, and I really appreciate you taking the time to help me.

It sounds like concurrent processes are competing for available memory. You can make some adjustments to the configuration file to reduce the number of concurrent processes. See these links for the documentation:
https://snp-pipeline.readthedocs.io/en/latest/faq.html#performance
https://snp-pipeline.readthedocs.io/en/latest/configuration.html

Changing the parameters in the configuration file totally worked! Thanks for suggesting that.