gymrek-lab/TRTools

MergeSTR of adVNTR Files

Closed this issue · 8 comments

For mergeSTR of adVNTR output files, after sorting vcf files, bgzip and index...
I am getting error for command line mergeSTR --vcfs Sample1.vcf.gz, Sample2.vcg.gz --out OUT
See below. Each vcf.gz file contains adVNTR output for one sample for all chromosomes.
Please clarify the bcftools reheader config line that needs to be added here, if each vcf.gz file contains calls across chromosomes?? The example on GitHub seems to add : ##contig=<ID=21> for NA12892_chr21_advntr.sorted.vcf.gz

Traceback (most recent call last):
File "/hpc/packages/minerva-centos7/py_packages/3.7/bin/mergeSTR", line 8, in
sys.exit(run())
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/trtools/mergeSTR/mergeSTR.py", line 505, in run
retcode = main(args)
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/trtools/mergeSTR/mergeSTR.py", line 476, in main
current_records = [next(reader) for reader in vcfreaders]
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/trtools/mergeSTR/mergeSTR.py", line 476, in
current_records = [next(reader) for reader in vcfreaders]
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/vcf/parser.py", line 547, in next
pos = int(row[1])
ValueError: invalid literal for int() with base 10: ''

It seems to be indicating that your VCF has a variant where the POS column is empty. Is that the case?

We're going through a large update to improve performance that should be done soon. If there is a bug here on our end, it will be easier to fix once that is complete.

Upon my check there are no empty values in POS column. Could anything else be contributory..? Thanks.

Upon my check there are no empty values in POS column. Could anything else be contributory..? Thanks.

This looks like a PyVCF parsing issue. (As @LiterallyUniqueLogin mentioned we are almost done switching over to a different VCF parsing library). Does this error happen right away or is there some output first? If it doesn't die right away, it could be that one of the files is truncated.

The bcftools reheader line should add header lines for all relevant chromosomes. But this error doesn't seem to point to the contig header lines being an issue.

If that doesn't seem to be the explanation, if you could share any example files that give you the error we could better debug it.

I think the issue is adVNTR output rather than PyVCF parsing. While the POS column has not empty values, FILTER with many "." values..I think FILTER needs to "PASS" in order to proceed, no?....

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##source=adVNTR ver. 1.4.1
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
##INFO=<ID=VID,Number=1,Type=Integer,Description="VNTR ID">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat motif">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Reference repeat unit count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Spanning read count">
##FORMAT=<ID=FR,Number=1,Type=Integer,Description="Flanking read count">
##FORMAT=<ID=ML,Number=1,Type=Float,Description="Maximum likelihood">
##contig=<ID=1>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=2>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=X>
##contig=<ID=Y>
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view -h LIBD01_advntr.vcf; Date=Mon Jan 11 22:25:28 2021
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_01
chr1 700243 . CGAGGCGGGAAGATCACTTGATATCAGGAGTCGAGGCGGGAAGATCACTTGACGTCAGGAGTT . . . END=700306;VID=156;RU=CGAGGCGGGAAGATCACTTGACATCAGGAGTT;RC=2 GT:DP:SR:FR:ML 0/0:34:13:21:1.0000
chr1 777293 . AGAAATGCTTCTTTAGAAATGCTTCTTT . . . END=777321;VID=174;RU=AGAAATGCTTCTTT;RC=2 GT:DP:SR:FR:ML 0/0:37:20:17:1.0000
chr1 900683 . TTTATTTAGTTGTTTATTTATTTAGTTATTTATCTTATTTATTGAGGGG . . . END=900732;VID=239;RU=TTTATTTAGTTATTTA;RC=3 GT:DP:SR:FR:ML 0/0:45:13:32:1.0000

I agree, by issue with PyVCF parsing, I meant a problem in the VCF that confuses the parser rather than an error in PyVCF itself.

It should be ok to have a "." in FILTER. It does not need to be "PASS" to continue.

I tested reading this chunk of the VCF in PyVCF and had no error. So I still wonder if the error comes from lines further down, perhaps that might have gotten truncated or are somehow formatted incorrectly.

If you are still having issues, you are welcome to send the VCF to mgymrek AT eng DOT ucsd DOT edu and one of us can try to reproduce the error.

Hi there,

TRTools 4.0.0 has been released! The code base has been overhauled and there are many bugfixes. I don't know if those will solve your problem or not, but please try running the code again with the new update and let us know what happens.

OK. Thanks for letting me know about 4.0.0 release. I'm re-running adVNTR now too after some adjustment, and will contact you if issues then with mergeSTR with updated trtools. Thanks.