gymrek-lab/TRTools

reheader for mergeSTR?

psychrb opened this issue · 6 comments

I'm using TRTools 4.0 now. Could you clarify what the reheadered VCF output of adVNTR should look like? Should it list other contigs? This is not clear to me from the documentation.
When I run mergeSTR on VCF files that have been sorted, compressed, and indexed, I get the error below. I do have the .gz and .tbi files, so I'm wondering whether the header is the issue.
"Make sure FILENAME.gz is bgzipped and indexed"
This is the command line:
mergeSTR --vcfs COMMASEPARATEDLIST OF SORTEDVCF FILES.gz --vcftype adVNTR --out adVNTR_Merged

Header of vcf output from adVNTR is below:
##fileformat=VCFv4.3
##source=adVNTR ver. 1.4.1
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
##INFO=<ID=VID,Number=1,Type=Integer,Description="VNTR ID">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat motif">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Reference repeat unit count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Spanning read count">
##FORMAT=<ID=FR,Number=1,Type=Integer,Description="Flanking read count">
##FORMAT=<ID=ML,Number=1,Type=Float,Description="Maximum likelihood">
##contig=<ID=1>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=2>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=X>
##contig=<ID=Y>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FILENAME

nmmsv commented

Hello,
It seems the issue is a missing index file. Can you please confirm that every name.vcf.gz file has an accompanying name.vcf.gz.tbi file? You can generate it by running tabix -p vcf name.vcf.gz
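
The check requested above can be scripted. This is a minimal sketch, not part of TRTools: `looks_bgzipped` and `input_problems` are hypothetical helper names, and the BGZF test is a heuristic that only inspects the first 16 bytes of the file. Later in this thread a file suffix other than .vcf turned out to trigger the same message, so the sketch also flags names that do not end in .vcf.gz; whether mergeSTR itself enforces that suffix is an assumption.

```python
import os
from typing import List

# First 4 bytes of a gzip member with the FEXTRA flag set, as BGZF blocks have.
GZIP_FEXTRA_MAGIC = b"\x1f\x8b\x08\x04"


def looks_bgzipped(path: str) -> bool:
    """Heuristic: a BGZF block carries a 'BC' extra subfield at bytes 12-13."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    return head[:4] == GZIP_FEXTRA_MAGIC and head[12:14] == b"BC"


def input_problems(path: str) -> List[str]:
    """Return human-readable problems found for one candidate input file."""
    problems = []
    # Assumption: the suffix matters, based on the fix reported in this thread.
    if not path.endswith(".vcf.gz"):
        problems.append("name does not end in .vcf.gz")
    if not os.path.exists(path):
        problems.append("file not found")
        return problems
    if not looks_bgzipped(path):
        problems.append("not bgzipped (plain gzip or uncompressed?)")
    if not os.path.exists(path + ".tbi"):
        problems.append("missing " + os.path.basename(path) + ".tbi")
    return problems
```

Calling `input_problems` on each path passed to `--vcfs` and printing any non-empty result should narrow down which file the message refers to.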

Thanks. Yes, I have done this. I have already sorted and compressed all the files and run tabix -p on them, so every sorted VCF has both a .gz and a .gz.tbi file, but I still get this error. So it is not due to the header? I didn't reheader the VCF files.

nmmsv commented

I just checked the previous issue you reported, and it seems the header contigs don't match the VCF lines. The contigs in the VCF lines are in the chrN format, but the contigs in the header are just N. Can you try making them consistent (so the header also uses chrN) and see if that helps?
The error should not be thrown for a header issue though, so that's still odd.
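
A quick way to see the mismatch described above is to compare the IDs declared in the `##contig` lines with the CHROM column of the records. The following is a sketch using only the standard library; the function names are hypothetical and it assumes a tab-delimited VCF, either plain or gzip/bgzip compressed.

```python
import gzip
import re
from typing import Set, Tuple

CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def open_vcf(path: str):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    with open(path, "rb") as probe:
        compressed = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if compressed else open(path, "rt")


def header_and_record_contigs(path: str) -> Tuple[Set[str], Set[str]]:
    """Collect contig IDs declared in the header and CHROM values in records."""
    declared, used = set(), set()
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("##"):
                match = CONTIG_ID.match(line)
                if match:
                    declared.add(match.group(1))
            elif not line.startswith("#") and line.strip():
                # First tab-separated column of a record is CHROM.
                used.add(line.split("\t", 1)[0])
    return declared, used


def undeclared_contigs(path: str) -> Set[str]:
    """CHROM values that appear in records but in no ##contig header line."""
    declared, used = header_and_record_contigs(path)
    return used - declared
```

For a file whose header lists `1`, `2`, ... while its records use `chr1`, `chr2`, ..., `undeclared_contigs` would return the `chr`-prefixed names.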

OK.
- First, I was able to resolve the bgzip/index error simply by renaming the files. They had .sort as a suffix rather than .vcf.
- I now get a different error, pertaining to the header, if you can advise. Which ref.fa.fai file am I supposed to use here? Or how should I modify the header?
- This error is the same whether the contigs in the VCF header have the 'chr' prefix or not.

Traceback (most recent call last):
  File "/hpc/packages/minerva-centos7/py_packages/3.7/bin/mergeSTR", line 8, in <module>
    sys.exit(run())
  File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/trtools/mergeSTR/mergeSTR.py", line 563, in run
    retcode = main(args)
  File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/trtools/mergeSTR/mergeSTR.py", line 521, in main
    useinfo, useformat = WriteMergedHeader(vcfw, args, vcfreaders, " ".join(sys.argv), vcftype)
  File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/trtools/mergeSTR/mergeSTR.py", line 90, in WriteMergedHeader
    "Different contigs found across VCF files. Make sure all "
ValueError: Different contigs found across VCF files. Make sure all files used the same reference. Consider using this command:
bcftools reheader -f ref.fa.fai file.vcf.gz -o file_rh.vcf.gz

The header is below:
##fileformat=VCFv4.3
##source=adVNTR ver. 1.4.1
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of variant">
##INFO=<ID=VID,Number=1,Type=Integer,Description="VNTR ID">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat motif">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Reference repeat unit count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=SR,Number=1,Type=Integer,Description="Spanning read count">
##FORMAT=<ID=FR,Number=1,Type=Integer,Description="Flanking read count">
##FORMAT=<ID=ML,Number=1,Type=Float,Description="Maximum likelihood">
##contig=<ID=chr1>
##contig=<ID=chr10>
##contig=<ID=chr11>
##contig=<ID=chr12>
##contig=<ID=chr13>
##contig=<ID=chr14>
##contig=<ID=chr15>
##contig=<ID=chr16>
##contig=<ID=chr17>
##contig=<ID=chr18>
##contig=<ID=chr19>
##contig=<ID=chr2>
##contig=<ID=chr20>
##contig=<ID=chr21>
##contig=<ID=chr22>
##contig=<ID=chr3>
##contig=<ID=chr4>
##contig=<ID=chr5>
##contig=<ID=chr6>
##contig=<ID=chr7>
##contig=<ID=chr8>
##contig=<ID=chr9>
##contig=<ID=chrX>
##contig=<ID=chrY>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FILENAME
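
On the question of which ref.fa.fai to use: a .fai file is the FASTA index that `samtools faidx ref.fa` writes next to a reference FASTA, with the contig name and length in its first two tab-separated columns. Since the error message asks that all files use the same reference, the natural candidate is the index of the reference genome the adVNTR calls were made against, though that is an inference from the message rather than something stated in the TRTools documentation. The sketch below, with hypothetical function names, prints the `##contig` lines such an index implies, which can be compared by eye with the header above.

```python
from typing import List, Tuple


def read_fai(path: str) -> List[Tuple[str, int]]:
    """Return (contig name, length) pairs from a FASTA index (.fai) file."""
    contigs = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                # Column 1 is the sequence name, column 2 its length in bases.
                contigs.append((fields[0], int(fields[1])))
    return contigs


def contig_header_lines(fai_path: str) -> List[str]:
    """Format one VCF ##contig header line per sequence in the index."""
    return [
        "##contig=<ID={},length={}>".format(name, length)
        for name, length in read_fai(fai_path)
    ]
```

If the names printed this way carry a `chr` prefix and the records in the VCF do too, reheadering every file with that same index should give all of them identical contig lines.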

nmmsv commented

It seems like different VCFs have different contigs in their headers. Can you confirm that all the VCF headers include the same contigs?
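
One way to confirm this is to group the input files by the contig IDs in their headers. This is a sketch with hypothetical function names, not the comparison mergeSTR performs internally: it compares only the IDs, in header order, so it may be stricter or looser than the real check.

```python
import gzip
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def header_contigs(path: str) -> Tuple[str, ...]:
    """Contig IDs from the ##contig lines of one VCF, in header order."""
    with open(path, "rb") as probe:
        compressed = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if compressed else open
    ids = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # the meta-information lines are over
            match = CONTIG_ID.match(line)
            if match:
                ids.append(match.group(1))
    return tuple(ids)


def group_by_contigs(paths: Iterable[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Map each distinct contig list to the files whose header declares it."""
    groups = defaultdict(list)
    for path in paths:
        groups[header_contigs(path)].append(path)
    return dict(groups)
```

If `group_by_contigs` returns more than one key for the files passed to `--vcfs`, the smaller groups are the files whose headers need to be brought in line with the rest.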

Yes, identified and corrected. Resolved: mergeSTR appears to work now. Thanks!