mgalardini/pyseer

Possible problem with VCF file meaning 0 tested/printed variants

michhulin opened this issue · 2 comments

Hi,

I'm trying to run pyseer with a VCF file; my command and its output are below. I think there may be a problem with my input VCF file, as it appears that none of the variants are passed on for testing. However, I'm unsure what is wrong.

Many thanks!

pyseer --phenotypes psp_pheno --vcf out.vcf --distances dist --min-af 0.01 --max-af 0.99 --cpu 15 --filter-pvalue 1E-8 > pyseer.assoc-snp2

Read 55 phenotypes
Detected binary phenotype
Structure matrix has dimension (55, 55)
Analysing 55 samples found in both phenotype and structure matrix
23018 loaded variants
23018 pre-filtered variants
0 tested variants
0 printed variants

This is the format of the VCF file:

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Reference genome=GCF_000012205
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_normVersion=1.9-64-g28bcc56+htslib-1.9-52-g6e86e38
##bcftools_normCommand=norm -m - VCF.GCF_000012205.vcf; Date=Fri Jan 19 14:44:29 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0886-19 1149B 1302A 1390 82-HI B_21-22 B_21-3 B_21-60 B_21-69 B_21-76 B_21-79 B_21-8 B_22-1 B_22-3 B_22-4 B_22-5 B_22-8 GCF_000012205 GCF_001294035 GCF_001294065 GCF_001294105 GCF_001294265 GCF_001400605 GCF_003412855 GCF_003412865 GCF_003412905 GCF_003412915 GCF_003412965 GCF_003412985 GCF_003412995 GCF_003413015 GCF_003413035 GCF_003413075 GCF_003413095 GCF_003413115 GCF_003413145 GCF_003413155 GCF_003413175 GCF_003413195 GCF_003413225 GCF_003413235 GCF_003413245 GCF_003413305 GCF_003413345 GCF_003413365 GCF_003413375 GCF_003413385 GCF_003413425 GCF_003413445 GCF_003700295 GCF_003701825 GCF_003703035 R12-ID R2NY R2QHB
1 1742643 AAAAAAAA.CCCCGCCT_F G C . . NS=47;AF=0.018 GT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . 0 . . 0 . 0 0 0 0 0 1 . 0 0 0 0 0 0 0 0 0 . 0 0 . . 0 0 0 0 0 0 0

The usual cause of this is that sample names do not match between VCF and phenotypes.
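A quick way to check for a name mismatch is to compare the sample columns of the VCF `#CHROM` header line against the first column of the phenotype file. The sketch below is only an illustration: it assumes the phenotype file is whitespace-delimited with one header row and sample names in the first column, the function names are made up here, and the toy data (including the mismatched name `GCF_000012205.1`) is hypothetical.

```python
def vcf_sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed (CHROM .. FORMAT); samples follow.
            return line.split()[9:]
    raise ValueError("no #CHROM header line found")


def phenotype_sample_names(pheno_lines):
    """Return first-column names, skipping the header row (assumed layout)."""
    names = []
    for i, line in enumerate(pheno_lines):
        if i == 0 or not line.strip():
            continue
        names.append(line.split()[0])
    return names


def compare_samples(vcf_lines, pheno_lines):
    """Report which sample names are shared and which appear on one side only."""
    vcf = set(vcf_sample_names(vcf_lines))
    pheno = set(phenotype_sample_names(pheno_lines))
    return {
        "shared": sorted(vcf & pheno),
        "vcf_only": sorted(vcf - pheno),
        "pheno_only": sorted(pheno - vcf),
    }


# Toy inputs (hypothetical): one name differs by a ".1" suffix.
toy_vcf = [
    "##fileformat=VCFv4.0\n",
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0886-19 1149B GCF_000012205\n",
]
toy_pheno = [
    "samples\tphenotype\n",
    "0886-19\t1\n",
    "1149B\t0\n",
    "GCF_000012205.1\t0\n",
]

report = compare_samples(toy_vcf, toy_pheno)
print(report)
```

On real files the same function can be given open file handles, for example `compare_samples(open("out.vcf"), open("psp_pheno"))`. Any names listed under `vcf_only` or `pheno_only` are candidates for a mismatch.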

You may also not have enough observations; 55 samples is small. The variant shown above has a frequency of 1.8%, so it should pass the frequency filter. With this few samples, though, you should really narrow the range to at least 5% to 95%: at 55 samples a single carrier is already 1/55 ≈ 1.8%, so the current filters include singletons.
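The arithmetic behind the singleton point can be sketched as follows. This is an illustration only: it assumes the frequency is the number of carriers divided by the total number of samples (which is consistent with `AF=0.018` = 1/55 in the record above), and the helper names are made up here rather than taken from pyseer.

```python
def min_carriers(n_samples, min_af):
    """Smallest carrier count whose frequency reaches min_af (assumed k/n)."""
    for k in range(1, n_samples + 1):
        if k / n_samples >= min_af:
            return k
    return None


def summarise_genotypes(gt_string):
    """Count carriers ('1'), called samples, and missing calls ('.')."""
    calls = gt_string.split()
    carriers = sum(1 for c in calls if c == "1")
    missing = sum(1 for c in calls if c == ".")
    return {
        "n_samples": len(calls),
        "called": len(calls) - missing,
        "carriers": carriers,
        "af": carriers / len(calls),
    }


# Genotype calls copied from the VCF record shown in the issue.
GENOTYPES = (
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . 0 . . 0 . "
    "0 0 0 0 0 1 . 0 0 0 0 0 0 0 0 0 . 0 0 . . 0 0 0 0 0 0 0"
)

summary = summarise_genotypes(GENOTYPES)
print(summary)

# Carriers needed to pass each lower threshold with 55 samples.
for threshold in (0.01, 0.05):
    print(threshold, min_carriers(55, threshold))
```

Under that assumption, a 1% lower bound is met by a single carrier among 55 samples, whereas a 5% lower bound needs at least three carriers (3/55 ≈ 5.5%).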

I would also suggest removing the --filter-pvalue option.

Hi John,

Great, thank you. Removing the filter allowed it to work.

Many thanks
Michelle