ksamuk/pixy

Pandas error: too many columns specified

sabrinamostoufi opened this issue · 2 comments

I've been trying to use pixy to estimate pi and dxy for two D. melanogaster samples, but I keep running into the same error:

"pandas.errors.ParserError: Too many columns specified: expected 2 and found 1"

I've double and triple-checked the sample names between my VCF and populations.txt files, and checked that there are no extra characters in my populations.txt file. I'm stumped!

The full pixy command I used:
pixy --stats pi dxy --vcf Parents_AllSites.vcf.gz --populations populations.txt --window_size 10000

A subset of my VCF, created using GATK:
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
[...]
##reference=file:///gpfs/projects/singhlab/smostouf/smostouf_WolRecomb/ParentSeqs/dmel-all-chromosome-r6.41.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RAL321_SRR8177612 RAL790_SRR8177521
2L 5904 . C A 58.17 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.08;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:84,6,0 ./.:0,0:0:.:0,0,0
2L 5974 . C T 23.19 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=11.60;SOR=0.693 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 1/1:0,2:2:6:49,6,0

My populations file:
RAL321_SRR8177612 ABC
RAL790_SRR8177521 DEF

OS information: macOS Big Sur v11.7.2

Hi there, the first thing to check is that your populations file is tab-separated.
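One way to check this without relying on how an editor displays whitespace is a short script like the one below. It is only a sketch, not part of pixy: the function name `find_non_tab_lines` is made up for illustration, and it assumes the populations file should have exactly two tab-separated fields per line (sample name, then population), which is what the "expected 2 and found 1" message suggests.

```python
from pathlib import Path


def find_non_tab_lines(path):
    """Return (line_number, repr_of_line) for every non-blank line that
    does not consist of exactly two non-empty tab-separated fields.

    Hypothetical helper for illustration; it is not a pixy function.
    """
    bad = []
    lines = Path(path).read_text().splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2 or any(not f.strip() for f in fields):
            # repr() makes spaces vs. "\t" visible in the output
            bad.append((lineno, repr(line)))
    return bad


if __name__ == "__main__":
    for lineno, shown in find_non_tab_lines("populations.txt"):
        print(f"line {lineno} is not 'sample<TAB>population': {shown}")
```

If the script prints nothing, every line has a literal tab between the sample name and the population. Lines separated by spaces show up with the spaces visible in the `repr` output.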

Thank you, I have it running now! The text editor I was using was inserting spaces when I pressed the Tab key, which was causing the error.
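For anyone who hits the same problem and would rather not fight their editor, here is a rough sketch of how a space-separated populations file could be rewritten with real tabs. The helper name `spaces_to_tabs` is hypothetical, and the sketch assumes that neither sample names nor population names contain whitespace, so each non-blank line splits into exactly two fields.

```python
def spaces_to_tabs(text):
    """Rewrite each non-blank line as 'sample<TAB>population'.

    Hypothetical helper for illustration. Assumes sample and population
    names contain no whitespace; raises ValueError otherwise.
    """
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # drop blank lines
        fields = line.split()  # split on any run of spaces or tabs
        if len(fields) != 2:
            raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    with open("populations.txt") as fh:
        fixed = spaces_to_tabs(fh.read())
    # Write to a new file so the original is left untouched.
    with open("populations.tab.txt", "w") as fh:
        fh.write(fixed)
```

The output file name `populations.tab.txt` is arbitrary. Pass whichever file you write to pixy's `--populations` option.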