ksamuk/pixy

IndexError: list index out of range

Pcariman opened this issue · 2 comments

Hi !
I am working with whole genome data of 20 individuals from two populations, to avoid memory limitation issues I generated a VCF Allsites file for chromosome/contig 'Sc1nRTI_147;HRSCAF=971'.
When running pixy I got the following error message:

pixy --stats pi fst dxy --vcf 2_concat_allsites.vcf.gz --populations popfile.txt --window_size 10000 --n_cores 4 --chromosomes 'Sc1nRTI_147;HRSCAF=971'
[pixy] pixy 1.2.7.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...OK
[pixy] Checking chromosome data...OK
[pixy] Checking intervals/sites...OK
[pixy] Checking sample data...OK
[pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi, fst, dxy
[pixy] Using Weir and Cockerham (1984)'s estimator of FST.
[pixy] Data set contains 2 population(s), 1 chromosome(s), and 20 sample(s)
[pixy] Window size: 10000 bp

[pixy] Started calculations at 12:00:38 on 2023-01-23
[pixy] Using 10 out of 48 available CPU cores

[pixy] Processing chromosome/contig Sc1nRTI_147;HRSCAF=971...
IndexError: list index out of range

A reproducible example of the problem
If your question is about how your dataset is specifically being processed by pixy, please copy and paste the following:
(1) The command I used to run pixy is:
pixy --stats pi fst dxy --vcf 2_concat_allsites.vcf.gz --populations popfile.txt --window_size 10000 --n_cores 4 --chromosomes 'Sc1nRTI_147;HRSCAF=971'
(2) A subset of my VCF is:
The uncompressed VCF file weighs 6.2 G

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftoolsCommand=mpileup --threads 30 -Oz --output 1_pileup_sar_te.vcf.gz -f ../../../genoma_ref/jasmine-uni1728-mb-hirise-3bs35_08-29-2020__hic_output.fasta --bam-list SARTE.txt -r Sc1nRTI_147;HRSCAF=971
##reference=file://../../../genoma_ref/jasmine-uni1728-mb-hirise-3bs35_08-29-2020__hic_output.fasta
##contig=<ID=Sc1nRTI_1;HRSCAF=9,length=9724429>
##contig=<ID=Sc1nRTI_2;HRSCAF=53,length=43341626>
##contig=<ID=Sc1nRTI_3;HRSCAF=77,length=26245687>
##contig=<ID=Sc1nRTI_4;HRSCAF=306,length=762119>
##contig=<ID=Sc1nRTI_5;HRSCAF=352,length=47867>
##contig=<ID=Sc1nRTI_6;HRSCAF=398,length=490273>
##contig=<ID=Sc1nRTI_7;HRSCAF=419,length=156346>
##contig=<ID=Sc1nRTI_8;HRSCAF=480,length=173972>
.
.
##contig=<ID=Sc1nRTI_147;HRSCAF=971,length=52070233>
.
.
##contig=<ID=Sc1nRTI_1351;HRSCAF=2611,length=4789>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##INFO=<ID=ICB,Number=1,Type=Float,Description="Inbreeding Coefficient Binomial test (bigger is better)">
##INFO=<ID=HOB,Number=1,Type=Float,Description="Bias in the number of HOMs number (smaller is better)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_callCommand=call -m -f GQ -Oz -o 1_called_sar_te.vcf.gz 1_pileup_sar_te.vcf.gz; Date=Wed Jan 11 11:35:23 2023
##bcftools_filterVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_filterCommand=filter -Oz -o 1.1_CALLED_DP.vcf.gz -i 'DP >= 4 && DP <= 500' ../1_called_sar_te_GL02.vcf.gz; Date=Mon Jan 16 10:48:12 2023
##bcftools_concatVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_concatCommand=concat --allow-overlaps -O z -o 3_concat_allsites_sar_te.vcf.gz invariants.vcf.gz variants.vcf.gz; Date=Mon Jan 16 11:39:16 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sar_TE02_aln_sorted.bam Sar_TE03_aln_sorted.bam Sar_TE04_aln_sorted.bam Sar_TE05_aln_sorted.bam Sar_TE06_aln_sorted.bam Sar_TE07_aln_sorted.bam Sar_TE11_aln_sorted.bam Sar_TE12_aln_sorted.bam Sar_TE14_aln_sorted.bam Sar_TE16_aln_sorted.bam
Sc1nRTI_147;HRSCAF=971 7 . A . 74 PASS . GT 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 ./.
Sc1nRTI_147;HRSCAF=971 8 . A . 81 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./.
Sc1nRTI_147;HRSCAF=971 9 . A . 81 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./.
Sc1nRTI_147;HRSCAF=971 10 . A . 86 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./.
Sc1nRTI_147;HRSCAF=971 11 . T . 88 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./.
.
.
.
Sc1nRTI_147;HRSCAF=971 598 . C T 999 PASS . GT:PL:GQ ./.:0,0,0:0 1/1:255,54,0:61 1/1:84,9,0:16 1/1:99,9,0:16 0/1:105,0,75:67 1/1:90,6,0:13 1/1:148,15,0:22 1/1:116,9,0:16 0/1:31,0,13:6 1/1:165,21,0:28
Sc1nRTI_147;HRSCAF=971 599 . C . 999 PASS . GT ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Sc1nRTI_147;HRSCAF=971 600 . C T 129 PASS . GT:PL:GQ ./.:0,0,0:0 0/1:89,0,255:84 0/0:0,15,136:19 0/0:0,9,102:14 0/0:0,24,192:28 0/1:90,6,0:6 0/0:0,15,151:19 0/0:0,9,134:14 0/0:0,6,52:11 0/0:0,24,177:28
.
.
.

(2) and the full populations file, separated by tabs where the first column contains sample names (exactly as represented in the VCF) and the second column contains population names:
$popfile.txt
Ssc_TE01_aln_sorted.bam TE
Ssc_TE02_aln_sorted.bam TE
Ssc_TE03_aln_sorted.bam TE
Ssc_TE04_aln_sorted.bam TE
Ssc_TE05_aln_sorted.bam TE
Ssc_TE06_aln_sorted.bam TE
Ssc_TE07_aln_sorted.bam TE
Ssc_TE08_aln_sorted.bam TE
Ssc_TE11_aln_sorted.bam TE
Ssc_TE15_aln_sorted.bam TE
Ssc_CN01_aln_sorted.bam CN
Ssc_CN02_aln_sorted.bam CN
Ssc_CN04_aln_sorted.bam CN
Ssc_CN06_aln_sorted.bam CN
Ssc_CN08_aln_sorted.bam CN
Ssc_CN09_aln_sorted.bam CN
Ssc_CN10_aln_sorted.bam CN
Ssc_CN11_aln_sorted.bam CN
Ssc_CN12_aln_sorted.bam CN
Ssc_CN13_aln_sorted.bam CN

(3) I generated the AllSites VCF file using BCFtools, as listed in the pixy guide (https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html).

Best regards,

Hi there, this error seems to associate with complex contig names containing non-alphanumeric characters. We're working on a fix for this, but in the meantime, you could try removing the non-alphanumeric characters from your contig names.

Hi, thanks for your response. I solved my problem by changing the character (;) of the contig to (_). I used the following command lines:

$bcftools query -f '%CHROM\n' myvcf.vcf.gz > chrom.txt
$cat chrome.txt | awk -F ";" '{print $1";"$2" "$1"_"$2}' >> chrom_mod.txt
$bcftools annotate --rename-chrs chrom_mod.txt myvcf.vcf.gz | bgzip > my_vcf_ready_to_pixy.vcf.gz