mathii/gdc

format of the .ind file for vcf2eigenstrat


Hi!
I'm trying to use your vcf2eigenstrat.py script with an .ind file, which I assumed was a two-column file with each individual's name and its associated population name (pop_map-like), but it didn't work:

python /home/buitracn/RADseq/tools/gdc/vcf2eigenstrat.py -v spis.296.all.filtered.markers.recode.vcf -i pop_map_spis296.ind -o spis296.all.loci.eigenstrat
[('-v', 'spis.296.all.filtered.markers.recode.vcf'), ('-i', 'pop_map_spis296.ind'), ('-o', 'spis296.all.loci.eigenstrat')] []
-v spis.296.all.filtered.markers.recode.vcf
-i pop_map_spis296.ind
-o spis296.all.loci.eigenstrat
found options:
{'indmap': 'pop_map_spis296.ind', 'indAsPop': False, 'ref': None, 'vcf': 'spis.296.all.filtered.markers.recode.vcf', 'out': 'spis296.all.loci.eigenstrat'}
Traceback (most recent call last):
  File "/home/buitracn/RADseq/tools/gdc/vcf2eigenstrat.py", line 138, in <module>
    main(options)
  File "/home/buitracn/RADseq/tools/gdc/vcf2eigenstrat.py", line 56, in main
    pop_map[bits[0]]=bits[2]
IndexError: list index out of range

Could you please explain how I should format this file for the script to work?

Thanks in advance,
Carol

The .ind file should be in eigenstrat format. It has three columns: Individual_Name, Sex, Population. Sex is usually coded as M/F/U. That won't necessarily matter for this conversion, but it may be used in downstream analysis.
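For a two-column pop map like the one described in the question, a small helper can add the missing Sex column. This is only a sketch: the function name `popmap_to_ind` and the sample names are made up for illustration, sex is filled in as "U" (unknown) for everyone, and tab-separated output is an assumption, since the thread does not show how the script splits each line.

```python
def popmap_to_ind(lines, sex="U"):
    """Turn 'individual population' lines into 'individual sex population' lines.

    Hypothetical helper: every individual gets the same placeholder sex code.
    """
    out = []
    for line in lines:
        bits = line.split()
        if not bits:
            continue  # skip blank lines
        if len(bits) != 2:
            raise ValueError(f"expected 2 columns, got {len(bits)}: {line!r}")
        name, pop = bits
        # Population ends up in the third column, which is what bits[2]
        # in the traceback above is trying to read.
        out.append(f"{name}\t{sex}\t{pop}")
    return out


# Made-up example input in the two-column pop_map layout.
example = ["sample_01 popA", "sample_02 popB"]
ind_lines = popmap_to_ind(example)
print("\n".join(ind_lines))
```

To apply it to a real file, read the pop map with `open(...).readlines()` and write the returned lines to a new .ind file, one per line.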

Hi Carol,

Hm. Where did you find that eigenstrat format definition? I don't think that is correct. See line 45 onwards here: https://github.com/DReichLab/AdmixTools/blob/master/convertf/README.

The snp file contains 1 line per SNP. There are 6 columns (last 2 optional):
1st column is SNP name
2nd column is chromosome. X chromosome is encoded as 23.
...

  • Iain

The text you quoted refers to ped format, not eigenstrat/ancestrymap. If you use the original format you should be ok.

Note that my script outputs unpacked (i.e. plain text) eigenstrat/ancestrymap format. If you have a large dataset, it is worth converting it to PACKEDANCESTRYMAP binary format using convertf, which will take up much less space and be more efficient. I should add an option to do this automatically.
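As a rough illustration of the convertf step, the snippet below builds a parameter file that could then be passed to convertf with `convertf -p <parfile>`. It assumes the script's output sits in `<root>.geno`, `<root>.snp` and `<root>.ind`; the helper name `convertf_parfile` and the `.packed` output root are hypothetical, so check the convertf README before relying on it.

```python
def convertf_parfile(in_root, out_root):
    """Return convertf parameter-file text for unpacked -> PACKEDANCESTRYMAP.

    Hypothetical helper; file roots are assumptions, not taken from the script.
    """
    params = [
        ("genotypename", f"{in_root}.geno"),
        ("snpname", f"{in_root}.snp"),
        ("indivname", f"{in_root}.ind"),
        ("outputformat", "PACKEDANCESTRYMAP"),
        ("genotypeoutname", f"{out_root}.geno"),
        ("snpoutname", f"{out_root}.snp"),
        ("indivoutname", f"{out_root}.ind"),
    ]
    # convertf reads one "key: value" pair per line.
    return "\n".join(f"{key}: {value}" for key, value in params) + "\n"


par_text = convertf_parfile("spis296.all.loci.eigenstrat",
                            "spis296.all.loci.packed")
print(par_text)
```

Writing `par_text` to a file and running convertf on it should produce the packed files under the second root, provided the input files exist under the names assumed here.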

What is the error message in spis296.all.loci.eigenstrat.log?

Can't really see what is going on from that. If you are able to share your .snp .ind and .geno files, I could take a look and see if I can figure it out - you can email them to me. Or, if you can post the first few lines of each file, it might be clear what's going on.

You can email me at mathi@pennmedicine.upenn.edu. Don't send the files if they are too big though.

Try just renumbering all the chromosomes to "1". It shouldn't matter for the purposes of running PCA.
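One way to do the renumbering is to rewrite the second column of the .snp file. The sketch below assumes whitespace-separated columns with the chromosome in column 2, as in the format description quoted above; the function name `renumber_chromosomes` and the example SNP lines are made up for illustration.

```python
def renumber_chromosomes(snp_lines, chrom="1"):
    """Set the chromosome column (2nd column) of every .snp line to `chrom`."""
    out = []
    for line in snp_lines:
        bits = line.split()
        if len(bits) < 2:
            # Leave blank or malformed lines untouched.
            out.append(line.rstrip("\n"))
            continue
        bits[1] = chrom
        out.append("\t".join(bits))
    return out


# Made-up example lines: name, chromosome, genetic pos, physical pos, alleles.
example_snp = [
    "snp_1 scaffold_12 0.0 1043 A G",
    "snp_2 scaffold_907 0.0 88 C T",
]
renumbered = renumber_chromosomes(example_snp)
print("\n".join(renumbered))
```

Only the chromosome column changes; positions are kept as they are, so SNPs from different scaffolds may now share a position or appear out of order. Whether that matters for a given downstream tool is not something this thread settles.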