Problem during conversion from IMPUTE2 to PEDMAP format

Question

Closed this issue 7 years ago · 5 comments

muhammadsohailraza commented 7 years ago

Hi, I am trying to convert SHAPEIT phased files to PLINK format. My command lines were:
Step1: completed successfully
python convert_shapeit2_to_impute2.py CHR2.Phased.haps CHR2.Phased.sample impute.chr2.haps impute.chr2.legend impute.chr2.sample

Step2: (Error)
python convert_impute2_to_PEDMAP.py impute.chr2.haps impute.chr2.legend CHR2.Phased.sample chr2 2

It produces the following error message:
Traceback (most recent call last):
  File "/share_bio/unisvx3/zengchq_group/sohail/softwares/SHAPEIT2PLINK/SHAPEIT_to_PLINK-master/convert_impute2_to_PEDMAP.py", line 251, in <module>
    p_id = [x[3] for x in sample_info]
IndexError: list index out of range

Can anyone please help me to resolve this issue?

Thanks!

-sohail

Answer 1 · 2017-07-12T14:28:05.000Z

Hi,
The error hints at some sort of problem/inconsistency in the SHAPEIT2 .sample file (see line 246, quoted below):

sample_info = [x.replace('\n', '').split() for x in open(sys.argv[3], 'r').readlines()[2:]]

because the list sample_info is populated by reading that file and is then used on line 251 to populate the list p_id, which is where the error is raised.

Are you sure your original CHR2.Phased.sample file adheres to the SHAPEIT2 file format (see here)? It could be that recent versions of SHAPEIT have changed the file format (I am not sure about this though!).
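
One quick way to check this is to count the whitespace-separated fields on each data row of the .sample file. The sketch below is only an illustration: column_counts is a hypothetical helper, not part of the converter, and it assumes (as line 246 does) that the first two lines of the file are headers. The example rows are made up to mimic a three-column file.

```python
from collections import Counter


def column_counts(sample_lines, header_lines=2):
    """Count how many data rows have each number of fields.

    Hypothetical helper; skips the first `header_lines` lines,
    mirroring the `readlines()[2:]` slice on line 246.
    """
    rows = [line.split() for line in sample_lines[header_lines:] if line.strip()]
    return Counter(len(row) for row in rows)


# Made-up example of a three-column .sample file
example = [
    "ID_1 ID_2 missing\n",
    "0 0 0\n",
    "103-RQ 103-RQ 0\n",
    "109-EP 109-EP 0\n",
]
counts = column_counts(example)
print(counts)  # Counter({3: 2})
```

If every row reports fewer than seven fields, an expression such as x[3] fails with exactly the IndexError shown above. For a real file, the lines could be passed in as open(path).readlines().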

Answer 2 · 2017-07-13T03:09:16.000Z

Hi,
My sample file format is:

ID_1 ID_2 missing
0 0 0
103-RQ 103-RQ 0
109-EP 109-EP 0
16-EJ 16-EJ 0
31-CE 31-CE 0
40-JI 40-JI 0
43-AM 43-AM 0
50-JB 50-JB 0

I uploaded the file here: https://www.dropbox.com/s/4o4rtsmpu3cm82f/CHR2.Phased.sample?dl=0
It appears differently in Windows Notepad than it does here.

I wonder whether three columns are enough in the sample file, since the code seems to expect more, as shown below (line 246 onward):

sample_info = [x.replace('\n', '').split() for x in open(sys.argv[3], 'r').readlines()[2:]]

sample_names = [x[1] for x in sample_info]
family_id = [x[0] for x in sample_info]
p_id = [x[3] for x in sample_info]
m_id = [x[4] for x in sample_info]
gender = [x[5] for x in sample_info]
pheno = [x[6] for x in sample_info]

Please have a look!

Thanks

Answer 3 · 2017-07-13T14:00:51.000Z

Thanks for uploading the file. Yes, the problem is that your .sample file has only three columns. I am guessing that my phased data had more columns, corresponding to indices 3 to 6, which the script reads unconditionally, and that is why you are getting the error message. You can comment out the following lines:

# p_id = [x[3] for x in sample_info]
# m_id = [x[4] for x in sample_info]
# gender = [x[5] for x in sample_info]
# pheno = [x[6] for x in sample_info]

and change lines 256-273 to the following:

returned = Convert_impute2_to_PEDMAP(
	sys.argv[5],	# chromosome number
	sys.argv[2],	# .legend file
	sys.argv[1],	# .haps file
	sample_names,
	None,
	family_id,
	None,
	# p_id,
	# None,
	# m_id,
	# None,
	# gender,
	# None,
	# pheno,
	# None,
	sys.argv[4]		# output file name
)

and lines 6-23 to the following

def Convert_impute2_to_PEDMAP(
	chromosome = None, 
	legend_file = None, 
	haplotypes_file = None, 
	sample_names = None,
	sample_names_filename = None,
	family_id = None,
	family_id_filename = None,
	# p_id = None,
	# p_id_filename = None,
	# m_id = None,
	# m_id_filename = None,
	# gender = None,
	# gender_filename = None,
	# pheno = None,
	# pheno_filename = None,
	output = None, 
):

and change line 57 to the following

pedInfo.append([family_id[currentIndiv], sample_names[currentIndiv]])#, p_id[currentIndiv], m_id[currentIndiv], gender[currentIndiv], pheno[currentIndiv]])

as a temporary fix.
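
A possible alternative to commenting code out is to read the optional columns with a fallback value, so that both three-column and seven-column .sample files are accepted. This is only a sketch: field_or_default and parse_sample_rows are hypothetical names that do not exist in the converter, and the placeholder "0" is an assumption that should be checked against what the downstream PED file expects for each field.

```python
def field_or_default(row, index, default="0"):
    """Return row[index], or `default` when the row has too few columns."""
    return row[index] if index < len(row) else default


def parse_sample_rows(lines, header_lines=2):
    """Split .sample lines into per-sample lists, tolerating short rows.

    Hypothetical replacement for the list comprehensions around line 246;
    the column indices are the ones the script already uses.
    """
    rows = [line.split() for line in lines[header_lines:] if line.strip()]
    return {
        "family_id": [row[0] for row in rows],
        "sample_names": [row[1] for row in rows],
        "p_id": [field_or_default(row, 3) for row in rows],
        "m_id": [field_or_default(row, 4) for row in rows],
        "gender": [field_or_default(row, 5) for row in rows],
        "pheno": [field_or_default(row, 6) for row in rows],
    }


# Made-up three-column input
short_sample = [
    "ID_1 ID_2 missing\n",
    "0 0 0\n",
    "103-RQ 103-RQ 0\n",
    "109-EP 109-EP 0\n",
]
info = parse_sample_rows(short_sample)
print(info["sample_names"])  # ['103-RQ', '109-EP']
print(info["p_id"])          # ['0', '0']
```

With this approach the function signature and the call on lines 256-273 could stay unchanged, because all six lists are always populated.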

Answer 4 · 2017-07-18T14:45:38.000Z

@muhammadsohailraza did my previous reply fix your issue?

Answer 5 · 2017-07-19T11:15:58.000Z

Hi @baharian
Actually, I was in a hurry, so I simply added extra columns to the sample file rather than changing the code, and it works perfectly fine (there should be 7 columns starting from line 2).

For instance:

ID_1 ID_2 missing
0 0 0 0 0 0 0
103-RQ 103-RQ 0 0 0 0 0
109-EP 109-EP 0 0 0 0 0
16-EJ 16-EJ 0 0 0 0 0
31-CE 31-CE 0 0 0 0 0
40-JI 40-JI 0 0 0 0 0
43-AM 43-AM 0 0 0 0 0
50-JB 50-JB 0 0 0 0 0
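
For larger files, the same padding could be scripted instead of done by hand. The sketch below is an illustration only: pad_sample_lines is a hypothetical helper, and it assumes that the first line is left untouched while every later line is padded to seven fields with "0", as in the example above.

```python
def pad_sample_lines(lines, n_columns=7, filler="0"):
    """Pad every line after the first to `n_columns` fields.

    Hypothetical helper; the first header line is kept as is,
    and rows that already have enough fields are left unchanged.
    """
    padded = [lines[0].rstrip("\n")]
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fields += [filler] * (n_columns - len(fields))
        padded.append(" ".join(fields))
    return padded


# Made-up three-column input
original = [
    "ID_1 ID_2 missing\n",
    "0 0 0\n",
    "103-RQ 103-RQ 0\n",
]
for row in pad_sample_lines(original):
    print(row)
# ID_1 ID_2 missing
# 0 0 0 0 0 0 0
# 103-RQ 103-RQ 0 0 0 0 0
```

The padded rows could then be written to a new .sample file and passed as the third argument to convert_impute2_to_PEDMAP.py.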