Long Format for SNP_ID
jabbarlinks opened this issue · 12 comments
Respected Dr. Frei,
My bim file contains long format for SNP ID like 22:16541711_T_G , similarly in my GWAS file. But when I run fit1 then it gives segmentation error message. I check everything so many times but not successful.
However, I repeated all these code successfully for univariate & bivariate analysis & visualization etc by taking some GWAS from internet. The only difference I found is these files have short format of SNP ID is like rs1755411.
So, my question, does long variant name works in MiXer or not? Or I have to change my bim fam bed files in short variant name like rs15425421
Kind Regards
Respected Dr. Frei,
Thanks for you quick reply and sorry for repeating it.
So, it means MiXeR does not recognized SNP column if it is CHR:bp_a1_a2 format ?
If its true then my focused is to convert bfiles (bim, fam, bed) in rs# format by separating all columns (I did not know this, I have to learn it).
However, converting only sample data file in rs# does not work as bfiles have long variant CHR:bp_a1_a2 format which obviously not the same as in sample data.
Need you kind advise in this regard.
Regards
MiXeR does not recognized SNP column if it is CHR:bp_a1_a2 format ?
This is correct. MiXeR doesn't recognize this format for SNP column.
convert bfiles (bim, fam, bed) in rs#
No, you don't need to do this. My suggestion is to convert the summary statistics (i.e. the files you feed into --trait1 and --trait2).
Respected Dr. Frei,
Thanks alot, but converting only sample GWAS does not work and same segmentation error which means the files I am looking for are not available (SNP does not match). However, there is not problem to access the directories.
Many thanks for your kind cooperation, I will further re-look in details and try after converting bfiles and update you if it works.
Regards
Dear Dr. Frei,
Might be I am wrong, so want to update you and need your views as still problem exists.
Now my sample GWAS file (which is taken from .bim by simulation methods, so every chr, bp, a1, a2 in sample GWAS must exists in .bim file too). Then, added corresponding rs# to my sample file taken from **Kaviar - known variants in the human genome, release 160204 (systemsbiology.net) ** as hg19 GRCh37.
The sample GWAS file looks like
SNP CHR bp PVAL A1 A2 N BETA SE
rs139711966 22 16904583 0.580567 T C 2000 -0.03325 0.060159
rs62228723 22 17065548 0.19045 A A 2000 0.086706 0.066203
rs5748002 22 17075476 0.571352 A A 2000 0.030454 0.053791
rs5993571 22 17089569 0.185429 G A 2000 0.04896 0.03696
.bim files looks like
22 22:16904056_A_G 0 16904056 A G
22 22:16904583_C_T 0 16904583 C T
22 22:16905630_G_A 0 16905630 G A
22 22:16906282_G_A 0 16906282 G A
22 22:16906517_C_A 0 16906517 C A
But still the error was same; segmentation error. Now only 1 solution in my mind is revise / regenerate .bim .bed .fam file by adding rs#.
Kind Regards
You GWAS file look (almost) compatible with MiXeR. I think it lacks "z-score", but apart from that it should work.
But still the error was same; segmentation error.
Please post here the specific error you get.
Dear Dr. Frei,
Many thanks, I am pasting here error and log file also. I also have z-score, same error when i use GWAS with z-score. Hhowever, I will post error in next seprate post to avoid too long.
Error message:
Call:
./mixer.py fit1
--out /home/jabbar/Plink/retest/train_fit.rep.22
--lib /home/jabbar/mixer/src/build/lib/libbgmg.so
--bim-file /home/jabbar/Plink/retest/EURchr22bedformat.bim
--ld-file /home/jabbar/Plink/retest/EURchr22bedformat.run4.ld
--trait1-file /home/jabbar/Plink/retest/train11_qc_noMHC.csv.gz
--extract /home/jabbar/Plink/retest/EURchr22bedformat.QC.snps
)
INFO:root:init(lib_name=/home/jabbar/mixer/src/build/lib/libbgmg.so, context_id=0)
Segmentation fault (core dumped)
log file:
20210706 23:54:47.431707 =* GNU General Public License v3
20210706 23:54:47.431827 =***********************************************************************
20210706 23:54:47.431947 =Call:
20210706 23:54:47.432066 =./mixer.py fit1
20210706 23:54:47.432186 = --out /home/jabbar/Plink/retest/train_fit.rep.22
20210706 23:54:47.432306 = --lib /home/jabbar/mixer/src/build/lib/libbgmg.so
20210706 23:54:47.432430 = --bim-file /home/jabbar/Plink/retest/EURchr22bedformat.bim
20210706 23:54:47.432550 = --ld-file /home/jabbar/Plink/retest/EURchr22bedformat.run4.ld
20210706 23:54:47.432670 = --trait1-file /home/jabbar/Plink/retest/train11_qc_noMHC.csv.gz
20210706 23:54:47.432790 = --extract /home/jabbar/Plink/retest/EURchr22bedformat.QC.snps
20210706 23:54:47.432943 Dispose context (id=0)
20210706 23:54:47.434795 Create new context (id=0)
20210706 23:54:47.434994 >init(bim_file=/home/jabbar/Plink/retest/EURchr22bedformat.bim, frq_file=, chr_labels=1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22, trait1_file=/home/jabbar/Plink/retest/train11_qc_noMHC.csv.gz, trait2_file=, exclude=, extract=/home/jabbar/Plink/retest/EURchr22bedformat.QC.snps);
20210706 23:54:47.456043 Construct reference from 1 files...
20210706 23:54:47.498489 Found 12316 variants in /home/jabbar/Plink/retest/EURchr22bedformat.bim
20210706 23:54:47.499449 Found 12316 variants in total.
20210706 23:54:47.604790 Found 409 variants with well-defined Z and N in /home/jabbar/Plink/retest/train11_qc_noMHC.csv.gz. Other statistics:
20210706 23:54:47.605204 608 lines found (including header)
20210706 23:54:47.605502 466 lines matched via CHR:BP:A1:A2 code (not SNP rs#)
20210706 23:54:47.605796 141 lines were ignored as RS# (or chr:bp:a1:a2 code) did not match reference file.
20210706 23:54:47.606061 57 variants were ignored as they are strand-ambiguous.
20210706 23:54:47.606325 409 variants had flipped A1/A2 alleles; sign of z-score was flipped.
20210706 23:54:47.606758 constrain analysis to 409 tag variants (due to trait1_file='/home/jabbar/Plink/retest/train11_qc_noMHC.csv.gz')
20210706 23:54:47.643059 constrain analysis to 348 tag variants (due to extract='/home/jabbar/Plink/retest/EURchr22bedformat.QC.snps')
20210706 23:54:47.643949 set_tag_indices(num_snp=12316, num_tag=348);
20210706 23:54:47.644291 >set_chrnumvec(12316);
20210706 23:54:47.644543 <set_chrnumvec(12316);
20210706 23:54:47.644786 >LdMatrixCsr::init_chunks();
20210706 23:54:47.645091 highest chr label: 22
20210706 23:54:47.645443 <LdMatrixCsr::init_chunks();
20210706 23:54:47.645613 set_zvec(trait=1); num_undef=0
20210706 23:54:47.645891 set_nvec(trait=1); num_undef=0
20210706 23:54:47.646164 <init(bim_file=/home/jabbar/Plink/retest/EURchr22bedformat.bim, frq_file=, chr_labels=1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22, trait1_file=/home/jabbar/Plink/retest/train11_qc_noMHC.csv.gz, trait2_file=, exclude=, extract=/home/jabbar/Plink/retest/EURchr22bedformat.QC.snps); elapsed time 211ms
20210706 23:54:47.647013 set_option(r2min=0);
20210706 23:54:47.647580 set_option(kmax=20000);
20210706 23:54:47.647928 clear_state
20210706 23:54:47.648116 set_option(seed=123);
20210706 23:54:47.648510 set_option(cubature_rel_error=1e-05);
20210706 23:54:47.648857 set_option(cubature_max_evals=1000);
20210706 23:54:47.649205 initialize mafvec
20210706 23:54:47.649388 >load_ld_matrix(filename=/home/jabbar/Plink/retest/EURchr22bedformat.run4.ld)
20210706 23:54:47.687400 <load_ld_matrix(filename=/home/jabbar/Plink/retest/EURchr22bedformat.run4.ld), format version 1
20210706 23:54:47.687798 set_ld_r2_coo(filename=/home/jabbar/Plink/retest/EURchr22bedformat.run4.ld, numel=143202)...
20210706 23:54:47.694007 >set_ld_r2_coo(chr_label=1, length=143202);
20210706 23:54:47.694913 LdMatrixCsr::init_diagonal(chr_label=1) added 0 tag (out of 0 snps) elements with r2=1.0 to the diagonal of LD r2 matrix
Dear Dr. Frei,
Just for reference, same error like above for GWAS with Z score.
Sample Gwas is as
CHR SNP POS ID A1 A2 N Z P BETA SE
22 rs139711966 16904583 22:16904583_C_T C T 3000 -1.64042 0.101024 -0.083106 0.0506615
22 rs2890567 17065545 22:17065545_A_G A A 3000 -0.818327 0.413235 -0.0273337 0.033402
22 rs62228723 17065548 22:17065548_A_T A A 3000 1.16552 0.243903 0.0674969 0.0579116
22 rs5993571 17089569 22:17089569_A_G A G 3000 -1.27401 0.202757 -0.0400473 0.0314339
22 rs5993992 17225773 22:17225773_T_C T C 3000 1.77873 0.0753845 0.074904 0.0421108
22 rs165652 17315458 22:17315458_T_G T T 3000 -0.433815 0.664454 -0.0132143 0.0304607
22 rs62235933 17363636 22:17363636_G_T G T 3000 1.11042 0.266907 0.0475201 0.0427947
@jabbarlinks hi, and sorry for delay - I've been away
Please try adding "--chr2use 22" flag to your mixer.py command. By default --chr2use is 1-22, which doesn't work in your example as the data is limited to chr22.
From this message:
20210706 23:54:47.694007 >set_ld_r2_coo(chr_label=1, length=143202);
20210706 23:54:47.694913 LdMatrixCsr::init_diagonal(chr_label=1) added 0 tag (out of 0 snps) elements with r2=1.0 to the diagonal of LD r2 matrix
it seem that MiXeR doesn't correctly handle "empty" chromosomes. I agree it shouldn't crash, but give a meaningful error - but for now please try to constrain the analysis to chr22.
I have an example which works on chr21 here: https://github.com/comorment/containers/blob/main/usecases/mixer_simu.md
Respected Dr. Frei,
Many thanks for your valuable feedback, Yes, it work 100% now.
Many many thanks
@jabbarlinks Excellent, thank you for letting me know! 👍
Dear Dr Frei @ofrei,
I was having a similar issue to this - I was getting a segmentation fault when I tried to submit mixer.py ld to generate .ld and .snps files. I was doing this on just chr 22 to make sure it works before I submit jobs for all of the chromoromes. I assumed that perhaps the error was similar, so I added --chr2use 22
into the script like this:
python ./mixer/precimed/mixer.py ld --lib ./mixer/src/build/lib/libbgmg.so --bfile 1000G_EUR_Phase3_plink/1000G.EUR.QC.22 --out 1000G_EUR_phase3_plink/1000G.EUR.QC.22.run4.ld --chr2use 22 --r2min 0.05 --ldscore-r2min 0.05 --ld-window-kb 30000
But I got the error:
usage: mixer.py [-h] {fit1,test1,fit2,test2,ld,perf,snps} ... mixer.py: error: unrecognized arguments: --chr2use 22
Have I implemented this fix incorrectly?
The readme mentions that this step is not necessary for EUR data, which is what I'm using, however I couldn't work out how to do the univariate step without the .ld and .snps files that it needs.
Any help you could offer with this would be very much appreciated!
Many thanks,
Anja