Runtime estimates for imputation step?
Hi there,
I'm running the impute_runner.py step on a toy data set with just 4 sibpairs (no parents) genotyped on an array with ~450k variants.
Attached is my log file. The "create pedigree" and "prepare_data" functions seem to be executed with no problems, but the impute() function runs for a very long time (eventually timed out after 12 hours). There are a few pandas warnings (e.g. "SettingWithCopyWarning"), but it's not clear if these are actually problematic.
I'm wondering if you have an idea of how long I should expect this to take? And if you think the pandas warnings are a problem, that would be great to know as well.
Thanks for the help!
Cassie
run_snipar_t1dgc_test.log
Hi,
Sorry for the delay, I missed the issue. Although 450,000 SNPs is a lot, it's just four pairs, so it shouldn't take an hour, let alone more than 12 hours.
The first odd thing is that the package thinks it is running the imputation for chromosomes 1 to 26. Could you show me the command you used? If you specified the chromosomes with a wildcard, do any of your data files match the wildcard with the numbers 23, 24, 25, or 26 in their names?
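One quick way to check this is to list which chromosome numbers actually have a matching file on disk. The snippet below is only a rough sketch: the `{chrom}` placeholder, the `data/chr_{chrom}.bed` path, and the helper name are all hypothetical and are not the package's own wildcard syntax, so adapt them to your file names.

```python
from pathlib import Path


def chromosomes_with_files(pattern, chromosomes=range(1, 27)):
    """Return the chromosome numbers for which a matching file exists.

    `pattern` is a hypothetical path template containing a `{chrom}`
    placeholder; it is NOT the wildcard syntax used by the package.
    """
    return [c for c in chromosomes if Path(pattern.format(chrom=c)).exists()]


if __name__ == "__main__":
    # Hypothetical location of the genotype files; replace with your own.
    print(chromosomes_with_files("data/chr_{chrom}.bed"))
```

If 23 to 26 show up in the output, there are files the wildcard could be picking up for those chromosomes.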
I don't think the warnings are related to the issue you are seeing; they appear because the code does not follow one of pandas' best practices. Another thing worth trying is running the imputation with the --start 100 --end 200 options. This restricts the imputation to 100 SNPs, which should finish in seconds, so it might tell us what's going on.
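For context on the pandas warning, here is a generic illustration of the kind of pattern that usually triggers "SettingWithCopyWarning" and two common ways to avoid it. This is not the package's actual code; the DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Made-up example data; the column names are hypothetical.
df = pd.DataFrame({"iid": ["a", "b", "c"], "genotyped": [True, False, True]})

# Chained indexing such as
#     df[df["genotyped"]]["status"] = "ok"
# assigns into a temporary object, so pandas cannot tell whether the
# original frame was meant to be modified. That ambiguity is what the
# warning is about.

# Option 1: take an explicit copy when you want an independent subset.
subset = df[df["genotyped"]].copy()
subset["status"] = "ok"

# Option 2: use a single .loc call when you want to modify the original.
df.loc[df["genotyped"], "status"] = "ok"
```

In both options the intent is explicit, so the result does not depend on whether pandas returned a view or a copy.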