Imputation crashes if genomic chunks are not quite empty
rkwalters opened this issue · 3 comments
When dividing data into the ~950 genomic chunks for imputation, it is possible to end up with chunks that contain very few genotyped markers (i.e. <5), particularly near centromeres and at the ends of chromosomes. As long as at least 1 genotyped SNP is in the chunk, ricopili will try to phase and impute. During pre-phasing, if there are any individuals that are missing for all SNPs in a chunk, shapeit will exit with an error. When there are very few SNPs in the chunk it is likely that at least one individual will be completely missing and phasing will fail. Ricopili will eventually error out once it realizes it is stuck trying to phase those chunks.
Current workaround:
- Identify failing chunks from the job list of the last phasing attempt (e.g.
./pi_sub/preph_mu1.job_list.s
) and exclude them from imputation with--refiex
flag. The cause of the error can be verified in shapeit log file./pi_sub/haps_*/plink.*.chrX_XXX_XXX.shape.log
for relevant chunk.
Possible solutions:
- set minimum number of SNPs in chunk to count as not empty for imputation (maybe 10?)
- directly check that no IDs are missing for all SNPs in chunk
This behavior is currently not being observed on computerome or genomeDK environments (which complete imputation silently without the failed chunk), but still occurs on Broad. Unclear whether the difference is due to cluster environment, third-party software versions, or some other difference.
The result on computerome/genomeDK is the expected behavior, provided by lines 2233-2248 of impute_dirsub_57
.
Running on Broad (which has $qloc=qsub_b
), this code block checking for the "fully missing" error from shapeit is unreachable. Since qsub_b
disallows the --multi
arg for prephasing, there will never be logs to detect that jobs preph_mu1
were run (see line 2186), so the error catching that is attempted when $multi_sw==2
to mark affected chunks as empty never happens.
I've added branch broad_empty_shapeit
with a proposed fix. It allows qsub_b
to set $multithread1=1
to use as a dummy arg in preph_mu1
, and to provide full error text if there are still jobs for preph_mu2
on qsub_b
(after catching "fully missing") which would be pointless without multithreading.