Ripkelab/ricopili

Imputation crashes if genomic chunks are not quite empty

rkwalters opened this issue · 3 comments

When dividing data into the ~950 genomic chunks for imputation, it is possible to end up with chunks that contain very few genotyped markers (i.e. <5), particularly near centromeres and at the ends of chromosomes. As long as at least 1 genotyped SNP is in the chunk, ricopili will try to phase and impute. During pre-phasing, if there are any individuals that are missing for all SNPs in a chunk, shapeit will exit with an error. When there are very few SNPs in the chunk it is likely that at least one individual will be completely missing and phasing will fail. Ricopili will eventually error out once it realizes it is stuck trying to phase those chunks.

Current workaround:

  • Identify failing chunks from the job list of the last phasing attempt (e.g. ./pi_sub/preph_mu1.job_list.s) and exclude them from imputation with --refiex flag. The cause of the error can be verified in shapeit log file ./pi_sub/haps_*/plink.*.chrX_XXX_XXX.shape.log for relevant chunk.

Possible solutions:

  • set minimum number of SNPs in chunk to count as not empty for imputation (maybe 10?)
  • directly check that no IDs are missing for all SNPs in chunk

This behavior is currently not being observed on computerome or genomeDK environments (which complete imputation silently without the failed chunk), but still occurs on Broad. Unclear whether the difference is due to cluster environment, third-party software versions, or some other difference.

The result on computerome/genomeDK is the expected behavior, provided by lines 2233-2248 of impute_dirsub_57.

Running on Broad (which has $qloc=qsub_b), this code block checking for the "fully missing" error from shapeit is unreachable. Since qsub_b disallows the --multi arg for prephasing, there will never be logs to detect that jobs preph_mu1 were run (see line 2186), so the error catching that is attempted when $multi_sw==2 to mark affected chunks as empty never happens.

I've added branch broad_empty_shapeit with a proposed fix. It allows qsub_b to set $multithread1=1 to use as a dummy arg in preph_mu1, and to provide full error text if there are still jobs for preph_mu2 on qsub_b (after catching "fully missing") which would be pointless without multithreading.

Fixed in #34 and merged to master in #43.