neurogenomics/MAGMA_Celltyping

Parallelise MAGMA steps

Opened this issue · 1 comments

A parallelised version of MAGMA was recently described here:
https://star-protocols.cell.com/protocols/1392

Notes on parallelisation

The following tests were conducted on the Neurogenomics private cloud, within a virtual machine instance with 252GB of RAM, (8?)TB of storage, and 64 CPU cores (128 threads).

MungeSumstats::import_sumstats

Error description

Documented here: https://github.com/neurogenomics/MungeSumstats/issues/113

BiocParallel errors
  0 remote errors, element index: 
  50 unevaluated and other errors
  first remote error:
  • No sumstats produced.
  • Also recorded messages from the wrong/next id!
  • Seems to occur with smaller files, <2M variants.
  • VCF and tabix index download fine (checked file sizes against manual download from OpenGWAS).
  • Only using ~half of available memory at most, so doesn't seem to be crashing due to that.
  • Seems to trace back to read_vcf_parallel step.
  • Restarting the Docker container and/or decreasing the number of threads to 30 seems to resolve this.
    • This suggests that perhaps the issue is breaking the VCFs into too small of chunks when the VCF is already small leads to situations where some chunks are empty, causing an error when being merged with all the other chunks.

Current solution

  • Limit the number of parallel threads to <=30.

Reprex

subcategories3 <- c("neurological","Immune","cardio")
metagwas3 <- MungeSumstats::find_sumstats(subcategories = subcategories3)
meta <- filter_traits(meta = metagwas3,  
                           group_var = "subcategory",
                           topn = 100) 

gwas_paths <- MungeSumstats::import_sumstats(
  ids = meta$id, 
  save_dir = here::here("data/GWAS_sumstats"), 
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE, 
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"),
  ### axel will keep trying forever if the URL doesn't exist (or is private)
  # download_method = "axel",
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE,
  ) 

MAGMA.Celltyping::map_snps_to_genes

Error description

Parallelising map_snps_to_genes across too many threads actually seems to slow everything down, despite having sufficient memory.

This might be because:

  1. They are all calling to the same underlying MAGMA C code.
  2. They are reserving more memory in C than the computer thinks is available. (seems likely)
  3. They are all reading from the same reference files (e.g. 1KG).
  4. MAGMA is somehow already parallelising processes internally.
  5. Docker is launching some processes that don't get killed when restarting the container.
  6. Intermediate MAGMA files are getting repeatedly appended and causing problems in subsequent reruns.
  7. genes_only=TRUE? but this should make things faster, not slower
    • Confirmed that setting genes_only=TRUE seems to speed things up (thought unsure how significantly).
  8. Newer versions of MAGMA(v1.10) has bugs that causes everything to slow down.
    • MAGMAv1.08 might be a bit faster, but hard to tell bc uses diff files each time.

Current solution

Limit the number of parallel threads to <=20. This results in low memory usage (~11/252GB at any given time, indicated by the green bar in htop) but, perhaps importantly, ensures that the amount of memory being reserved only reaches ~2/3 of the max (the yellow bar in htop).

Screenshot 2022-08-10 at 11 45 51

Reprex

source("https://github.com/neurogenomics/MAGMA_Files_Public/raw/master/code/utils.R")

save_dir <- here::here("data/GWAS_sumstats")
meta <- gather_metadata(save_dir = save_dir,
                        N_dict=c("Wightman2021"=1126563, 
                                 "Vuckovic2020"=408112))
data.table::setkey(meta,id)

t1 <- Sys.time()
magma_files <- parallel::mclapply(seq_len(nrow(meta)),
                                  function(i){ 
  EWCE:::message_parallel("----- ",i," : ",
                                      meta$id[i]," -----") 
     tryCatch(expr = { 
       MAGMA.Celltyping::map_snps_to_genes(
         # version = "1.08",
         path_formatted = meta$munged_path[i],
         genome_build = meta$build_final[i],
         N = if(is.na(meta$N[i])) NULL else meta$N[i],
         population = meta$population_1KG[i],
         upstream_kb = 35,  
         downstream_kb = 10,
         genes_only = TRUE, 
         force_new = FALSE
         )
     }, error = function(e) {EWCE:::message_parallel(e);NULL}) 
}, mc.cores = min(nrow(meta),20) ) |> `names<-`(meta$id) 
t2 <- Sys.time()
print(t2-t1)