szhan/tsimpute

Split `prepare_dataset.ipynb` into separate notebooks

Closed this issue · 1 comments

szhan commented

Right now, this one notebook does the following:

  1. Download unified genealogies.
  2. Simplify the trees down to only high-coverage individuals.
  3. Split the individuals into reference panel and target cohort (one set of trees per group).
  4. Prepare data objects and files (VCFs and samples) for imputation.
  5. Impute using BEAGLE.
  6. Impute using tskit.lshmm.
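
As a rough illustration of step 3, here is a minimal sketch of a reproducible split of individual IDs into a reference panel and a target cohort. The function name `split_individuals`, the `ref_fraction` and `seed` parameters, and the 75/25 split are all hypothetical and not taken from the notebook. It uses only the standard library and does not touch the trees themselves.

```python
import random


def split_individuals(individual_ids, ref_fraction=0.9, seed=1234):
    """Randomly partition individual IDs into (reference panel, target cohort).

    Hypothetical helper for illustration only; the notebook may split differently.
    """
    if not 0.0 < ref_fraction < 1.0:
        raise ValueError("ref_fraction must be strictly between 0 and 1")
    ids = sorted(individual_ids)
    if len(ids) < 2:
        raise ValueError("need at least two individuals to split")
    # A dedicated, seeded generator keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_ref = round(len(ids) * ref_fraction)
    # Keep at least one individual in each group.
    n_ref = min(max(n_ref, 1), len(ids) - 1)
    return sorted(ids[:n_ref]), sorted(ids[n_ref:])


# Example: 20 placeholder individual IDs, 75% assigned to the reference panel.
ref_panel, target = split_individuals(range(20), ref_fraction=0.75)
print(len(ref_panel), len(target))
```

The sample node IDs belonging to each group could then be passed to `TreeSequence.simplify(samples=...)` in tskit to get one set of trees per group, assuming that is how the notebook derives them.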

It would be easier to divide these steps into the following stages, one per notebook:

  1. Steps 1 to 3.
  2. Step 4. This involves writing to VCF and making samples compatible, but this should soon be accelerated using sgkit.
  3. Step 5.
  4. Step 6.