Split `prepare_dataset.ipynb` into separate notebooks
Closed this issue · 1 comments
szhan commented
Right now, this one notebook does the following:
- Download unified genealogies.
- Simplify the trees down to only high-coverage individuals.
- Split the individuals into reference panel and target cohort (one set of trees per group).
- Prepare data objects and files (VCFs and samples) for imputation.
- Impute using BEAGLE.
- Impute using `tskit.lshmm`.
It would be easier to split these steps into the following stages, one per notebook:
- Steps 1 to 3.
- Step 4. This involves writing to VCF and making samples compatible, but it should soon be accelerated using `sgkit`.
- Step 5.
- Step 6.
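
For illustration only, the split in step 3 could look roughly like the sketch below. It is not the code from `prepare_dataset.ipynb`: the helper name `split_individuals`, the `target_fraction` and `seed` parameters, and the integer IDs are all hypothetical. Fixing the seed keeps the split reproducible, which matters once the stages live in separate notebooks.

```python
import random


def split_individuals(individual_ids, target_fraction=0.1, seed=42):
    """Hypothetical helper: partition individual IDs into
    (reference_panel, target_cohort) with a reproducible shuffle."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    ids = sorted(individual_ids)  # sort first so the result ignores input order
    if len(ids) < 2:
        raise ValueError("need at least two individuals to split")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(ids)
    # Keep at least one individual in each group.
    n_target = min(len(ids) - 1, max(1, round(len(ids) * target_fraction)))
    target_cohort = sorted(ids[:n_target])
    reference_panel = sorted(ids[n_target:])
    return reference_panel, target_cohort


# Assumed example input: 100 high-coverage individuals labelled 0..99.
ref_panel, target_cohort = split_individuals(range(100), target_fraction=0.1)
```

Each ID list could then be mapped to sample nodes and passed to something like `TreeSequence.simplify(samples=...)` in `tskit` to get one set of trees per group, assuming that is how the notebook builds them.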
szhan commented
Moving this to https://github.com/szhan/onekg_analysis