DEEP*HLA

DEEP*HLA is an HLA allelic imputation method based on a multi-task convolutional neural network implemented in Python.

DEEP*HLA receives pre-phased SNV data and outputs genotype dosages of binary HLA alleles.

In DEEP*HLA, HLA imputation is performed in two processes:

  (1) model training with an HLA reference panel

  (2) imputation with the trained model.

Publication/Citation

The study of DEEP*HLA is described in the following manuscript.

  • Naito, T. et al. A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes. Nat. Commun. 12, 1639 (2021). https://doi.org/10.1038/s41467-021-21975-x

Please cite this paper if you use DEEP*HLA or any material in this repository.

Requirements

  • Python 3.x (3.7.4)
  • PyTorch (1.4.0)
  • NumPy (1.17.2)
  • pandas (0.25.1)
  • SciPy (1.3.1)
  • argparse (1.4.0)

DEEP*HLA was tested with the versions in parentheses; we do not guarantee that it works with other versions.
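
If needed, the tested versions can be installed with pip, for example (exact package availability depends on your platform and Python version):

$ pip install torch==1.4.0 numpy==1.17.2 pandas==0.25.1 scipy==1.3.1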

Installation

Just clone this repository as follows.

git clone https://github.com/tatsuhikonaito/DEEP-HLA
cd ./DEEP-HLA

Usage

0. Original file formats

Two original-format files, a model configuration file and an HLA information file, are needed to run DEEP*HLA.

  • {MODEL}.model.json

    A description of the model configuration, including the grouping of HLA genes, the SNV window size (kb), and the neural-network parameters. The gene names must be consistent with the reference data.

    {
    	"group1": {
    		"HLA": ["HLA_F", "HLA_G", ...],
    		"w": 500,
    		"conv1_num_filter": 128,
    		"conv2_num_filter": 64, 
    		"conv1_kernel_size": 64, 
    		"conv2_kernel_size": 64, 
    		"fc_len": 256
    	},
    	"group2": {
    		"HLA": ["HLA_C", "HLA_B", ...],
    		"w": 500,
    		"conv1_num_filter": 128,
    		"conv2_num_filter": 64, 
    		"conv1_kernel_size": 64, 
    		"conv2_kernel_size": 64, 
    		"fc_len": 256
    	},	
    	...
    }
    
  • {HLA}.hla.json

    A description of the HLA genes in the reference data, including the gene names, positions, and HLA allele names at each resolution. These must be consistent with the reference data.

    {
    	"HLA_F": {
    		"pos": 29698429,
    		"2-digit": ["HLA_F_01", ...],
    		"4-digit": ["HLA_F_01:01", "HLA_F_01:03", ...]
    	},
    	"HLA_G": {
    		"pos": 29796823,
    		"2-digit": ["HLA_G_01", ...],
    		"4-digit": ["HLA_G_01:01", "HLA_G_01:03", ...]
    	},
    	...
    }
    

    An HLA information file can be generated from a REFERENCE.bim file using make_hlainfo.py as follows.

    $ python make_hlainfo.py --ref REFERENCE (.bim)
    
    Arguments and options

    Option name   Description                                                     Required   Default
    --ref         HLA reference data (.bim format).                               Yes        None
    --max-digit   Maximum resolution of alleles typed in the HLA reference        No         "4-digit"
                  data ("2-digit", "4-digit", or "6-digit").
    --output      Output filename of the HLA information JSON file.               No         {BASE_DIR}/{REFERENCE}.hla.json
    Outputs
    • {REFERENCE}.hla.json

      Generated HLA information file.
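
Conceptually, the conversion performed by make_hlainfo.py resembles the sketch below. This is a simplified illustration, not the actual implementation; it assumes the HLA allele markers in the .bim file are named like HLA_A_01 (2-digit) and HLA_A_01:01 (4-digit), as in the example above.

import json
from collections import defaultdict

def bim_to_hlainfo(bim_path):
    # Collect HLA_* allele markers from a PLINK .bim file
    # (columns: chromosome, marker ID, cM, position, A1, A2).
    hla = defaultdict(lambda: {"pos": None, "2-digit": [], "4-digit": []})
    with open(bim_path) as f:
        for line in f:
            _, marker, _, pos, *_ = line.split()
            if not marker.startswith("HLA_"):
                continue
            gene, _, allele = marker.rpartition("_")  # "HLA_A_01:01" -> ("HLA_A", "01:01")
            digit = "{}-digit".format(2 * (allele.count(":") + 1))
            if digit in hla[gene]:
                hla[gene][digit].append(marker)
                hla[gene]["pos"] = int(pos)
    return dict(hla)

with open("REFERENCE.hla.json", "w") as f:
    json.dump(bim_to_hlainfo("REFERENCE.bim"), f, indent=2)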

1. Model training

Run train.py from the command line as follows.

Sample files should contain only the MHC region extracted for HLA imputation (typically chr6:29-34 Mb, or more broadly 24-36 Mb). In addition, the strands must be consistent between the sample and reference data.
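
If your sample data cover a wider region, the MHC region can be extracted beforehand, for example with PLINK (a generic example; adjust file names to your data):

$ plink --bfile SAMPLE --chr 6 --from-mb 24 --to-mb 36 --make-bed --out SAMPLE_MHC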

HLA reference data are currently supported only in Beagle-phased format.

$ python train.py --ref REFERENCE (.bgl.phased/.bim) --sample SAMPLE (.bim) --model MODEL (.model.json) --hla HLA (.hla.json) --model-dir MODEL_DIR
Arguments and options

Option name   Description                                                                     Required   Default
--ref         HLA reference data (.bgl.phased and .bim format).                               Yes        None
--sample      Sample SNP data of the MHC region (.bim format).                                Yes        None
--model       Model configuration (.model.json format).                                       Yes        None
--hla         HLA information of the reference data (.hla.json format).                       Yes        None
--model-dir   Directory for saving trained models.                                            No         {BASE_DIR}/model
--num-epoch   Number of epochs to train.                                                      No         100
--patience    Patience for early stopping. To disable early stopping, set this                No         16
              to the same value as --num-epoch.
--val-split   Fraction of the data split off for validation.                                  No         0.1
--max-digit   Maximum resolution of alleles to impute ("2-digit", "4-digit",                  No         "4-digit"
              or "6-digit").
Outputs
  • {MODEL_DIR}/{group}_{digit}_{hla}.pickle

    Trained models.

  • {MODEL_DIR}/model.bim

    SNP information used in training and in the subsequent imputation process.

  • {MODEL_DIR}/best_val.txt

    Validation accuracies of the trained models.
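
The options --num-epoch, --patience, and --val-split interact in the usual early-stopping pattern: training stops once the validation loss has not improved for --patience consecutive epochs, and the best model is kept. The sketch below illustrates this pattern only; it is not DEEP*HLA's actual training code, and train_one_epoch and validate are hypothetical helpers (validate evaluates on the held-out --val-split fraction).

import copy

def train_with_early_stopping(model, train_one_epoch, validate,
                              num_epoch=100, patience=16):
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(num_epoch):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())  # remember the best model
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for `patience` epochs
                break
    model.load_state_dict(best_state)
    return model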

2. Imputation

After you have finished training a model, run impute.py as follows.

Phased sample data are supported in Beagle-phased format and Oxford haps format (SHAPEIT, Eagle, etc.).

$ python impute.py --sample SAMPLE (.bgl.phased (.haps)/.bim/.fam) --model MODEL (.model.json) --hla HLA (.hla.json) --model-dir MODEL_DIR --out OUT
Arguments and options

Option name    Description                                                                    Required   Default
--sample       Sample SNP data of the MHC region (.bgl.phased or .haps, .bim,                 Yes        None
               and .fam format).
--phased-type  File format of the phased sample file ("bgl" or "haps").                       No         "bgl"
--model        Model configuration (.model.json and .bim format).                             Yes        None
--hla          HLA information of the reference data (.hla.json format).                      Yes        None
--model-dir    Directory where trained models are saved.                                      No         {BASE_DIR}/model
--out          Prefix of output files.                                                        Yes        None
--max-digit    Maximum resolution of alleles to impute ("2-digit", "4-digit",                 No         "4-digit"
               or "6-digit").
--mc-dropout   Whether to estimate uncertainty by Monte Carlo dropout (True or False).        No         False
Outputs
  • {OUT}.deephla.phased

    Imputed allele phased data (best-guess genotypes).

    Rows are markers and columns are individuals.

    First column is the marker name; subsequent columns are genotypes, two columns per individual.

  • {OUT}.deephla.dosage

    Imputed allele dosage data.

    First, second, and third columns are the marker name, allele1 ("P"), and allele2 ("A"); subsequent columns are dosages, one column per individual.

    Rows are markers and columns are individuals.

  • {OUT}.deephla.entropy (optional)

    Uncertainty, measured as the entropy of the sampling variation in Monte Carlo dropout.

    First column is the marker name; subsequent columns are entropies, one column per individual.
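
For reference, Monte Carlo dropout uncertainty of this kind is typically obtained by keeping dropout active at inference time, running repeated stochastic forward passes, and taking the entropy of the averaged prediction. The sketch below illustrates the idea for a PyTorch classifier net with dropout layers; it is not DEEP*HLA's exact code.

import torch

def mc_dropout_entropy(net, x, n_samples=100):
    net.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(net(x), dim=1)
                             for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)  # average over the MC samples
    # Predictive entropy per input
    return -(mean_p * torch.log(mean_p + 1e-12)).sum(dim=1)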

Example

Here, we demonstrate practical usage with the Pan-Asian reference panel as an example.

The trained models have already been stored in Pan-Asian/model, so you can skip the model training process.

0. Data preparation

First, download the Pan-Asian reference panel data and the example data from the SNP2HLA download site.

Perform pre-phasing of the example data with any phasing software (SHAPEIT, Eagle, Beagle, etc.) to generate a 1958BC.haps (or .bgl.phased) file.

Put them into the Pan-Asian directory.

DEEP-HLA/
 └ Pan-Asian/
   ├ Pan-Asian_REF.bgl.phased
   ├ Pan-Asian_REF.bim
   ├ Pan-Asian_REF.model.json
   ├ Pan-Asian_REF.hla.json
   ├ 1958BC.haps (or .bgl.phased)
   ├ 1958BC.bim
   ├ 1958BC.fam
   └ model/

1. Model training

We have already uploaded a trained model, so you can skip this step.

Otherwise, run train.py as follows. The files in the Pan-Asian/model directory will be overwritten.

$ python train.py --ref Pan-Asian/Pan-Asian_REF --sample Pan-Asian/1958BC --model Pan-Asian/Pan-Asian_REF --hla Pan-Asian/Pan-Asian_REF --model-dir Pan-Asian/model

2. Imputation

Run impute.py as follows.

$ python impute.py --sample Pan-Asian/1958BC --phased-type haps --model Pan-Asian/Pan-Asian_REF --hla Pan-Asian/Pan-Asian_REF --model-dir Pan-Asian/model --out Pan-Asian/1958BC
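
The resulting dosage file can then be inspected, for example with pandas (a minimal sketch assuming the whitespace-delimited layout described in the Imputation section):

import pandas as pd

# Columns: marker, allele1 ("P"), allele2 ("A"), then one dosage per individual.
dosage = pd.read_csv("Pan-Asian/1958BC.deephla.dosage", sep=r"\s+", header=None)
dosage = dosage.set_index(0)
print(dosage.iloc[:5, 2:])  # dosages of the first five alleles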

3. Imputation of amino acid polymorphisms

HLA amino acid polymorphisms can be imputed from the imputed allele dosages using an amino acid conversion table (.aa_table.pickle). Run impute_aa.py as follows.

$ python impute_aa.py --dosage Pan-Asian/1958BC --aa-table Pan-Asian/Pan-Asian_REF --out Pan-Asian/1958BC

4. Other HLA reference panels

Please follow the application process to obtain the two reference panels used in our study.

  • Our Japanese HLA data have been deposited at the National Bioscience Database Center (NBDC) Human Database (research ID: hum0114).
  • The T1DGC HLA reference panel can be downloaded from the NIDDK Central Repository upon request.

The files required for imputation with these panels (.model.json, .hla.json, and .aa_table.pickle) may be provided upon request.

Reference

DEEP*HLA uses MGDA-UB (Multiple Gradient Descent Algorithm - Upper Bound) for multi-task learning; the corresponding part of the source code is adapted from MultiObjectiveOptimization.

Contact

For any questions, please contact Tatsuhiko Naito (tnaito@sg.med.osaka-u.ac.jp).

One advantage of DEEP*HLA is that model training can be performed elsewhere, even without access to sample genotypes. We may consider tailoring a DEEP*HLA model that fits your SNP data using our own or publicly available reference panels. Please feel free to contact us by email if you are interested.