Probabilistic HLA typing

Primary language: OCaml. License: Apache License 2.0.

Paper: Prohlatype: A Probabilistic Framework for HLA Typing (see note 1)

This project provides a set of tools to calculate the full posterior distribution of HLA types given read data.

Instead of a single point estimate such as:

     A1       A2       B1       B2       C1       C2       Reads  Objective
0    A*31:01  A*02:01  B*45:01  B*15:03  C*16:01  C*02:10  538.0  513.79

one can calculate:

Allele 1       Allele 2       Log P      P
A*02:05:01:01  A*30:114       -23046.81  0.5000
A*02:05:01:01  A*30:01:01     -23046.81  0.5000
A*02:05:01:01  A*30:106       -23103.15  0.0000
A*02:05:01:02  A*30:114       -23146.35  0.0000
...
B*07:36        B*57:03:01:02  -13717.33  0.5000
B*07:36        B*57:03:01:01  -13717.33  0.5000
B*07:36        B*57:03:03     -13804.74  0.0000
B*27:157       B*57:03:01:02  -13816.17  0.0000
...
C*06:103       C*18:10        -11936.35  0.3338
C*06:103       C*18:02        -11936.36  0.3331
C*06:103       C*18:01        -11936.36  0.3331
C*15:102       C*18:02        -11951.72  0.0000
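
To see how the P column could relate to the Log P column, here is a minimal sketch. It assumes (the README does not state this) that P is the likelihood of each allele pair normalized over the candidate pairs of one locus under a flat prior. The helper name `normalize_logp` is hypothetical and not part of prohlatype. Subtracting the maximum before exponentiating (the log-sum-exp trick) keeps values such as exp(-23046.81) from underflowing to zero.

```shell
#!/bin/sh
# Hypothetical helper (not part of prohlatype): read one log-likelihood per
# line on stdin and print the normalized probabilities, one per line.
normalize_logp() {
  awk '
    # Remember every value and track the maximum.
    { lp[NR] = $1; if (NR == 1 || $1 > max) max = $1 }
    END {
      # Shift by the maximum so the largest term is exp(0) = 1.
      for (i = 1; i <= NR; i++) sum += exp(lp[i] - max)
      for (i = 1; i <= NR; i++) printf "%.4f\n", exp(lp[i] - max) / sum
    }'
}

# The four listed rows of the A locus; the elided rows are ignored here.
printf '%s\n' -23046.81 -23046.81 -23103.15 -23146.35 | normalize_logp
# prints:
# 0.5000
# 0.5000
# 0.0000
# 0.0000
```

The displayed Log P values are rounded to two decimals, so for near-ties such as the C locus rows this sketch will only approximate the P values shown above.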

How:

There are three options to obtain the software:

  1. If you are running on Linux, standalone binaries are available with each release.

  2. Use the linked Docker image.

  3. Build the software from source:

    a. Install opam.

    b. Make sure that the opam packages are up to date:

     $ opam update
    

    c. Make sure that you're on the relevant compiler:

     $ opam switch 4.06.0
     $ eval `opam config env`
    

    d. Get source:

     $ git clone https://github.com/hammerlab/prohlatype.git prohlatype
     $ cd prohlatype
    

    e. Install the dependent packages:

     $ make setup
    

    f. Build the programs (afterwards they'll be in _build/default/src/apps):

     $ make
    

Make sure that you have IMGT/HLA available:

$ git clone https://github.com/ANHIG/IMGTHLA.git imgthla

"Prohla"-typing:

  1. Create an imputed HLA reference sequence via align2fasta. This step makes sure that all alleles have sequence information that spans the entire locus. This way, reads that originate from a region for which we normally do not have sequence information will still align (in the next filtering step), albeit poorly:

     $ align2fasta path-to-imgthla/alignments -o imputed_hla_class_I
    

    This step needs to be performed only once per IMGT version. Run $ align2fasta --help for further information.

  2. Filter your data against the reference by first aligning it. For example:

     $ bwa mem imputed_hla_class_I.fasta ${SAMPLE}.fastq | \
         samtools view -F 4 -bT imputed_hla_class_I.fasta -o ${SAMPLE}.bam
    

    While the algorithms here are fundamentally alignment-based, they're too slow to run on all sequences, and sequences that do not originate from the HLA region would just act as background noise.

  3. Then convert the aligned reads back to FASTQ:

     $ samtools fastq ${SAMPLE}.bam > ${SAMPLE}_filtered.fastq
    
  4. Infer types (see $ multi_par --help for further details):

     $ multi_par path-to-imgthla/alignments ${SAMPLE}_filtered.fastq -o ${SAMPLE}_output.tsv
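
Steps 2 to 4 can be chained per sample. The following is a minimal sketch, not a script shipped with the project: the function name `filter_and_type` and its argument order are hypothetical. It also adds two things the steps above leave implicit: `bwa mem` needs a reference indexed with `bwa index` (built here only if the `.bwt` file is missing), and `samtools view` is given an explicit `-` so that it reads the alignments from standard input.

```shell
#!/bin/sh
# Hypothetical wrapper around steps 2-4 (not part of prohlatype).
# Usage: filter_and_type SAMPLE REFERENCE_FASTA IMGT_ALIGNMENTS_DIR
# Expects SAMPLE.fastq in the current directory and bwa, samtools and
# multi_par on the PATH.
filter_and_type() {
  sample=$1; ref=$2; alignments=$3

  # bwa mem needs an index; build it once if it is missing.
  if [ ! -f "$ref.bwt" ]; then
    bwa index "$ref" || return 1
  fi

  # Step 2: align, keeping only mapped reads (-F 4 drops unmapped ones).
  bwa mem "$ref" "$sample.fastq" |
    samtools view -F 4 -b -T "$ref" -o "$sample.bam" - || return 1

  # Step 3: convert the retained alignments back to FASTQ.
  samtools fastq "$sample.bam" > "${sample}_filtered.fastq" || return 1

  # Step 4: infer the HLA types from the filtered reads.
  multi_par "$alignments" "${sample}_filtered.fastq" -o "${sample}_output.tsv"
}
```

For example, `filter_and_type NA12878 imputed_hla_class_I.fasta path-to-imgthla/alignments` would read `NA12878.fastq` and write `NA12878_output.tsv`; the sample name here is only a placeholder.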
    

Note: The script src/scripts/run-example-docker.sh provides an end-to-end example of the above. It depends only on docker, wget, and git; it fetches the data and runs everything in a docker container (see sh src/scripts/run-example-docker.sh help).

1: All versions of this software after 0.8.0 incorporate an important coverage likelihood that is not described in the paper. A short addendum describing the approach is currently in limbo; please contact me by email for a reference.