
Structural interpretation of killer-cell immunoglobulin-like receptor (KIR) haplotypes from raw short- or long-read sequences. KPI predicts the presence or absence of 16 KIR genes and then uses those calls to predict pairs of structural (gene-content and order) haplotypes.

Primary language: Groovy. License: GNU General Public License v3.0 (GPL-3.0).

KPI

main.nf makes the predictions.

Dependencies

Install Java, Groovy, Nextflow, Docker, and Git. Create accounts on GitHub and Docker Hub. Add 'docker.enabled = true' and 'docker.fixOwnership = true' to your Nextflow configuration (e.g., $HOME/.nextflow/config). Make sure Docker is running and that you are logged in to Docker Hub.

Running

Input
There are two input options.
1. An ID along with a folder of fasta or fastq files, optionally gzipped. (--raw and --id)
2. A two-column text file, where the first column is an ID and the second column is a path to a fasta or fastq file (--map). Each ID may have multiple rows. The paths to the files may be absolute or relative, but the files must be in the same directory as the map file or under it. If using relative paths, the paths must start with the _parent_ folder of the map file.

Option 1 is more efficient with respect to disk space.
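
As an illustration of the map file described in option 2, here is a minimal Python sketch that writes one for a folder of reads. It is not part of KPI. It makes several assumptions that the text above does not state: the two columns are tab-separated, read files are recognized by the extensions listed in READ_SUFFIXES, and the ID is the file name up to the first '_' or '.'. The function names and the default map name are hypothetical. Absolute paths are written so the relative-path rule does not come into play.

```python
from pathlib import Path

# Extensions treated as raw reads; this list is an assumption for illustration.
READ_SUFFIXES = (".fasta", ".fa", ".fastq", ".fq")


def is_read_file(path):
    """Return True if the file name looks like fasta/fastq, optionally gzipped."""
    name = path.name.lower()
    if name.endswith(".gz"):
        name = name[:-3]
    return name.endswith(READ_SUFFIXES)


def build_map_rows(root):
    """Return (ID, absolute path) rows for every read file under root."""
    root = Path(root).resolve()
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and is_read_file(path):
            # Hypothetical naming rule: the ID is the file name
            # up to the first '_' or '.'.
            sample_id = path.name.replace(".", "_").split("_")[0]
            rows.append((sample_id, str(path)))
    return rows


def write_map_file(root, map_name="idstoRaw.txt"):
    """Write the map file into root, so all reads are in or under its folder."""
    root = Path(root).resolve()
    map_path = root / map_name
    with open(map_path, "w") as out:
        for sample_id, path in build_map_rows(root):
            # Assumed layout: ID, a tab, then the path.
            out.write(f"{sample_id}\t{path}\n")
    return map_path
```

Writing the map file into the top of the input folder keeps every listed file in the same directory as the map file or under it, as required above. An ID with several files simply gets several rows.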

Output
For each input ID, an output text file is created whose name ends in '_prediction.txt'. Each ID's output file contains a header line followed by a second line with the haplotype pair predictions and gene predictions.
The two haplotypes within a pair are separated by '+'. If the prediction is ambiguous, the alternative pairs are separated by '|'. For example,
'cA01~tA01+cA01~tB01|cA01~tA01+cB05~tB01|cA01~tB01+cB05~tA01' means the haplotype pair is
'cA01~tA01 and cA01~tB01' or 'cA01~tA01 and cB05~tB01' or 'cA01~tB01 and cB05~tA01'.
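
The notation above can be unpacked with a few lines of Python. This is a minimal sketch, not part of KPI, and the function names are hypothetical. It handles only the haplotype prediction string itself; locating that string within the output file is left out because the column layout is not described here. It assumes each alternative consists of haplotypes joined by '+'.

```python
def parse_haplotype_prediction(prediction):
    """Split a prediction string into a list of haplotype tuples.

    Alternatives are separated by '|'; haplotypes within one
    alternative are separated by '+'.
    """
    pairs = []
    for alternative in prediction.strip().split("|"):
        pairs.append(tuple(alternative.split("+")))
    return pairs


def is_ambiguous(prediction):
    """A prediction is ambiguous when it lists more than one alternative."""
    return len(parse_haplotype_prediction(prediction)) > 1


example = "cA01~tA01+cA01~tB01|cA01~tA01+cB05~tB01|cA01~tB01+cB05~tA01"
for first, second in parse_haplotype_prediction(example):
    print(f"{first} and {second}")
```

For the example string, the loop prints the three alternative pairs listed above, one per line.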

The reference haplotypes are defined at https://github.com/droeatumn/kpi/blob/master/input/haps.txt

Running
Use '--raw' to specify the input directory and '--output' to specify the directory for the output. The defaults are 'raw' and 'output' under the location where KPI was pulled.

Option 1: Provide an ID (--id) and a folder (--raw) with its raw data
./main.nf --id ID --raw inDir --output outDir
e.g., ./main.nf --id id1 --raw ~/input --output ~/output

Option 2: Provide a file with a map (--map) from IDs to their raw data
./main.nf --map mapFile.txt --output outDir
e.g., ./main.nf --map ~/input/idstoRaw.txt --output ~/output
In this example, the paths to the files in idstoRaw.txt are somewhere under ~/input/.

The following examples use data included in the image, so no additional input is required.
Example 1: cA01~tA01+cB01~tB01 with --raw.
Run the following command for an example of interpreting synthetic reads created from the sequences with GenBank IDs KP420439 and KP420440 (https://www.ncbi.nlm.nih.gov/nuccore/KP420439 and https://www.ncbi.nlm.nih.gov/nuccore/KP420440). These two haplotypes contain all the genes, so the haplotype predictions are very ambiguous.

./main.nf --id ex1 --raw ~/git/kpi/input/example1 --output ~/output

To run another example, replace 'example1' with 'example2'.

Example 2: cA01~tA01+cA01~tB01 with --map and --id.
Run the following command for an example of interpreting synthetic reads created from the sequences with GenBank IDs KP420439 and KU645197 (https://www.ncbi.nlm.nih.gov/nuccore/KP420439 and https://www.ncbi.nlm.nih.gov/nuccore/KU645197).

./main.nf --id ex2 --map ~/git/kpi/input/example2/example2.txt --output ~/output

To run another example, replace 'example2' with 'example1'.

Example 3: combine Example 1 and 2 with --map and --id.
./main.nf --id ex12 --map ~/git/kpi/input/example1-2.txt --output ~/output

Miscellaneous
Hardware
For targeted sequencing, kpi requires at least 20G of RAM in total and 1G of temporary disk space per ID. For WGS, it requires 30G of RAM in total and 15G of temporary disk space. It scales with the number of CPUs available; 6-12 CPUs is generally the most efficient range for WGS.

Raw data
The software assumes average coverage for both chromosomes is less than 255. If this is not the case for your data, please downsample before running. Support for high coverage data is a future enhancement.
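
One way to downsample is to keep a random fraction of the reads. The Python sketch below is an illustration only and is not part of KPI. It assumes that you already have your own estimate of the average coverage, that the reads are in FASTQ format with exactly four lines per record, and that a target of 200 (a hypothetical value chosen to sit below the 255 limit) is acceptable. The function names are also hypothetical.

```python
import random

MAX_COVERAGE = 255  # limit stated above


def keep_fraction(estimated_coverage, target_coverage=200):
    """Fraction of reads to keep so coverage drops to about target_coverage.

    target_coverage=200 is a hypothetical default below MAX_COVERAGE.
    """
    if estimated_coverage <= 0:
        raise ValueError("estimated_coverage must be positive")
    return min(1.0, target_coverage / estimated_coverage)


def downsample_fastq(lines, fraction, seed=0):
    """Yield a random subset of 4-line FASTQ records from an iterable of lines."""
    rng = random.Random(seed)
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            # One random draw per record decides whether to keep it.
            if rng.random() < fraction:
                yield from record
            record = []
```

Because exactly one random draw is made per record, running the function with the same seed on two paired-end files keeps the same mates, provided both files list their reads in the same order.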

Containers
To run without a container, use the --nocontainer parameter. To use a container other than the default (droeatumn/kpi:latest), use the --container parameter.

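To run in a self-contained environment with the --id parameter, use the following command, replacing 'inDir' and 'outDir' with your input and output directories and supplying your ID after --id.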
docker run --rm -it -v inDir:/opt/kpi/raw/ -v outDir:/opt/kpi/output/ droeatumn/kpi:latest /opt/kpi/main.nf --id

Reference
The preprint "Accurate and Efficient KIR Gene and Haplotype Inference from Genome Sequencing Reads with Novel K-mer Signatures" is available at https://www.biorxiv.org/content/10.1101/541938v2.