ALICE

Detecting TCR involved in immune responses from single RepSeq datasets.

About

Here we provide an R implementation of ALICE approach, see paper for details.

Software requirements

Any OS where R is available (Linux, OS X, Windows), however parallel computing is currently not available on Windows.

Installation

Install R distribution of choice (i.e. from R Core team, or Microsoft R Open )
Install BioStrings package from bioconductor: open R console and execute following commands:

source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")

Install data.table, stringdist, igraph packages:

install.packages("igraph")
install.packages("data.table")
install.packages("stringdist")

Download this github repository.

Quick start

Lets load some data first and organize it to the list.

This is one VJ combination (TRBV9-TRBJ2-7) from S1 donor from link on day 0 and day 15 after yellow fever immunization.

library(data.table)
S1d15<-fread("sample/S1_d15_V9_J2_7.tsv")
S1d0<-fread("sample/S1_d0_V9_J2_7.tsv")
S1<-list(d0=S1d0,d15=S1d15)

Now lets run ALICE pipeline. Note, that algorithm will create a folder for files produced, if it does not exist:

source("ALICE.R")
S1_alice<-ALICE_pipeline(DTlist=S1,folder="S1_res",cores=1,iter=10,nrec=5e5) #this takes few minutes to run
sapply(S1_alice,nrow)

For this VJ-combination we have no significant results in day 0 timepoint, and 34 significant hits in day 15 timepoint.

Note, that for demo purposes we run it on 1 core with 10 iterations (0.5 mln sequences in each iteration) for generative probability estimation simulation. Total number of simulated TCR sequences (both inframe and out-of-frame) in this case is 5 million. In paper we used 200 mln simulated sequences for each VJ-class, and this takes a lot of time. Datasets from the paper are available here.

Input file format

Algorithm operates on R data.table with following mandatory columns:

CDR3.amino.acid.sequence, bestVGene, bestJGene, Read.count.

See sample datasets in sample folder.