VDJmatch is a command-line tool designed for matching T-cell receptor (TCR) repertoires against a database of TCR sequences with known antigen specificity. VDJmatch implements an API for interacting and querying the VDJdb database, and serves as a backend for VDJdb web browser. VDJmatch will automatically download and use the latest version of the VDJdb database, however, it is also possible to use a custom database provided by user if it matches VDJdb format specification.
VDJmatch accepts TCR clonotype table(s) as an input and relies on VDJtools framework to parse the output of commonly used immune repertoire sequencing (RepSeq) processing tools. See format section of VDJtools docs for the list of supported formats. Note that VDJmatch can be used with metadata semantics introduced by VDJtools to facilitate running annotation for multi-sample datasets.
VDJdb is distributed as an executable JAR that can be downloaded from the releases section, the software is cross-platform and requires Java v1.8 or higher to run.
To run the executable JAR use the java -jar path/to/vdjmatch-version.jar match [options]
command as described below. Running without any [options]
or with -h
option will display the help message.
The latest version of VDJdb will be downloaded the first time you run VDJmatch. Note that in order to update to the most recent version next time, you will need to run java -jar path/to/vdjmatch-version.jar Update
command.
The following syntax should be used to run VDJmatch for RepSeq sample(s)
java -Xmx4G -jar path/to/vdjmatch-version.jar match \
[options] [sample1 sample2 sample3 ... if -m is not specified] output_prefix
First part of the command runs the JAR file and sets the memory limit to 4GB (should be increased in case JVM drops with heap size exception) and points to VDJmatch executable JAR (version
should be replaced with the software version). The second part includes options, input samples and the prefix of output files.
Option name | Argument example | Description |
---|---|---|
‑h |
Display help message | |
‑m , ‑‑metadata |
/path/to/metadata.txt |
A metadata file, holding paths to samples and user-provided information. |
‑‑software |
MITCR ,MIGEC ,etc |
Input RepSeq data format, see formats supported for conversion. By default expects input in VDJtools format. |
‑c , ‑‑compress |
Compress sample-level summary output with GZIP. |
If -m
option is specified the list of sample file names should be omitted and the list of options should be followed by output_prefix
.
Option name | Argument example | Description |
---|---|---|
‑S , ‑‑species |
human ,mouse ,etc |
(Required) Species name. All samples should belong to the same species, only one species is allowed. |
‑R , ‑‑gene |
TRA ,TRB ,etc |
(Required) Name of the receptor gene. All samples should contain to the same receptor gene, only one gene is allowed. |
‑‑filter |
"__antigen.species__=~'EBV'" |
(Advanced) Logical filter expresstion that will be evaluated for database columns. |
‑‑vdjdb‑conf |
1 |
VDJdb confidence level threshold, from 0 (lowest) to 3 (highest), default is 0 . |
‑‑min‑epi‑size |
10 |
Minimal number of unique CDR3 sequences per epitope in VDJdb, filters underrepresented epitopes. Default is 10 |
The --filter
option supports Java/Groovy syntax, Regex, .contains()
, .startsWith()
, etc. Parts of the expression marked with double underscore (__
, e.g. __antigen.epitope__
) will be substituted with corresponding values from database rows. Those parts should be named exactly as columns in the database, see VDJdb specification for the list of column names.
VDJdb confidence level used by --vdjdb-conf
is assigned based on the details of TCR specificity assay for each VDJdb record, see VDJdb confidence scoring for details on this procedure.
Option name | Argument example | Description |
---|---|---|
‑‑database |
/path/to/my_db |
Path and prefix of an external database. Should point to files with '.txt', and '.meta.txt' suffices (the database itself and database metadata). |
‑‑use‑fat‑db |
In case running with a built-in database, will use full database version instead of slim one. |
Full database contains extended info on method used to identify a given specific TCR and sample source, but has a higher degree of redundancy (several identical TCR:pMHC pairs from different publications, etc) that can complicate post-analysis
Option name | Argument example | Description |
---|---|---|
‑‑v‑match |
Require exact Variable segment ID match (ignoring alleles) when searching the database. | |
‑‑j‑match |
Require exact Joining segment ID match (ignoring alleles) when searching the database. | |
‑O , ‑‑search‑scope |
2,1,2 , 3,0,0,3 , ... |
Sets CDR3 sequence search parameters aka search scope: allowed number of substitutions (s ), insertions (i ), deletions (d ) / or indels (id ) and total number of mutations (t ). Default is 0,0,0 |
‑‑search‑exhaustive |
0 , 1 or 2 |
Perform exhaustive CDR3 alignment: 0 - no (fast), 1 - check and select best alignment for smallest edit distance, 2 - select best alignment across all edit distances within search scope (slow). Default is 1 . |
Search scope should be specified in either s,i,d,t
or s,id,t
form. While the second form is symmetric and counts the sum of insertions and deletions (indels), the first form is not symmetric - insertions and deletions are counted with respect to the query TCR sequence (i.e. clonotype records from input samples). Total number of mutations t
specifies the edit distance threshold.
Note that VDJmatch running time can greatly increase for large (wider than
4,2,4
) search scopes.
With a --search-exhaustive 2
the algorithm will compute an exact global alignment for CDR3 sequences, which is quite slow, for a small/moderate search scope (2 or less indels) --search-exhaustive 1
is effectively the same as --search-exhaustive 2
. Exhaustive search will choose the best alignment based on the VDJAM scoring (see below), this option has no effect if full VDJMATCH scoring is not used.
Option name | Argument example | Description |
---|---|---|
‑A , ‑‑scoring‑vdjmatch |
Use full VDJMATCH algorithm that computes full alignment score as a function of CDR3 mutations (weighted with VDJAM scoring matrix) and pre-computed V/J segment match scores. | |
‑‑scoring‑mode |
0 or 1 |
Either 0 : scores mismatches only (faster) or 1 : compute scoring for whole sequences (slower). Default is 1 . |
If --scoring-vdjmatch
is not set, will just count the number of mismatches and ignore V/J segment alignment.
CDR3 alignment score is computed as:
-
--scoring-mode 0
$(CDR3_1, CDR3_2) = \sum_s [M(s_1,s_2) - max(M(s_1,s_1), M(s_2,s_2))] - \sum_{g} M(g_1,g_1)$ where$s: 1 \rightarrow 2$ stands for substitution,$g: 1 \rightarrow '-'$ stands for gap and$M(1,2)$ is the VDJAM matrix -
--scoring-mode 1
$S(CDR3_1, CDR3_2) = aln(CDR3_1, CDR3_2) - max(aln(CDR3_1, CDR3_1), aln(CDR3_2, CDR3_2))$ where$aln(CDR3_1, CDR3_2)$ is the global alignment score without gap penalty between sequences$CDR3_1$ and$CDR3_2$ using VDJAM matrix
Full score / probability of matching the same antigen is computed using a Generalized Linear Model with cloglog link as
Option name | Argument example | Description |
---|---|---|
‑T , ‑‑hit‑filter‑score |
Drops hits with a score less than the specified threshold. | |
‑X , ‑‑hit‑filter‑max |
Only select hit with maximal score for a given query clonotype (will consider all max score hits in case of ties). | |
‑‑hit‑filter‑topn |
3 |
Select best n hits by score (can randomly drop hits in case of ties). |
‑‑hit‑weight‑inf |
Weight database hits by their 'informativeness', i.e. the log probability of them being matched by chance. |
Note that score threshold is applied to unweighted (see below) full scores, thus has little sense to use in case --scoring-vdjmatch
is not set. For VDJmatch scoring the range of scores is [0, 1]
and the recommended value for the threshold lies in the range of 0.1-0.5
. Filtered records will be removed from output annotation files and will not affect the resulting summary statistics.
The idea behind --hit-weight-info
is to give less score to more redundant 'public' clonotypes that are likely to be found in many donors simply by chance. The weight is computed as
Option name | Argument example | Description |
---|---|---|
‑‑summary‑columns |
antigen.species,antigen.gene |
Table columns for which a summary output is provided for each sample, see VDJdb specification and database metadata file for more information on available columns. |
Default summary columns are mhc.class,antigen.species,antigen.gene,antigen.epitope
, see VDJdb specification for the full list of column names.
The following output files will be generated:
-
$output_prefix.annot.summary.txt
annotation summary containing the number of unique clonotypes (unique
), their cumulative share of reads (frequency
) and total read count (reads
).- Sample metadata will be appended to this table if provided via the
-m
option. - Each row corresponds to a combination of database field values from the columns specified by the
--summary-columns
option (e.g. epitope and parent species,antigen.epitope + antigen.species
). If a single clonotype is matched to several VDJdb records, its reads count and frequency and will be appended to all of them and theunique
counter for each of the records will be incremented by1
. - The weight/informativeness sum of database hits for each row is stored in the
weight
column and can be used to scale the results, together with thedb.unique
column, storing the total number of unique database TCR entries for a given combination of summary columns. - Each of database records is tagged as
entry
incounter.type
column of summary table, statistics (total number of clonotypes, read share and count) of annotated and unannotated clonotypes is stored in rows tagged asfound
andnot.found
respectively.
- Sample metadata will be appended to this table if provided via the
-
$output_prefix.$sample_id.txt
annotation for each of the clonotypes found in database, a separate file is generated for each input sample.- This is an all-to-all merge table between the sample and database that includes all matches.
- Clonotype information from the sample (count, frequency, cdr3 sequence, v/d/j segments and v/d/j markup) is preserved.
- As a clonotype can be represented by multiple rows in the output (i.e. match to several records in the database),
id.in.sample
column can be used to make the correspondence between annotation record and 0-based index of clonotype in the original sample. For the information on database columns that are appended see database schema in VDJdb-db repository readme. - The
score
column contains CDR3 alignment score that is computed as described Scoring parameters section (not to be confused with VDJdb record confidence score. - The
weight
column contains the weight (or informativeness) of corresponding database records, see Hit filtering and weighting section for details.
Can be performed using the cluster
routine as follows:
java -Xmx4G -jar path/to/vdjmatch-version.jar cluster [options] input_sample output_prefix
The set of options can be displayed by running cluster -h
, they are essentially the same as the ones for match
routine (except for database-related options). Output file is a tab-delimited table that contains pairs of clonotypes (as in sample), their IDs and distances between them. Scoring and filtering options can be used in the same way as with match
routine.
When running with VDJtools output, all annotations generated by VDJtools e.g. NDN size, clonotype incidence for pooled samples, frequency vector in a joint sample, ... will be preserved and VDJdb-standalone annotation columns will be added after them. Vice-versa, VDJdb-annotated samples can be used in VDJtools analysis when keeping in mind the following 1) no conversion to VDJtools format is needed, 2) as a single clonotype can be reported several times many descriptive statistics are not applicable. An example usage of both VDJdb-standalone and VDJtools:
$VDJTOOLS PoolSample -m metadata.txt .
$VDJDB -S human -R TRB --filter="__antigen.species__=~'CMV'" pool.aa.table.txt .
$VDJTOOLS ApplySampleAsFilter -m metadata.txt pool.aa.table.annot.txt filtered/
The filtered/
folder will now contain all samples (and corresponding metadata.txt
) with CMV-specific clonotypes that can be used for further VDJtools analysis.
A web-based GUI for querying VDJdb and annotating RepSeq samples can be both accessed at vdjdb.cdr3.net (public server) and installed as a local server (see VDJdb-web repository for details). Local installation can be configured to remove file size/search parameter limits that are enforced in public server.
Just clone the repository and run gradle clean build
.