BamQuery: a proteogenomic tool to explore the immunopeptidome and prioritize actionable tumor antigens

BamQuery is a computational pipeline that counts all RNA-seq reads that can code for a given peptide (8 to 11 residues) in chosen RNA-seq samples (single-cell or bulk). Its strength lies in its ability to quantify RNA expression from any genomic region: protein-coding exons or non-coding regions, annotated or not. Briefly, it reverse-translates peptides into all possible coding sequences, aligns these sequences to the genome (human or mouse) with STAR, retains only the regions (spliced or not) that align perfectly with a coding sequence, and searches (grep) for the coding sequences at their respective regions in the RNA-seq sample (BAM file) of interest. All primary reads (samtools view -F 256) that can code for the peptide are counted, normalized to the total number of primary reads in the sample, and listed in a detailed report.
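
To make the reverse-translation step concrete, here is a minimal Python sketch that enumerates every nucleotide sequence able to code for a peptide under the standard genetic code. It is an illustration only, not BamQuery's actual implementation; the names `CODONS_FOR`, `reverse_translate` and `count_coding_sequences` are hypothetical.

```python
from itertools import product
from math import prod

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it (stop codons skipped).
CODONS_FOR = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    if aa != "*":
        CODONS_FOR.setdefault(aa, []).append(codon)


def reverse_translate(peptide):
    """Yield every DNA sequence that translates to `peptide`."""
    for codons in product(*(CODONS_FOR[aa] for aa in peptide)):
        yield "".join(codons)


def count_coding_sequences(peptide):
    """Number of distinct coding sequences, without enumerating them."""
    return prod(len(CODONS_FOR[aa]) for aa in peptide)
```

The number of candidate sequences grows multiplicatively with peptide length: `count_coding_sequences("SIINFEKL")` gives 5184 for this 8-residue peptide, which is why the candidates are aligned to the genome before any read is counted.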

BamQuery supports only genomically templated peptides, not proteasome-recombined peptides. Any non-mutated peptide (spanning a splice junction or not) is supported. Peptides carrying single-nucleotide variants are supported by optionally including the full dbSNP database in the genomic alignment step. Mutated peptides that derive from indels or from variants not listed in dbSNP can be analyzed with the manual mode of BamQuery by providing the genomic region of origin of the peptide. Finally, BamQuery uses an expectation-maximization algorithm to annotate the most likely biotype of each analyzed peptide, based on the annotations of its genomic regions of origin and the number of reads present at each of these regions.
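
The general idea behind an expectation-maximization annotation can be sketched as follows. This is a generic toy EM for apportioning reads among ambiguous annotations, written under the assumption that each region carries a read count and a list of candidate biotypes; it is not BamQuery's actual model, and the function name and input format are hypothetical.

```python
def em_biotype_proportions(regions, n_iter=100):
    """Estimate biotype proportions from (read_count, candidate_biotypes) pairs.

    Each region's reads are split among its candidate biotypes in proportion
    to the current estimates (E-step), then the estimates are recomputed
    from the expected read totals (M-step).
    """
    biotypes = sorted({b for _, candidates in regions for b in candidates})
    if not biotypes:
        return {}
    theta = {b: 1.0 / len(biotypes) for b in biotypes}

    for _ in range(n_iter):
        expected = dict.fromkeys(biotypes, 0.0)
        for count, candidates in regions:
            denom = sum(theta[b] for b in candidates)
            if count == 0 or denom == 0:
                continue  # region contributes no information this round
            for b in candidates:
                expected[b] += count * theta[b] / denom
        total = sum(expected.values())
        if total == 0:
            break  # no reads anywhere: keep the current estimates
        theta = {b: expected[b] / total for b in biotypes}
    return theta
```

For example, with 90 reads in a region annotated only as protein-coding and 10 reads in a region annotated as both intronic and protein-coding, the estimate converges toward protein-coding, because the unambiguous reads pull the ambiguous ones with them.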

BamQuery can be installed as described below or tested with our online portal. The portal quantifies any peptide in medullary thymic epithelial cells and dendritic cells to evaluate its probability of being expressed by normal tissues. A prediction of its immunogenicity is also provided based on these expression measures. BamQuery can therefore be used to evaluate the specificity and immunogenicity of tumor antigens.

Read our article in Genome Biology

For detailed usage instructions, see the documentation on our web page: https://bamquery.iric.ca/

BamQuery was designed and developed by Gregory Ehx (GIGA Institute, University of Liege) and Maria Virginia Ruiz Cuevas (Institute for Research in Immunology and Cancer (IRIC), University of Montreal).

Required software:

Required Python package:

Required R package:

Installation

Installation From source

See the user manual for a detailed description of usage.

1. Clone repository from github

    export INSTALLDIR=/opt/bamquery
    mkdir -p $INSTALLDIR
    cd $INSTALLDIR
    git clone https://github.com/lemieux-lab/BamQuery.git

2. Install required library files within $INSTALLDIR:

    wget https://bamquery.iric.ca/download/lib_essentials.tar.gz
    tar vxzf lib_essentials.tar.gz

2.a Installation of genomes

BamQuery supports three versions of the human genome (v26_88 / v33_99 / v38_104) and two versions of the mouse genome (M24 / M30, based on GRCm38 and GRCm39 respectively).

Download the human or mouse genome version you wish to use into:

    cd lib/genome_versions

Use the command below to download a human genome version (v26_88, v33_99 or v38_104), replacing SET_VERSION with the chosen version:

    wget https://bamquery.iric.ca/download/genome_SET_VERSION.tar.gz

Or, to download a mouse genome version (m24 or m30):

    wget https://bamquery.iric.ca/download/genome_mouse_SET_VERSION.tar.gz

Finally, extract the archive:

    tar vxzf GENOME_VERSION.tar.gz

Replace GENOME_VERSION with the name of the genome version that was downloaded, so that the archive name reads, for example, genome_v26_88.tar.gz.
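
As a convenience, the URL templates above can be wrapped in a small Python helper that rejects unknown versions before anything is downloaded. This is a sketch built only from the version names and URL patterns listed in this section; `genome_archive_url` is a hypothetical name and is not part of BamQuery.

```python
BASE_URL = "https://bamquery.iric.ca/download"
HUMAN_GENOMES = ("v26_88", "v33_99", "v38_104")
MOUSE_GENOMES = ("m24", "m30")


def genome_archive_url(version):
    """Return the download URL of a genome archive, following the templates above."""
    if version in HUMAN_GENOMES:
        return f"{BASE_URL}/genome_{version}.tar.gz"
    if version in MOUSE_GENOMES:
        return f"{BASE_URL}/genome_mouse_{version}.tar.gz"
    supported = ", ".join(HUMAN_GENOMES + MOUSE_GENOMES)
    raise ValueError(f"unknown genome version {version!r}; expected one of: {supported}")
```

The returned URL can then be passed to wget from within lib/genome_versions, as in the commands above.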

2.b Installation of SNPs

BamQuery supports three versions of dbSNP for the human genome (149 / 151 / 155) and two for the mouse genome (snps_GRCm38 and snps_GRCm39, for M24 and M30 respectively).

Optionally, download the annotated SNPs you need (by default, BamQuery does not use SNPs) into:

    cd lib/snps

Use the command below to download the dbSNP release for the human genome (149, 151 or 155), replacing SET_RELEASE with the chosen release:

    wget https://bamquery.iric.ca/download/dbsnps_SET_RELEASE.tar.gz

Or, to download the dbSNP release for the mouse genome (GRCm38 or GRCm39):

    wget https://bamquery.iric.ca/download/snps_mouse_SET_RELEASE.tar.gz

Finally, extract the archive:

    tar vxzf SNPS_RELEASE.tar.gz

3. Create a virtual environment and install dependencies

Option 1: Installation with Conda

For users having no administrator privileges, we recommend installing BamQuery with conda.

First create a conda environment and activate it:
```
 conda create -n BQ
 conda activate BQ
```

Then install all dependencies:

 conda install -y -c bioconda pysam
 conda install -y -c anaconda pandas
 conda install -y -c conda-forge pathos
 conda install -y -c conda-forge xlsxwriter
 conda install -y -c anaconda seaborn
 conda install -y -c conda-forge billiard
 conda install -y -c conda-forge biopython
 conda install -y -c anaconda scipy
 conda install -y -c bioconda bedtools
 conda install -y -c bioconda star=2.7.9a
 conda install -y -c conda-forge mamba
 mamba install -y -c conda-forge r-ggplot2
 mamba install -y -c conda-forge r-data.table

Launch the analysis:

 conda activate BQ
 python3 ${INSTALLDIR}/BamQuery/BamQuery.py path_to_input_folder name_exp genome_version

Option 2: Installation from source

Download Python 3 and create a virtual environment. Python: https://www.python.org/
```
 python3 -m venv ${INSTALLDIR}/env
 source ${INSTALLDIR}/env/bin/activate
```

Install the Python packages in the virtual environment:

 pip install --upgrade pip
 pip install pandas
 pip install pysam
 pip install pathos
 pip install xlsxwriter
 pip install seaborn
 pip install billiard
 pip install numpy
 pip install scipy
 pip install biopython

Install external dependencies so that their binaries are available in your $PATH:

STAR 2.7.9a: https://github.com/alexdobin/STAR
bedtools: https://bedtools.readthedocs.io/en/latest/
R: https://www.r-project.org/, required R packages: ggplot2, data.table
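
Before launching an analysis, it can help to confirm that the external tools are actually reachable from $PATH. The sketch below uses only the Python standard library; the executable names in `REQUIRED_EXECUTABLES` are assumptions (in particular, `Rscript` is assumed to be how R is invoked) and may need adjusting to your setup.

```python
import shutil

# Assumed executable names for the external dependencies listed above.
REQUIRED_EXECUTABLES = ("STAR", "bedtools", "Rscript")


def missing_executables(names=REQUIRED_EXECUTABLES):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]
```

An empty list from `missing_executables()` means every tool was found; otherwise the returned names are the ones to install or add to $PATH.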

Launch the analysis:

 python3 ${INSTALLDIR}/BamQuery/BamQuery.py path_to_input_folder name_exp genome_version

Installation using the provided docker container

A docker container is also available to provide a self-contained working environment.

1. Create an install folder:

    export INSTALLDIR=/opt/bamquery
    mkdir $INSTALLDIR
    cd $INSTALLDIR

2. Download the docker image:

    wget https://bamquery.iric.ca/download/bamquery-2023-07-03.tar.gz

3. Install the docker image (requires sudo access):

    gunzip bamquery-2023-07-03.tar.gz
    sudo docker load --input bamquery-2023-07-03.tar

4. Install required library files within $INSTALLDIR:

Follow the instructions in step 2 of the installation from source above (including 2.a and 2.b).

5. Launch the analysis from the docker container:

    sudo docker run -i -t  \
    --user $(id -u):$(id -g) \
    -v $INSTALLDIR/lib:/opt/bamquery/lib \
    -v $DATAFOLDER:$DATAFOLDER  \
    -v $PWD:$PWD \
    iric/bamquery:0.2 python3 /opt/bamquery/BamQuery/BamQuery.py path_to_input_folder name_exp genome_version

Make sure to map every folder mentioned in the input files (BAM locations, input folder) so that these paths are available from within the container. To do so, pass one -v $DATAFOLDER:$DATAFOLDER argument per folder (replacing $DATAFOLDER with an actual folder name), plus -v $PWD:$PWD if needed. The --user $(id -u):$(id -g) argument forces the application to run with your user permissions instead of root.
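
When several data folders must be mapped, assembling the command by hand becomes error-prone. The following Python sketch builds the same argument list as the command above, with one -v mapping per folder. The function name `docker_run_args` and its parameters are hypothetical; the image tag and container paths are copied from the command above, and sudo is left out for the caller to prepend if required.

```python
import os


def docker_run_args(installdir, data_folders, bamquery_args,
                    image="iric/bamquery:0.2", user=None):
    """Build the `docker run` argument list for a BamQuery analysis."""
    if user is None:
        # Same effect as --user $(id -u):$(id -g) in the shell command.
        user = f"{os.getuid()}:{os.getgid()}"
    args = ["docker", "run", "-i", "-t", "--user", user,
            "-v", f"{os.path.abspath(installdir)}/lib:/opt/bamquery/lib"]
    for folder in data_folders:
        # Bind mounts need absolute paths; map each folder onto the same path.
        path = os.path.abspath(folder)
        args += ["-v", f"{path}:{path}"]
    args += [image, "python3", "/opt/bamquery/BamQuery/BamQuery.py", *bamquery_args]
    return args
```

The resulting list can be passed to `subprocess.run`, with `bamquery_args` holding the input folder, experiment name and genome version in that order.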

For more information on configuration, see: https://bamquery.iric.ca/documentation/configuration.html

Note: BamQuery requires a specific folder structure to work.

Once BamQuery is installed, check that the structure looks as follows:

    .
    ├── BamQuery
    │   ├── BamQuery.py
    │   ├── genomics
    │   ├── plotting
    │   ├── readers
    │   ├── README.md
    │   └── utils
    └── lib
        ├── coefficients.dic
        ├── Cosmic_info.dic
        ├── ERE_info.dic
        ├── ERE_info_mouse.dic
        ├── EREs_souris.bed
        ├── genome_versions
        │   ├── genome_mouse_m24
        │   ├── genome_mouse_m30
        │   ├── genome_v26_88
        │   ├── genome_v33_99
        │   └── genome_v38_104
        ├── hg38_ucsc_repeatmasker.gtf
        ├── README.txt
        └── snps
            ├── snps_dics_149
            ├── snps_dics_149_common
            ├── snps_dics_151
            ├── snps_dics_151_common
            ├── snps_dics_155
            └── snps_dics_155_common
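
The layout above can be verified programmatically. This minimal Python sketch checks for a subset of the entries shown in the tree; which entries to treat as mandatory is an assumption here (genome and SNP subfolders depend on what you downloaded), and `missing_paths` is a hypothetical helper, not part of BamQuery.

```python
from pathlib import Path

# Subset of the tree above that is assumed to be present in every install.
EXPECTED_PATHS = (
    "BamQuery/BamQuery.py",
    "BamQuery/genomics",
    "BamQuery/plotting",
    "BamQuery/readers",
    "BamQuery/utils",
    "lib/genome_versions",
    "lib/snps",
)


def missing_paths(installdir, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under installdir."""
    root = Path(installdir)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths` on your installation folder returns an empty list when the structure matches; any returned path points to a step of the installation that needs to be repeated.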