Cancer Epitopes CSHL

Plan

Goal: given an SRA ID, prioritize and quantify variants with respect to immunogenicity (single score) + variant annotation

use pvacseq to generate the sequences around variants (wt/mutant)
use FRED2 to make binding predictions
given the different prediction score, make one immunogenicity score

Annotate VCF
Extract peptide sequences
Define sample's MHC alleles
Collect MHC-peptide affinity predictions
Add our own classifier score

More info:

1) Annotate RNAseq VCF

nohup  variant_effect_predictor.pl \
   --input_file /home/data/vcf/hisat_tags_output_SRR1616919.sorted.vcf  \
    --format vcf \
     --terms SO --offline  --force_overwrite \
      --plugin Wildtype --plugin Downstream  \
       --dir /home/data/vep   --vcf --symbol \
        --fork 16  --coding_only --no_intergenic \
         --output_file /home/data/imm/hisat_tags_output_SRR1616919.sorted.annotated.vcf  &

2a) Generate FASTA with `pVACSeq` and write to csv

The genome information stored in the vcf file need to be translated into corresponding peptide sequences. For this, we use parts of pVacSeq, which generates a csv file with 21-mers surrounding the mutation (both, WT and mutant form)

    cd $HOME  
    git clone https://github.com/NCBI-Hackathons/Cancer_Epitopes_CSHL.git
 
    source activate python3 
    
    cd $HOME/Cancer_Epitopes_CSHL/src   
    python -c 'import generate_fasta; generate_fasta.generate_fasta_dataframe("/home/devsci7/test.output.2", "/home/devsci7/step2.csv", 21,9)'

2b) Define MHC locus for HLA genotyping

Tools for HLA genotyping typically re-align the raw reads in order to identify the HLA type from RNA-seq. To obtain the reads roughly aligned to these genes we need to define the region and specify it during the alignment process. The MHC complex consists of more than 200 genes located close together on chromosome 6. The script hla_type.sh extracts reads overlapping with the MHC locus and turns them into two fastq files. These reads are then re-aligned and analyzed by OptiType

  src/hla_type.sh -b /home/data/hisat_tags_output_SRR1616919.sorted.bam \
    -r NC_000006.12:29600000-33500000 -o test --path /opt/samtools/1.3.1/bin/

3) Compute immunogenicity for each peptide

Run the script: src/imm_predict/fred2_allele_prediction.py to compute the MHC binding affinity predictions for each reference and alternative allele.

Example

python2 ./src/imm_predict/fred2_allele_prediction.py \
        ./pvacseq_table.csv ./variant_immunogenicity.csv

Usage

       fred2_allele_prediction.py [--alleles=<alleles_list>] FILE_IN FILE_OUT                          
       fred2_allele_prediction.py -h | --help

Argument	Explanation
FILE IN (req.)	Input csv file with... ???
FILE OUT (req.)	Name for output csv file
alleles	Comma separated list of target alleles, e.g., `--alleles="B27:20,B83:01,A*32:15"` [Default: use all]

4. Create and explore the training data set

GOAL: classifier learnt on peptides with reported immunogenic cancer neoantigens (WT sequences without immunogenicity)

Collection of training sets

To play around with the data sets, here are the current workflows:

# read in the data (this expects that you're in Cancer_Epitopes_CSHL;
# just check the paths that are hard-coded at the moment within the
# script
Rscript --vanilla --slave src/imm_explore/0-read.R
# this should have created data/immunogenic_SNVs-training_sets.csv

export TMPDIR=/tmp/
PYTHON=/opt/modules/i12g/anaconda/3-4.1.1/envs/python27/bin/python 

# calculate the binding affinities
PYTHON src/imm_explore/fred2_design_matrix.py \
	--input=data/immunogenic_SNVs-training_sets.csv \
	--output=data/immunogenic_SNVs-model_data.csv

Compute the background protein immunogenicity [optional]

This needs some discussion and perhaps work?

Given a FASTA file with all human proteins, compute the immunogenicity for all posible 9-mer peptides.

NOTE: This will take a very long time to compute!

Script: src/imm_predict/fred2_background.py.

Usage:

       fred2_background.py [--alleles=<alleles_list> --top_N=N] FILE_IN FILE_OUT                       
       fred2_background.py -h | --help

Argument	Explanation
FILE IN (req.)	Input FASTA file with all human proteins, can be retrieved from ENSEMBL
FILE OUT (req.)	Name for output csv file
top N	Number of top N proteins to compute the background for. [Default: all].
alleles	Comma separated list of target alleles, e.g., `--alleles="B27:20,B83:01,A*32:15"` [Default: use all]

Example:

python2 ./src/imm_predict/fred2_background.py \
        ./Homo_sapiens.GRCh38.pep.all.fixheader.fa ./background_peptides.csv

Copy file over

Not sure where this belongs?

cp  /home/devsci7/step2.fasta   /home/data/imm

friedue/Cancer_Epitopes_CSHL