HLAEquity

Finding peptide targets that bind to HLAs and trigger an immune response is challenging, and peptide-HLA (pHLA) binding prediction is one of the crucial steps in the development of personalized peptide vaccines. Machine Learning (ML) pHLA binding prediction tools are trained on vast amounts of pHLA binding data. ML predictions are effective in guiding the search for therapeutic peptide targets. However, the use of datasets with imbalanced allele content raises concerns about biased performance toward certain geographic populations. We examine the bias of two ML-based pan-allele pHLA binding affinity predictors. We aim to draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term "pan-allele" to describe models trained with currently available public datasets.

In this repo you can find all the scripts used to perform the given analysis.

Check out the following subsections of this repo:

Software requirements gives an overview of software needed to run the analysis and provides installation instructions.
Interactive dashboards gives a chance for readers to take an interactive look into the data presented in the paper.
Datasets and methods gives details of the content of this repo and outlines steps performed with the notebooks.
Additional Resources provides background on related databases and literature

Software requirements

Option 1 - running the scripts with Binder

The simplest way to run the analysis is to use our binder connection:

Be patient as Binder might take some time to load. Then follow the instructions given by the Datasets and methods section and readme files in the subfolders to perform the analysis.

Option 2 - running the scripts locally

Most of the analysis is performed in python and wrapped in jupyter notebooks. You will need:

Python (version 3.7 or later)
Packages listed in the requirements.txt, most notably:
- plotly
- pandas
- scipy
- numpy
Jupyter
IEDB population coverage tool

Analysis related to the algorithmic bias is performed in R. To run this analysis you will need RStudio. Clone the content of this repository and follow the instructions given by the Datasets and methods section

Interactive dashboards

Using binder, jupyter and voila we constructed interactive dashboards with the data we analyzed.

⚠️⏰ The dashboards should take up to 2 min to load!⏰⚠️ (if you're experiencing longer waits, please report an issue)

1. Take a look at HLA allele frequencies across different populations (from the AFND database).

See allele frequencies per populations - select one or more populations from the dropdown using CTRL+SHIFT keys
See a hierarchical clustering of populations based on their HLA allele frequencies (populations clustered together have a similar HLA allele content)
See a UMAP embedding of the populations based on the allele: populations close to each other have similar allele profiles; select populations to display their alleles.

2. Take a look at the allele content of training datasets (NetMHCpan4.1 and MHCFlurry2.0)

See the number of peptides per allele
See the population coverage of different datasets

3. Inspect the literature origin of the IEDB currated data

Datasets and methods

1. Mapping HLA alleles to geographic populations and classifying them by income level

1.1 Allele Frequency Net Database (AFND) scraping

To get the frequencies of HLA alleles across different populations we query the AFND database to get the most recent allele frequencies. More details here.

Code for running this is in the AFND_population_frequencies folder. This code allows you to:

Download the AFND HLA allele frequencies per locus for all available populations
Combine the AFND frequencies across different loci
Visualize and analyze the AFND data

The result of this step is a population allele frequency map AFND_data_locus_all.csv later used for calculating the population coverage as the source of ground truth allele frequency labels.

1.2 Classifying geographic populations by their level of income

Code for running this is in the WorldBank_Income_levels folder. We download the country income levels from the World Bank here (current classification by income in XLSX format) and process them.

The result of this step is a population_income_map.csv which maps the AFND populations to countries and respective income levels

2. Examining data bias

Here we calculate the population coverage of different training datasets (NetMHCpan4.1 and MHCFlurry2.0 - binding affinity and mass spectrometry data). The related scripts are in the Datasets_population_coverage folder.

2.1 Download and prepare the datasets

Find the detailed instructions and notebooks for downloading the datasets inside the ./datasets/ folder here.

Prepare the NetMHCpan4.1 datasets
Prepare the MHCFlurry2.0 datastes
Visualize the dataset content

2.2 Calculate the population coverage

Use the ./CalculateCoverage_IEDB.ipynb notebook to calculate the population coverage of the datasets. This notebook will guide you through the following steps:

Downloading and setting up the IEDB population coverage tool from here
Running the population coverage for a single population and a single dataset
Running the population coverage for all datasets and all populations

2.3. Visualize the population coverage of the datasets

Run the ./Visualize_results.ipynb notebook to see how different datasets cover the given populations and how this coverage relates to different income levels and ancestries.

3. Examining algorithmic bias

To examine the algorithmic bias we use a recently published independent dataset collected by Pyke et al.. This dataset contains MS data for some rare alleles that were not sampled before (i.e., A02:52, B15:13). More details here.

The code for running this analysis is in the folder Algorithmic_bias_analysis. This code allows you to:

Prepare_dataset.R: Filter and prepare the dataset from Pyke et al. used in the analysis. Namely, peptides with any chemical modifications are removed, and negatives are sampled from the human proteome.
Motif_generation.R: Generate motifs for calculating FOOP scores (see paper for more details)
Data_analysis.R: Calculate PPV/FOOP scores and generate plots

Additional Resources

Databases

AFND Allele frequency net database

Currated frequencies across regions/ethnicities!! (what we need) A couple of resources to access this data

immunotation R package
hladownload python package
IPD-IMGT Immuno Polymorphism Database, international ImMunoGeneTics information system

Contains raw sample data - if we need more details.
IEDB The immune epitope database

Contains more raw data (on the binding affinity / mass spec side). Contains the population coverage module
hla.alleles.org

Literature

Works that mention/address the possible bias in the therapeutics.
- Bui et al., 2005: Predicting population coverage of T-cell epitope-based diagnostics and vaccines
  
  A disproportionate amount of MHC polymorphism occurs in positions constituting the peptide-binding region, and as a result, MHC molecules exhibit a widely varying binding specificity. In the design of peptide-based vaccines and diagnostics, the issue of population coverage in relation to MHC polymorphism is further complicated by the fact that different HLA types are expressed at dramatically different frequencies in different ethnicities. Thus, without careful consideration, a vaccine or diagnostic with ethnically biased population coverage could result.
- Oyarzun et al., 2015: A bioinformatics tool for epitope-based vaccine design that accounts for human ethnic diversity: Application to emerging infectious diseases
  
  Predivac-2.0 is a novel approach in epitope-based vaccine design, particularly suited to be applied to virus-related emerging infectious diseases, because the geographic distributions of the viruses are well defined and ethnic populations in need of vaccination can be determined (“ethnicity-oriented approach”). Predivac-2.0 is accessible through the website http://predivac.biosci.uq.edu.au/.
- Sarkizova et al. - HLAAthena
- Pyke et al. 2021: Precision Neoantigen Discovery Using Large-scale Immunopeptidomes and Composite Modeling of MHC Peptide Presentation

KavrakiLab/HLAequity