This repository contains information on the steps followed to identify conserved cross species b-cell linear epitopes associated with human malaria
Scripts included in this analysis were run on
- Bash
- Conda environment.
- R
- Nextflow
- Conda and Docker used to run different programs
- Whole genome sequencing data from field isolates used to analyze the conservation of individual predicted antigens containing epitopes of interest.
- Prediction tools including Deeploc, WoLF PSORT, Phobius, Bepipred 2.0 and Deeploc
Plasmodium proteins from five species associated with human malaria were downloaded and CD-HIT was used to remove redundant proteins with a threshold of 90%.
WOLF PSORT(https://wolfpsort.hgc.jp/) and Deeploc (https://services.healthtech.dtu.dk/services/DeepLoc-2.0/) were used to predict the subcellular location of the proteins using P. falciparum as the reference organism. Extracellular and membrane bound proteins were then selected based on the Rscript subcellular_location_pred.R Intracellular proteins were also eliminated using their protein names (epitope_analysis.sh).
The presence of a signal peptide and transmembrane helice predicted by using Phobius (https://phobius.sbc.su.se/). Proteins with one signal peptide or one transmembrane helix were selected using the Rscript surfaceome_analysis.R
We retrieved mass spectrometry data on proteins located on the late schizont-merozoite stage and cross-selected against proteins located on the surface of the cell
We utilized Bepipred 2.0 and Epidope (https://github.com/flomock/EpiDope) to predict epitopes. Bepipred 2.0 is a tool used to predict the presence of b-cell linear epitopes. We then made an R script bepipred_extractor.R. Currently, Bepipred 2.0 is hosted at DTU University found here https://services.healthtech.dtu.dk/services/BepiPred-2.0/. Bepipred accepts a maximum of 50 protein sequences and 300,000 amino acids per submission with the length ranging from 10-300000. The output for multiple protein sequences can be downloaded in the form of a CSV file. However, the output contains raw b-cell epitope sequences and takes time to process especially on epitopes that meet the defined threshold of 0.5. Herein I tried to extract the exposed epitopes that met the threshold score of 0.5. The final output file is in the form of a table containing the protein_ID, start and end position, epitope sequence, and length. The first trial is run with 50 protein sequences.
We used BLAST to create a non-falciparum and human database and blasted the predicted B-cell linear epitopes against these database. We selected epitopes that had percentage similarity > 70, coverage>95% and length 10-30 amino acids. We further used Vaxijen 2.0 (https://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html) to check for antigenicity, AllerTop (https://www.ddg-pharmfac.net/AllerTOP/) for allergenicity and ToxinPred(https://webs.iiitd.edu.in/raghava/toxinpred/) for toxicity. We selected epitopes that were cross-species, non-human, non-allergen, non-toxic and antigenic.
We analyzed 27 locally sequenced isolated using the nextflow pipeline plasmodium_variant.nf. We obtained variant calling files and used the script epitope_extraction.sh to extract amino acid sequences containing the peptides of interest.