/covid19

covid19 exploration

Primary LanguageRMIT LicenseMIT

covid19

Read our paper in Journal of Virology

Software requirements

R and R packages

We ran R scripts with R v3.6.2. We used the following packages installable from CRAN:

  • dplyr (v0.85)
  • rvest (v0.3.5)
  • rworldmap (v1.3-6)
  • ggplot2 (v3.3.0)

Python and python packages:

We ran python scripts with Python 3.7, installed using Anaconda. See docstrings of *.py files for further information on reproducing results. Packages were installed with pip3 and include:

  • matplotlib
  • mmh3

Data

SARS-CoV2 and other coronavirus protein sequences

We manually downloaded the full polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequence FASTA files for the following 34 coronavirus species from the National Center of Biotechnology Information (NCBI) Reference Sequence Database. Note that for SARS-CoV-2 and SARS-CoV we downloaded FASTA files for all additional annotated protein sequences comprising their full viral proteomes:

Name ORF1ab Spike Envelope Membrane Nucleocapsid
SARS-CoV-2* YP_009724389.1 YP_009724390.1 YP_009724392.1 YP_009724393.1 YP_009724397.2
SARS-CoV* NP_828849.2 NP_828851.1 NP_828854.1 NP_828855.1 NP_828858.1
OC43 YP_009555238.1 YP_009555241.1 YP_009555243.1 YP_009555244.1 YP_009555245.1
Bovine-CoV NP_150073.2 NP_150077.1 NP_150081.1 NP_150082.1 NP_150083.1
HKU24 YP_009113022.1 YP_009113025.1 YP_009113028.1 YP_009113029.1 YP_009113031.1
HKU1 YP_173236.1 YP_173238.1 YP_173240.1 YP_173241.1 YP_173242.1
MHV AAU06353.1 AAU06356.1 AAU06359.1 AAU06360.1 NP_045302.1
Rat-CoV YP_003029844.1 YP_003029848.1 YP_003029850.1 YP_003029851.1 YP_003029852.1
Bat-BCoV YP_009072438.1 YP_009072440.1 YP_009072442.1 YP_009072443.1 YP_009072446.1
Hedgehog-CoV YP_009513008.1 YP_009513010.1 YP_009513016.1 YP_009513017.1 YP_009513018.1
MERS-CoV YP_009047202.1 YP_009047204.1 YP_009047209.1 YP_009047210.1 YP_009047211.1
HKU4 YP_001039952.1 YP_001039953.1 YP_001039958.1 YP_001039959.1 YP_001039960.1
HKU5 YP_001039961.1 YP_001039962.1 YP_001039967.1 YP_001039968.1 YP_001039969.1
GCCDC1 YP_009273004.1 YP_009273005.1 YP_009273007.1 YP_009273008.1 YP_009273009.1
HKU9 YP_001039970.1 YP_001039971.1 YP_001039973.1 YP_001039974.1 YP_001039975.1
HKU14 YP_005454239.1 YP_005454245.1 YP_005454247.1 YP_005454248.1 YP_005454249.1
CDPHE15 YP_008439200.1 YP_008439202.1 YP_008439204.1 YP_008439205.1 YP_008439206.1
HKU10 YP_006908641.2 YP_006908642.1 YP_006908644.1 YP_006908645.1 YP_006908646.1
BtRf-AlphaCoV YP_009199789.1 YP_009199790.1 YP_009199792.1 YP_009199793.1 YP_009199794.1
229E ARU07599.1 ARU07601.1 ARU07603.1 ARU07604.1 ARU07605.1
LuchengRn-CoV YP_009336483.1 YP_009336484.1 YP_009336485.1 YP_009336486.1 YP_009336487.1
Ferret-CoV YP_009256195.1 YP_009256197.1 YP_009256199.1 YP_009256200.1 YP_009256201.1
Mink-CoV YP_009019180.1 YP_009019182.1 YP_009019184.1 YP_009019185.1 YP_009019186.1
Bat-CoV-1A YP_001718603.1 YP_001718605.1 YP_001718607.1 YP_001718608.1 YP_001718609.1
HKU8 YP_001718610.1 YP_001718612.1 YP_001718614.1 YP_001718615.1 YP_001718616.1
BtMr-AlphaCoV YP_009199608.1 YP_009199609.1 YP_009199611.1 YP_009199612.1 YP_009199613.1
BtNv-AlphaCoV YP_009201729.1 YP_009201730.1 YP_009201732.1 YP_009201733.1 YP_009201734.1
Porcine-EDV NP_598309.2 NP_598310.1 NP_598312.1 NP_598313.1 NP_598314.1
BtCoV512 YP_001351683.1 YP_001351684.1 YP_001351686.1 YP_001351687.1 YP_001351688.1
HKU2 YP_001552234.1 YP_001552236.1 YP_001552238.1 YP_001552239.1 YP_001552240.1
NL63 YP_003766.2 YP_003767.1 YP_003769.1 YP_003770.1 YP_003771.1
NL63-related YP_009328933.1 YP_009328935.1 YP_009328937.1 YP_009328938.1 YP_009328939.1
FCoV YP_004070193.2 YP_004070194.1 YP_004070197.1 YP_004070198.1 YP_004070199.1
TGE NP_058422.1 NP_058424.1 NP_058426.1 NP_058427.2 NP_058428.1

HLA allele frequency data

We designed and executed a custom R script (HLA_frequencies.R) to scrape all population and human leukocyte antigen (HLA) allele frequency from the Allele Frequency Net Database.

Data analysis

Protein sequence alignments

For each protein class (i.e. ORF1ab, S, M, E, N), all 34 coronavirus sequences were aligned using the web-based Clustal Omega multisequence aligner tool employing default parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations [0], maximum guide tree iterations [-1], and maximum HMM iterations [-1].

Binding affinity

Using the above FASTA sequences, we used netchop v3.0 "C-term" model with a cleavage threshold of 0.1 to filter peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage.

To assess binding affinity, we kmerized the protein FASTA into peptides of length 8-12.

sed 's/>.*/|/g' protein_sequence.fasta > protein_sequence_with_pipe.fasta

perl -pe 's/\s+//g' protein_sequence_with_pipe.fasta > protein_no_space.fasta

From the netchop v3.0 output, we further processed the file to get rid of NA lines and added row numbers as a column.

sed '/NA/d' netchop_out.txt > netchop_out_parsed.txt

awk -F'\t' 'NR>1{$0=$0"\t"NR-1} 1' netchop_out_parsed.txt > netchop_numbered.txt

Using these 2 files, we ran a custom python script (faster_kmer.py) to generate a list of peptides.

python3 faster_kmer.py > output.pep

sed '/|/d' output.pep > output_filt.pep

Binding affinity of these peptides were calculated using netMHCpan v4.0 (73) with the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides of lengths 8-12.