IEDB Protein Tree

Assigning IEDB source antigens and epitopes to their genes and proteins.

Current Success Rates:

Collect the epitope and source antigen data for a species.
Select the best proteome for that species from UniProt.
Assign gene/protein to source antigens and epitopes using BLAST, ARC, and PEPMatch.
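The assignment step can be pictured with a minimal sketch of exact peptide matching. This is not how BLAST, ARC, or PEPMatch work internally; it is a plain substring search standing in for the idea of locating an epitope inside a proteome. The function names, protein IDs, and sequences below are all made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a {protein_id: sequence} dict."""
    proteins = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            proteins[current] = ""
        elif current is not None:
            proteins[current] += line
    return proteins


def assign_epitopes(epitopes, proteins):
    """Map each epitope to the first protein that contains it exactly, else None."""
    assignments = {}
    for epitope in epitopes:
        assignments[epitope] = next(
            (pid for pid, seq in proteins.items() if epitope in seq), None
        )
    return assignments


# Hypothetical two-protein proteome; sequences are invented.
example_fasta = """>P1 hypothetical protein one
MKTAYIAKQR
QISFVKSHFS
>P2 hypothetical protein two
GAVLIMCFYW
"""

proteins = parse_fasta(example_fasta)
result = assign_epitopes(["IAKQRQIS", "LIMC", "XXXX"], proteins)
print(result)
```

An epitope with no exact match maps to None here, which corresponds to an unsuccessful assignment in this simplified picture.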

To run the entire pipeline:

protein_tree/run.py -t <taxon ID>

protein_tree/run.py -a
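To run the pipeline for a handful of species in one go, a small wrapper along these lines may help. It is a sketch that assumes run.py can be launched through the Python interpreter from the repository root; build_command and run_for_taxa are hypothetical helper names, and the taxon IDs are only examples (9606 is human, 10090 is mouse).

```python
import subprocess
import sys


def build_command(taxon_id):
    """Build the argument list for one run.py invocation with the -t flag."""
    return [sys.executable, "protein_tree/run.py", "-t", str(taxon_id)]


def run_for_taxa(taxon_ids, dry_run=True):
    """Run (or, with dry_run=True, just print) the pipeline command per taxon."""
    commands = [build_command(taxon_id) for taxon_id in taxon_ids]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first taxon whose run fails.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_for_taxa([9606, 10090])
```

The dry-run default only prints the commands; pass dry_run=False to actually execute them.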

Getting the raw epitope and source antigen data can be run separately:

protein_tree/get_data.py -t <taxon ID>

Selecting the best proteome can also be run separately:

protein_tree/select_proteome.py -t <taxon ID>

For each species:

proteome.fasta - selected proteome in FASTA
source_assignments.csv - each source antigen with assigned gene and protein
epitope_assignments.csv - each epitope with its source antigen and assigned protein
[optional] gp_proteome.fasta - the gene priority proteome in FASTA if it exists

For all species:

metrics.csv - the metadata from the build
- Proteome ID
- Proteome Taxon
- Proteome Type
- Source Antigen Count
- Epitope Count
- Successful Source Assignment (%)
- Successful Epitope Assignment (%)
all_epitope_assignments.csv - combined epitope assignments for every species
all_source_assignments.csv - combined source antigen assignments for every species
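As a rough illustration of what the success percentages in metrics.csv express, the sketch below counts the share of rows that have a non-empty assignment. The column names (source_id, assigned_protein) and the accession-like values are assumptions made up for this example; the real files may use different headers.

```python
import csv
import io


def success_rate(rows, column):
    """Return the percentage of rows whose `column` value is non-empty."""
    rows = list(rows)
    if not rows:
        return 0.0
    assigned = sum(1 for row in rows if (row.get(column) or "").strip())
    return 100.0 * assigned / len(rows)


# Hypothetical assignments: three of four source antigens have a protein.
example_csv = """source_id,assigned_protein
A1,P12345
A2,
A3,Q67890
A4,P12345
"""

rows = csv.DictReader(io.StringIO(example_csv))
rate = success_rate(rows, "assigned_protein")
print(f"{rate:.1f}")  # 75.0
```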

Use combine_data.py to merge all the assignments into the all_epitope_assignments.csv and all_source_assignments.csv files.
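The merge can be pictured with the following sketch, which is not the code in combine_data.py. It assumes one subdirectory per species under a build directory, each holding a CSV with the same filename, and it adds a species column taken from the directory name; combine_csvs, the layout, and the column names are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def combine_csvs(build_dir, filename, out_path):
    """Concatenate every <build_dir>/<species>/<filename> into out_path."""
    rows, fieldnames = [], ["species"]
    for path in sorted(Path(build_dir).glob(f"*/{filename}")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of columns, keeping first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row["species"] = path.parent.name
                rows.append(row)
    with Path(out_path).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on a throwaway directory with two made-up species folders.
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("9606", "epitope,protein\nAAA,P1\n"),
        ("10090", "epitope,protein\nCCC,P2\nDDD,\n"),
    ]
    for species, body in samples:
        species_dir = Path(tmp) / species
        species_dir.mkdir()
        (species_dir / "epitope_assignments.csv").write_text(body)
    out = Path(tmp) / "all_epitope_assignments.csv"
    combined_count = combine_csvs(tmp, "epitope_assignments.csv", out)
    combined_header = out.read_text().splitlines()[0]

print(combined_count, combined_header)
```

Writing the combined file at the top level of the build directory keeps it out of the per-species glob, so rerunning the merge does not read its own output.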
