- Alexander Smith
- Kelvin Lu
- Archange Giscard Destiné
- Python 3
To install, run:
$ install.sh
This creates a Python virtual environment in the directory venv.
Activate the environment with the command:
$ source venv/bin/activate
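To confirm that the activation took effect, a quick check from Python can help. This is a minimal sketch rather than part of the project: the helper name in_virtualenv is made up for illustration, and it relies only on the standard sys module.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints False after sourcing venv/bin/activate, the shell is probably still picking up a different python executable.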
Run the download_data.sh script.
This downloads the Uniprot Swiss-Prot FASTA data and unpacks it to
data/uniprot_sprot.fasta.
Run cs747-parse-uniprot-fasta
This parses the Uniprot FASTA data into a Pandas DataFrame and saves it as
a CSV file, data/seq.csv.
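The parsing step can be pictured roughly as follows. This is a sketch under stated assumptions, not the code behind cs747-parse-uniprot-fasta: it uses only the standard library instead of Pandas, the function names and CSV column names are invented for illustration, and the sample records (accessions FAKE01 and FAKE02) are made up. It assumes the usual Swiss-Prot header layout, `>db|ACCESSION|ENTRY_NAME description OS=organism OX=taxon_id ...`.

```python
import csv
import io
import re

# Assumed header layout: >db|ACCESSION|ENTRY_NAME description OS=... OX=... ...
HEADER_RE = re.compile(r"^>(\w+)\|([^|]+)\|(\S+)\s+(.*)$")
OX_RE = re.compile(r"\bOX=(\d+)")

# Column names are hypothetical; the real data/seq.csv may use different ones.
FIELDS = ["accession", "entry_name", "taxon_id", "sequence"]


def parse_fasta(handle):
    """Yield one dict per FASTA record read from a text file handle."""
    record = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record is not None:
                record["sequence"] = "".join(chunks)
                yield record
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError(f"unrecognized header: {line}")
            _db, accession, entry_name, description = match.groups()
            taxon = OX_RE.search(description)
            record = {
                "accession": accession,
                "entry_name": entry_name,
                "taxon_id": taxon.group(1) if taxon else "",
            }
            chunks = []
        elif record is not None:
            # Sequence lines are wrapped, so collect and join them later.
            chunks.append(line)
    if record is not None:
        record["sequence"] = "".join(chunks)
        yield record


def write_csv(records, handle):
    """Write parsed records as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Made-up sample input, for illustration only.
SAMPLE = """>sp|FAKE01|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EX PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
>sp|FAKE02|EXAMPLE_MOUSE Another protein OS=Mus musculus OX=10090 PE=2 SV=1
MSDNELLK
"""

records = list(parse_fasta(io.StringIO(SAMPLE)))
out = io.StringIO()
write_csv(records, out)
print(out.getvalue())
```

With Pandas installed, the same list of dicts could be passed to pandas.DataFrame and written with DataFrame.to_csv instead of the csv module.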
Run cs747-build-taxonomy-db
This populates the taxonomy database from sequence data contained in
the sequence CSV by looking up organism entries from the Uniprot
Taxonomy REST API. It then saves the database as a Python pickle file named
data/taxonomy_db.pickle.
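The lookup-and-cache pattern behind this step might look like the sketch below. It is illustrative only: build_taxonomy_db, save_db, load_db, and fake_fetch are hypothetical names, the real REST request is replaced by a stand-in function so the example runs offline, and the record fields are placeholders rather than the actual API response.

```python
import os
import pickle
import tempfile


def build_taxonomy_db(taxon_ids, fetch_taxon):
    """Map each distinct taxon ID to the record returned by fetch_taxon.

    fetch_taxon is a hypothetical callable standing in for the REST lookup.
    """
    db = {}
    for taxon_id in taxon_ids:
        if taxon_id not in db:  # query each organism only once
            db[taxon_id] = fetch_taxon(taxon_id)
    return db


def save_db(db, path):
    """Serialize the taxonomy dictionary to a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(db, fh)


def load_db(path):
    """Load a taxonomy dictionary previously written by save_db."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def fake_fetch(taxon_id):
    """Stand-in for the network call; the returned fields are placeholders."""
    fake_fetch.calls += 1
    return {"taxon_id": taxon_id, "lineage": ["placeholder"]}


fake_fetch.calls = 0

# Duplicate IDs, as would occur when many sequences share an organism.
db = build_taxonomy_db(["9606", "10090", "9606"], fake_fetch)
path = os.path.join(tempfile.mkdtemp(), "taxonomy_db.pickle")
save_db(db, path)
print(len(db), "organisms cached after", fake_fetch.calls, "lookups")
```

Deduplicating before the lookup matters because many sequences share the same organism, so one request per distinct taxon keeps the number of API calls small.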
Run cs747-label-data
This labels the sequence data and balances the data for our classes.
It saves the labeled data as a new CSV file named
data/labeled_sequences.csv.
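One common way to balance classes is to undersample every class down to the size of the smallest one. The sketch below shows that idea only; whether cs747-label-data balances this way is an assumption, and the function name, the "label" key, and the toy labels "A" and "B" are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def balance_by_undersampling(rows, label_key="label", seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Toy data: five rows of class "A" and two rows of class "B".
rows = [{"id": i, "label": "A"} for i in range(5)]
rows += [{"id": i, "label": "B"} for i in range(5, 7)]

balanced = balance_by_undersampling(rows)
counts = Counter(row["label"] for row in balanced)
print(dict(counts))
```

Undersampling discards data from the larger classes; oversampling or class weights are alternatives when the smallest class is very small.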