These are IPython notebooks and associated scripts used in a manuscript exploring the classification of human tissues from proteomics data. Several notebooks generate the figures for the manuscript, and each starts from the same base dataframe, so the repository should be fairly self-contained with that dataframe as the starting point. See requirements.txt for the required software packages and versions.
Please unzip FullPeptideQuant.txt.zip in your local repository to reproduce all the results.
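If a command-line unzip tool is not available, the archive can also be extracted from Python. The snippet below is a minimal sketch using only the standard library; the helper name `unzip_archive` is hypothetical and not part of this repository, and it assumes the archive sits in the current working directory.

```python
import zipfile
from pathlib import Path


def unzip_archive(archive_path, dest_dir="."):
    """Extract every member of a zip archive into dest_dir.

    Hypothetical helper (not part of this repository).
    Returns the paths of the extracted members.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
        return [dest / name for name in zf.namelist()]


if __name__ == "__main__":
    # Assumes the archive is in the current working directory.
    archive = Path("FullPeptideQuant.txt.zip")
    if archive.exists():
        for path in unzip_archive(archive):
            print(path)
```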
- Feature selection and classifier parameter tuning
- The Trained_Models directory contains the compressed finalized models, which can be loaded directly into a notebook. This lets you use the finalized models without re-running the time-intensive grid searches.
- Code to load and pre-process data
- Train and test various classifiers. The goal of these classifiers is to see whether we can correctly predict the source tissue of a proteomics sample.
- Classifying blood plasma and serum with increasingly large training sets
- Code to produce tSNE and PCA plots for train and test data
- Includes plots showing diseased samples as open circles, and plots with cell line datasets
- Analysis of peptide variability across tissues. Used to create a figure displaying the distinct expression patterns of four archetypal peptides.
- Code corresponding to 'Minimal Classifiers'
- Testing how classifiers perform on test data with low-abundance peptides removed
- Contains utility functions to perform basic classification processes, including cross-validation and grid searches
- Script to create dataframes
- Ensures all dataframes go through the same cleaning and transformation steps:
  - Log2 transform
  - Impute missing values
  - Remove peptides not contained in at least 5 samples of at least 1 tissue
  - Median normalize
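
The cleaning steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's script: the function name `preprocess`, the samples-by-peptides layout, the choice to impute with the global minimum observed value, and the per-sample median subtraction are all assumptions. The presence filter is computed before imputation, since afterwards every peptide would appear to be present in every sample.

```python
import numpy as np
import pandas as pd


def preprocess(raw, tissues, min_samples=5):
    """Hypothetical sketch of the four cleaning steps.

    raw:     DataFrame of intensities, rows = samples, columns = peptides
             (NaN or 0 means the peptide was not observed). Assumed layout.
    tissues: Series of tissue labels, indexed like raw.
    """
    # Log2 transform; zeros are treated as missing so the log is defined.
    logged = np.log2(raw.where(raw > 0))

    # Count, per tissue, the samples in which each peptide was observed,
    # and mark peptides seen in at least min_samples samples of any tissue.
    counts = logged.notna().groupby(tissues).sum()
    keep = counts.ge(min_samples).any(axis=0)

    # Impute missing values (assumption: global minimum observed value).
    imputed = logged.fillna(logged.min().min())

    # Remove peptides that failed the presence filter.
    filtered = imputed.loc[:, keep]

    # Median normalize (assumption: subtract each sample's median).
    return filtered.sub(filtered.median(axis=1), axis=0)
```

After this transformation every sample has a median of zero on the log2 scale, so samples with different overall intensities become comparable.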