CSE 283 / BENG 203 Project: Spring 2021
Repository structure
notebooks/
Contains Jupyter notebooks that are used in processing the data, generating some of the plots, running classifiers, etc.
scripts/
Contains one-off scripts used for various tasks (generating some of the plots,
computing metrics of the dataset, etc.). The PCA_tSNE_analysis.R
and
cf_matrix.py
scripts were written by Sara Rahiminejad.
data/
sources
All of the files located in this folder are associated with Zhou et al. 2019.
The files prefixed with GSE131512_
were
downloaded from the "supplementary files" section on
NCBI GEO (accession number GSE131512). Some additional files come from other sources; these are described below.
preselectedEnsemblIDList.txt
This list was downloaded from this GitHub repository, which was linked in the course slides. This list contains 750 Ensembl IDs, which we understand correspond to the gene names listed in Table S5 of the supplementary materials of Zhou et al. 2019.
While we were working on this project, Ensembl was down, so for the sake of convenience we just used this list rather than convert the gene names in Table S5 to Ensembl IDs ourselves.
limma_output.tsv
This was generated by running limma on the 67 training cancer samples.
Metadata files
- The
geo_metadata.tsv
file was converted from theGSE131512_metaData.xlsx
file to simplify loading this data in scripts. - The
table_s3.tsv
file was copied from Table S3 in the supplementary materials of Zhou et al. 2019 to simplify mapping sample IDs to their status (C-R, C-N, N). - The
table_s4.tsv
file was copied from Table S4 in the supplementary materials of Zhou et al. 2019 to simplify mapping sample IDs from patients with cancer to their particular cancer subtype, chemotherapy status, and other provided medical metadata. - The
merged_metadata.tsv
file is created in the01-MergeMetadata.ipynb
notebook, and includes all of the information from the GEO, Table S3, and Table S4 metadata files. - The
final_metadata.tsv
file is created in the02-TrainTestSplit.ipynb
notebook, and includes everything inmerged_metadata.tsv
plus assignments of each sample to training vs. test data. Please see the train/test split notebook for more information.
Note about "starting" data
We note that, although a full analysis using this data would ideally start from scratch with the raw sequencing data, we have instead started our work with the gene frequency tables derived from the sequencing data. We have done this for the sake of convenience due to the time constraints inherent to this project.
System Requirements
All of the notebooks/
, with the exception of the CoDaCoRe notebook, are best
run using a QIIME 2 2020.6 notebook with Songbird installed.
The CoDaCoRe notebook should probably be run from a separate conda environment with the py-codacore package installed. See https://github.com/egr95/py-codacore.
The python and R scripts in scripts/
rely on various packages; please consult
the scripts' first few lines for details.