/public_cancer_data

This is a repo of cancer data from TCGA and CPTAC, post-processed in a uniform way to filter out outliers and produce datasets that are amenable to downstream statistical analyses.

PanCan data

This is a dataset of RNA, CpG, and protein data for the CPTAC3 patient datasets. We were using this in a few studies and realised that it may be useful to other people. 

Pan can data selection

Data were downloaded for all CPTAC3 studies from the CPTAC portal, specifically the studies that had at least five cases and included a protein assembly. For these studies, the clinical and biospecimen data were downloaded along with the protein summary file, which contains the protein data processed and normalised by the CPTAC data processing pipeline. The accompanying gene expression and DNA methylation data were downloaded from TCGA by selecting CPTAC3 data and filtering for Solid Tissue Normal or Primary Tumour samples, transcriptome profiling counts data, and DNA methylation array data. The data were downloaded on 18 July 2023.

The cancers were filtered to include only the following primary diseases: Acute Myeloid Leukemia, Breast Invasive Carcinoma, Clear Cell Renal Cell Carcinoma, Head and Neck Squamous Cell Carcinoma, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Pancreatic Ductal Adenocarcinoma, and Uterine Corpus Endometrial Carcinoma. Samples not associated with a primary disease, or with non-descriptive classifications such as “Not Clear Cell Renal Cell Carcinoma”, were omitted. Of the selected cancers, only Clear Cell Renal Cell Carcinoma, Head and Neck Squamous Cell Carcinoma, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Pancreatic Ductal Adenocarcinoma, and Uterine Corpus Endometrial Carcinoma had RNAseq and DNA methylation data. Cases were retained for a cancer if the case had an entry in the clinical information supplied by the CPTAC and TCGA portals.

Tumour stages were consolidated into four classifications under “TumorStage”: Stage I (Stage I, Stage IA, Stage IB, Stage IA3), Stage II (Stage II, Stage IIA, Stage IIB), Stage III (Stage III, Stage IIIA, Stage IIIB), and Stage IV (Stage IV, Stage IVA, Stage IVB). This classification was further grouped into early stage (Stage I and Stage II) and late stage (Stage III and Stage IV).

The majority of cases had multiple associated files; these were further filtered by biospecimen type to include only Solid Tissue specimens.
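
The stage consolidation described above can be sketched as a simple lookup. This is a minimal illustration, not the code used to build the dataset: the names STAGE_GROUPS and consolidate_stage are hypothetical, and returning (None, None) for labels outside the listed substages is an assumption.

```python
# Substage labels listed in the text, grouped into the four consolidated stages.
STAGE_GROUPS = {
    "Stage I": ["Stage I", "Stage IA", "Stage IB", "Stage IA3"],
    "Stage II": ["Stage II", "Stage IIA", "Stage IIB"],
    "Stage III": ["Stage III", "Stage IIIA", "Stage IIIB"],
    "Stage IV": ["Stage IV", "Stage IVA", "Stage IVB"],
}

# Invert the grouping into a substage -> consolidated stage lookup.
SUBSTAGE_TO_STAGE = {
    substage: stage
    for stage, substages in STAGE_GROUPS.items()
    for substage in substages
}

EARLY_STAGES = {"Stage I", "Stage II"}


def consolidate_stage(label):
    """Return (consolidated stage, 'early' or 'late') for a raw stage label.

    Labels not listed in the text map to (None, None); this is an assumption.
    """
    stage = SUBSTAGE_TO_STAGE.get(label.strip()) if label else None
    if stage is None:
        return None, None
    return stage, ("early" if stage in EARLY_STAGES else "late")
```

For example, consolidate_stage("Stage IIIA") gives ("Stage III", "late").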

DNA methylation processing

Data were further filtered to check for outliers prior to running differential analyses. For the DNA methylation data, beta values of 1.0 were replaced with 0.999 and beta values of 0 were replaced with 0.001. CpGs with an average methylation across all samples of > 5% and < 95% were retained. Correlation between samples was then calculated using Pearson’s correlation, and samples with a Pearson’s correlation > 3 s.d. from the median correlation for each sample type were removed (with the exception of Pancreatic Ductal Adenocarcinoma and Lung Adenocarcinoma, where a cutoff of 2 s.d. was used because PCA showed that outlier samples affected the principal components). CpGs with missing data in 50% of samples were also removed, before null values were replaced with 0.001.
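
The CpG-level steps above (clamping, mean filter, missing-data filter, null replacement) could look roughly like the following NumPy sketch. It leaves out the sample correlation step, and it is not the repository's code: the function name filter_beta_values is hypothetical, the input is assumed to be a CpG-by-sample array with NaN for missing values, and reading "missing in 50% of samples" as "50% or more" is an assumption.

```python
import numpy as np


def filter_beta_values(beta, low=0.001, high=0.999,
                       min_mean=0.05, max_mean=0.95, max_missing=0.5):
    """Filter a beta-value matrix (rows = CpGs, columns = samples, NaN = missing).

    Returns the filtered matrix and the boolean mask of CpGs that were kept.
    """
    beta = np.array(beta, dtype=float)  # copy so the caller's data is untouched

    # Replace exact 0 and 1 so that later logit-style transforms stay finite.
    beta[beta == 0.0] = low
    beta[beta == 1.0] = high

    # Keep CpGs whose mean methylation is strictly between 5% and 95%.
    mean_beta = np.nanmean(beta, axis=1)
    keep = (mean_beta > min_mean) & (mean_beta < max_mean)

    # Drop CpGs missing in at least `max_missing` of samples (assumed threshold).
    missing_fraction = np.isnan(beta).mean(axis=1)
    keep &= missing_fraction < max_missing

    # Fill the remaining nulls with the low beta value, as described in the text.
    filtered = beta[keep]
    filtered[np.isnan(filtered)] = low
    return filtered, keep
```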

Gene expression processing

For RNAseq data, genes with mean counts <= 10 across samples were removed before calculating the correlation between samples for each sample type (i.e. tumour and solid tissue normal). Samples with a median Pearson’s correlation > 3 s.d. from the median were removed. Genes with missing data in 50% of samples were also removed, before null values were replaced with 0. Each cancer was visually inspected using PCA to confirm that tumour and normal samples were separable within the cancer. Finally, for patients with more than one sample passing the QC thresholds, only one tumour sample and one normal sample were retained for each of DNA methylation and gene expression.
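
The sample-level outlier check can be sketched as follows, assuming a feature-by-sample matrix for a single sample type (tumour or normal) with no missing values. The function name correlation_keep_mask is hypothetical. The sketch takes each sample's median Pearson correlation with the other samples, then flags samples whose value lies more than n_sd standard deviations from the median of those values. It flags deviations in both directions, which is an assumption about how the cutoff was applied.

```python
import numpy as np


def correlation_keep_mask(values, n_sd=3.0):
    """Return a boolean mask of samples to keep.

    values: 2-D array, rows = features (genes, CpGs or proteins),
            columns = samples of one sample type.
    """
    values = np.asarray(values, dtype=float)

    # Sample-by-sample Pearson correlation matrix (columns are the variables).
    corr = np.corrcoef(values, rowvar=False)
    np.fill_diagonal(corr, np.nan)  # ignore each sample's self-correlation

    # Median correlation of each sample with all other samples.
    sample_median = np.nanmedian(corr, axis=0)

    centre = np.median(sample_median)
    spread = np.std(sample_median)
    return np.abs(sample_median - centre) <= n_sd * spread
```

The function would be run once per sample type and per cancer, for example with n_sd=2.0 where a stricter cutoff is wanted.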

Protein data processing

For the protein data, the data as processed by CPTAC were used; these are normalised and have been assigned to genes in a consistent fashion across cancers. For Lung Adenocarcinoma, samples with case IDs not fitting the standard convention were omitted, namely 11LU013_Tumor_Protein_CPT0053040004, 11LU016_Tumor_Protein_CPT0052940004, 11LU022_Tumor_Protein_CPT0052170004, and 11LU035_Tumor_Protein_CPT0051690004. Genes with 0s in more than 50% of samples were omitted. Correlation between samples for each sample type (i.e. tumour and solid tissue normal) was calculated using Pearson’s correlation. Samples with a median correlation less than the median minus 3 s.d. were removed. Missing protein data were imputed using the DreamAI ensemble method (https://github.com/WangLab-MSSM/DreamAI). Samples exhibited a high correlation post imputation, with tumour and normal samples clustering distinctly, and as such no protein samples were removed after imputation. The supplied protein names were mapped to HGNC symbols using biomart mappings (scibiomart, 1.0.2, https://github.com/ArianeMora/scibiomart); those without direct mappings were mapped using the external_synonym.

Pan can dataset generation

For the PanCan dataset, the filtered and imputed protein, gene expression, and DNA methylation datasets were joined by gene name, Ensembl ID, and CpG ID respectively. An inner join on the ID was used for all datasets. The protein data were mean-shifted to centre at 0 for each cancer and then, once joined, shifted by the minimum across all cancer datasets.
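
A minimal pandas sketch of the join and the protein shift is given below. It assumes one table per cancer with IDs as the index and uniquely named sample columns. The function names are hypothetical, and the text leaves two details open that are treated as assumptions here: each cancer is centred using a single overall mean, and the final shift subtracts the minimum of the joined table so that the smallest value becomes 0.

```python
from functools import reduce

import pandas as pd


def join_cancers(frames):
    """Inner-join per-cancer tables (index = ID, columns = samples) on the index."""
    return reduce(lambda left, right: left.join(right, how="inner"), frames)


def centre_and_shift_protein(frames):
    """Centre each cancer's protein table at 0, join, then shift by the minimum."""
    # Assumption: one overall mean per cancer is subtracted from every value.
    centred = [df - df.to_numpy().mean() for df in frames]
    joined = join_cancers(centred)
    # Assumption: the minimum is taken over the joined table and subtracted.
    return joined - joined.to_numpy().min()
```

Only IDs present in every cancer survive the inner join, so the PanCan table can be considerably smaller than any single-cancer table.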

Pan can differential analysis

Differential analysis was performed for all cancers (not including ccRCC). For the pan-can analysis (Head and Neck Squamous Cell Carcinoma, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Pancreatic Ductal Adenocarcinoma), disease was used as a factor. For pan-can RNA-seq, DESeq2 was used to identify genes that differ significantly between tumour and normal samples, with the design ~disease + condition_id, where condition_id indicates whether the sample was primary tumour or normal. For differential methylation analysis, the missMethyl pipeline was used: we tested for differential methylation using the lmFit and eBayes functions from limma, with M values as input (calculated as log2(beta / (1 - beta))). Again, disease was used as a factor in the design matrix for the pan-can analysis. Finally, for the proteomics differential expression analysis we also used the limma pipeline with the same design matrix.
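
For reference, the conversion from beta values to M values can be sketched in a few lines. This assumes the standard logit-style definition, M = log2(beta / (1 - beta)), and the function name beta_to_m is hypothetical. It also shows why beta values of exactly 0 and 1 are replaced with 0.001 and 0.999 beforehand: the transform is undefined at those endpoints.

```python
import numpy as np


def beta_to_m(beta):
    """Convert beta values in the open interval (0, 1) to M values."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))
```

A beta value of 0.5 maps to an M value of 0, and the clamped extremes 0.001 and 0.999 map to roughly -9.96 and +9.96.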

Tutorial on data download

See tutorial folder!