New merge files for s3 v1 folder

Question

Closed this issue a year ago · 9 comments

Please re-merge the following files using the bs ids present in the Hope-GBM-histologies-base.tsv file. Note, we copied the CPTAC-GBM sample_id to the Kids_First_Biospecimen_ID column as well for ease.
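For anyone reproducing the re-merge, the cohort subsetting described above can be sketched roughly as follows. This is only an illustration using the standard library, not the script actually used for the merge. The helper names and the toy IDs are made up; the only names taken from this issue are the Kids_First_Biospecimen_ID and sample_id columns.

```python
import csv
import io


def load_biospecimen_ids(histologies_tsv, id_column="Kids_First_Biospecimen_ID"):
    """Return the set of biospecimen IDs listed in a histologies TSV file object."""
    reader = csv.DictReader(histologies_tsv, delimiter="\t")
    return {row[id_column] for row in reader if row[id_column]}


def subset_to_cohort(rows, keep_ids, id_column="Kids_First_Biospecimen_ID"):
    """Keep only the rows whose biospecimen ID belongs to the cohort."""
    return [row for row in rows if row[id_column] in keep_ids]


# Toy stand-in for Hope-GBM-histologies-base.tsv (all IDs are hypothetical).
histologies = io.StringIO(
    "Kids_First_Biospecimen_ID\tsample_id\n"
    "BS_AAAA\tSAMPLE_1\n"
    "CPTAC_1\tCPTAC_1\n"  # CPTAC sample_id copied into the ID column
)

# Toy stand-in for rows of a merged file (Hugo_Symbol is just example payload).
merged_rows = [
    {"Kids_First_Biospecimen_ID": "BS_AAAA", "Hugo_Symbol": "TP53"},
    {"Kids_First_Biospecimen_ID": "BS_ZZZZ", "Hugo_Symbol": "EGFR"},
    {"Kids_First_Biospecimen_ID": "CPTAC_1", "Hugo_Symbol": "PTEN"},
]

cohort_ids = load_biospecimen_ids(histologies)
cohort_rows = subset_to_cohort(merged_rows, cohort_ids)
```

Because the CPTAC-GBM sample_id values were copied into Kids_First_Biospecimen_ID, a single ID column is enough to subset both cohorts.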

I expect every file to change between the reruns and the cohort update made to match Nicole's files, except possibly the RSEM and splice files (Bo generated the latter).

Hope-and-CPTAC-GBM-gene-expression-rsem-tpm-collapsed.rds
Hope-and-CPTAC-GBM.splice-events-rmats.tsv.gz
Hope-cnv-controlfreec-tumor-only.rds
Hope-cnv-controlfreec.rds
Hope-consensus-mutation.maf.tsv.gz
Hope-fusion-putative-oncogenic.rds
Hope-gene-counts-rsem-expected_count-collapsed.rds
Hope-gene-counts-rsem-expected_count.rds
Hope-gene-expression-rsem-tpm-collapsed.rds
Hope-gene-expression-rsem-tpm.rds
Hope-mutect2-mutation-tumor-only.maf.tsv.gz
Hope-sv-manta.tsv.gz

For the Hope-mutect2-mutation-tumor-only.maf.tsv.gz file, please also filter OUT variants with t_alt_count == 0 and variants where t_depth < 5. No filtering was done in this beta workflow, so I am setting a minimal threshold here to remove some potential false positives.
For the Hope-fusion-putative-oncogenic.rds, please also include filtering for TPM per #26.
For methylation, we will not make a merged matrix, but we will replace the current methyl aliquot file for 7316-5928 (Normal) with a Tumor aliquot file once we figure out which came from Thalamus.
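To make the tumor-only MAF filter above concrete, here is a minimal sketch of the rule: drop variants with t_alt_count == 0 and variants with t_depth < 5. It assumes both columns hold plain integer strings, and the function names and toy variants are hypothetical. The TPM filter for fusions is not sketched because its details live in #26.

```python
def passes_tumor_only_filter(variant, min_depth=5):
    """Return True if a MAF record survives the requested filter.

    A variant is filtered OUT when t_alt_count == 0 or t_depth < min_depth.
    Assumes both fields parse as integers (an assumption, not checked here).
    """
    t_alt_count = int(variant["t_alt_count"])
    t_depth = int(variant["t_depth"])
    return t_alt_count != 0 and t_depth >= min_depth


def filter_maf(variants, min_depth=5):
    """Keep only the variants that pass the filter, preserving their order."""
    return [v for v in variants if passes_tumor_only_filter(v, min_depth)]


# Toy records; values are made up for illustration only.
toy_variants = [
    {"Hugo_Symbol": "TP53", "t_alt_count": "12", "t_depth": "40"},  # kept
    {"Hugo_Symbol": "EGFR", "t_alt_count": "0", "t_depth": "35"},   # no alt reads
    {"Hugo_Symbol": "PTEN", "t_alt_count": "2", "t_depth": "4"},    # depth below 5
    {"Hugo_Symbol": "NF1", "t_alt_count": "1", "t_depth": "5"},     # kept (depth == 5)
]

kept = filter_maf(toy_variants)
```

Note that a depth of exactly 5 is kept, since only t_depth < 5 is excluded.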

Question for @komalsrathi and @mkoptyra - do we want a consensus CNV file for the samples profiled by T/N CNV?

Anything I am missing @zzgeng @komalsrathi @mkoptyra ?

Answer 1 · 2023-08-30T19:02:43.000Z

Seems like the fusion and expression files have recently been updated? Aug 14 is the last update and I have files from Jan, so I am going to update all files.

Answer 2 · 2023-08-30T19:42:09.000Z

@jharenza, I am confused about the following, so could you please clarify which filenames are appropriate:

Tumor-only MAF: Hope-tumor-only-snv-mutect2.vep.maf.tsv.gz or Hope-mutect2-mutation-tumor-only.maf.tsv.gz
T/N MAF: Hope-snv-consensus-plus-hotspots.maf.tsv.gz or Hope-consensus-mutation.maf.tsv.gz

Answer 3 · 2023-08-30T19:55:36.000Z

Sorry, I was looking at the data I currently had, which were RDS files and may have been outdated anyway. I would prefer:

Tumor-only MAF: Hope-tumor-only-snv-mutect2.maf.tsv.gz (we can drop the "vep" since all files have VEP annotations)
T/N MAF: Hope-snv-consensus-plus-hotspots.maf.tsv.gz

Answer 4 · 2023-08-30T22:00:58.000Z

For the Hope-mutect2-mutation-tumor-only.maf.tsv.gz file, please also filter OUT variants with t_alt_count == 0 and variants where t_depth <5. There was no filtering done in this beta workflow, so setting a small amount here to remove some potential FPs.

For the Hope-fusion-putative-oncogenic.rds, please also include filtering for TPM per #26.

Done!

@zzgeng I have remerged all files and uploaded them to s3 + md5sums. Could you please check if everything looks ok? Thanks!
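For checking the uploads, a rough sketch of how the md5sums could be verified locally is below. It assumes the checksums are in the usual `md5sum` output format (hash, whitespace, filename), which is an assumption about the manifest rather than something stated in this thread. The function names are hypothetical.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_manifest(text):
    """Parse md5sum-style lines ("<hash>  <filename>") into {filename: hash}."""
    expected = {}
    for line in text.splitlines():
        if line.strip():
            checksum, name = line.split(maxsplit=1)
            # md5sum marks binary mode with a leading "*" on the filename.
            expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def find_mismatches(expected, actual):
    """Return the sorted filenames whose checksum is missing or different."""
    return sorted(name for name, checksum in expected.items()
                  if actual.get(name) != checksum)
```

An empty list from `find_mismatches` means every file listed in the manifest matched.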

Answer 5 · 2023-08-30T22:42:12.000Z

@komalsrathi the md5sums all check out. I just updated the base histologies file to update the methylation samples, so I think we are good there. Now it is on to @zzgeng to confirm the files have the samples required and to finish subtyping!

Answer 6 · 2023-08-31T01:09:08.000Z

Hello @komalsrathi! Thank you for doing this! I think some columns are missing from the cnv-controlfreec and cnv-controlfreec-tumor-only .rds files. In OpenPedCan, there are 11 columns. In the Hope dataset, we only have four: Kids_First_Biospecimen_ID, copy number, status, and gene symbol. For the TP53 classifier, one script requires gene coordinates to proceed. Could you add this information? Thank you so much!

Answer 7 · 2023-08-31T12:13:46.000Z

Let me check; I may have removed them in the process of adding the gene symbols. I'll add them back in and update.

Answer 8 · 2023-08-31T12:30:30.000Z

Updated on s3.

I added all columns that are present in the OpenPedCan file (Kids_First_Biospecimen_ID, chr, start, end, copy number, status, genotype, uncertainty, WilcoxonRankSumTestPvalue, KolmogorovSmirnovPvalue) except for tumor_ploidy, since I am not sure how it was added to the OpenPedCan file. There is also an additional column called gene_symbol that I mapped to the coordinates for my downstream analyses, so you can ignore that.

Please let me know if this will work for the classifier.
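A quick way to confirm the header before running the classifier could look like the sketch below. The expected column names are spelled as written in this thread (for example "copy number"), so the actual names in the .rds files may differ. The helper name is hypothetical.

```python
# Columns listed in this thread; exact spellings in the real files are an assumption.
EXPECTED_CNV_COLUMNS = [
    "Kids_First_Biospecimen_ID", "chr", "start", "end", "copy number",
    "status", "genotype", "uncertainty",
    "WilcoxonRankSumTestPvalue", "KolmogorovSmirnovPvalue",
]


def missing_columns(header, expected=EXPECTED_CNV_COLUMNS):
    """Return the expected columns absent from a header, in expected order."""
    present = set(header)
    return [column for column in expected if column not in present]


# The four-column header reported before the fix.
old_header = ["Kids_First_Biospecimen_ID", "copy number", "status", "gene symbol"]

# The updated header: all expected columns plus the extra gene_symbol column.
new_header = EXPECTED_CNV_COLUMNS + ["gene_symbol"]

still_missing_before = missing_columns(old_header)
still_missing_after = missing_columns(new_header)
```

Extra columns such as gene_symbol are ignored by this check; only absent expected columns are reported.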

Answer 9 · 2023-08-31T18:32:17.000Z

Everything looks good: the classifier ran and the results were reproducible.