New merge files for s3 v1 folder
Closed this issue · 9 comments
Please re-merge the following files using the bs ids present in the Hope-GBM-histologies-base.tsv file. Note, we copied the CPTAC-GBM sample_id to the Kids_First_Biospecimen_ID column as well for ease.
I think there will be changes to all files except possibly RSEM and splice files (latter Bo had generated) between the reruns and cohort update to match Nicole's files.
Hope-and-CPTAC-GBM-gene-expression-rsem-tpm-collapsed.rds
Hope-and-CPTAC-GBM.splice-events-rmats.tsv.gz
Hope-cnv-controlfreec-tumor-only.rds
Hope-cnv-controlfreec.rds
Hope-consensus-mutation.maf.tsv.gz
Hope-fusion-putative-oncogenic.rds
Hope-gene-counts-rsem-expected_count-collapsed.rds
Hope-gene-counts-rsem-expected_count.rds
Hope-gene-expression-rsem-tpm-collapsed.rds
Hope-gene-expression-rsem-tpm.rds
Hope-mutect2-mutation-tumor-only.maf.tsv.gz
Hope-sv-manta.tsv.gz
- For the Hope-mutect2-mutation-tumor-only.maf.tsv.gz file, please also filter OUT variants with t_alt_count == 0 and variants where t_depth <5. There was no filtering done in this beta workflow, so setting a small amount here to remove some potential FPs.
- For the Hope-fusion-putative-oncogenic.rds, please also include filtering for TPM per #26.
- For methylation, we will not make a merged matrix, but we will replace the current methyl aliquot file for 7316-5928 (Normal) with a Tumor aliquot file once we figure out which came from Thalamus.
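For reference, the tumor-only MAF filter requested above could be sketched roughly as follows. This is only an illustration, not the code used for the merge: the function names are hypothetical, and dropping rows with missing or non-numeric counts is an assumption rather than something specified in this issue.

```python
import csv
import io

MIN_DEPTH = 5  # from the request: drop variants where t_depth < 5


def keep_variant(row, min_depth=MIN_DEPTH):
    """Return True if a MAF row passes both requested filters."""
    try:
        t_alt_count = int(row["t_alt_count"])
        t_depth = int(row["t_depth"])
    except (KeyError, TypeError, ValueError):
        # Assumption: rows with missing or non-numeric counts are dropped.
        return False
    return t_alt_count != 0 and t_depth >= min_depth


def filter_maf(in_handle, out_handle, min_depth=MIN_DEPTH):
    """Copy a tab-delimited MAF, keeping passing rows; return (kept, dropped)."""
    # Skip comment lines such as "#version 2.4" that some MAFs carry.
    lines = (line for line in in_handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",
    )
    writer.writeheader()
    kept = dropped = 0
    for row in reader:
        if keep_variant(row, min_depth):
            writer.writerow(row)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Small in-memory demo; a real file could be opened with gzip.open(path, "rt").
demo = "Hugo_Symbol\tt_depth\tt_alt_count\nTP53\t40\t12\nEGFR\t3\t2\nPTEN\t25\t0\n"
demo_out = io.StringIO()
demo_counts = filter_maf(io.StringIO(demo), demo_out)
```

Returning the kept and dropped counts makes it easy to report how many variants the filter removed.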
Question for @komalsrathi and @mkoptyra - do we want a consensus CNV file for the samples profiled by T/N CNV?
Anything I am missing @zzgeng @komalsrathi @mkoptyra ?
Seems like the fusion and expression files have recently been updated? Aug 14 is the last update and I have files from Jan, so I am going to update all files.
@jharenza, I am confused reg. the following so could you please clarify which filenames are appropriate:
- Tumor-only MAF: Hope-tumor-only-snv-mutect2.vep.maf.tsv.gz or Hope-mutect2-mutation-tumor-only.maf.tsv.gz
- T/N MAF: Hope-snv-consensus-plus-hotspots.maf.tsv.gz or Hope-consensus-mutation.maf.tsv.gz
Sorry, I was looking at the data I had currently, and they had RDS, so may have been outdated anyway. Would prefer:
- Tumor-only MAF: Hope-tumor-only-snv-mutect2.maf.tsv.gz (can drop the VEP since all have VEP annots)
- T/N MAF: Hope-snv-consensus-plus-hotspots.maf.tsv.gz
For the Hope-mutect2-mutation-tumor-only.maf.tsv.gz file, please also filter OUT variants with t_alt_count == 0 and variants where t_depth <5. There was no filtering done in this beta workflow, so setting a small amount here to remove some potential FPs.
For the Hope-fusion-putative-oncogenic.rds, please also include filtering for TPM per #26.
Done!
@zzgeng I have remerged all files and uploaded them to s3 + md5sums. Could you please check if everything looks ok? Thanks!
@komalsrathi md5sums all check. I did update the base histologies to update the methylation samples just now, so I think we are good there - now onto @zzgeng to confirm the files have the samples required and to finish subtyping!
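In case it helps anyone repeating the md5sum check after a future re-upload, here is a minimal sketch of one way to do it. It assumes the checksums are stored in the usual md5sum text layout ("<hex>  <filename>" per line); the function names are hypothetical and this is not necessarily how the check above was run.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines ("<hex>  <filename>") into {filename: hex}."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading "*".
        expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def find_mismatches(expected, actual):
    """Return the sorted filenames whose checksum is missing or differs."""
    return sorted(
        name for name, checksum in expected.items() if actual.get(name) != checksum
    )


# In-memory example; for a downloaded file use open(path, "rb") instead.
demo_digest = md5_of_stream(io.BytesIO(b"demo"))
```

Reading in chunks keeps memory use flat even for the larger merged files.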
Hello @komalsrathi ! Thank you for doing this! I think there are some columns missing in the cnv-controlfreec/cnv-controlfreec-tumor-only.rds files. In OpenPedCan, there are 11 columns. In the Hope dataset, we only have four: Kids_First_Biospecimen_ID, copy number, status, and gene symbol. For the tp53 classifier, one script requires gene coordinates to proceed. I wonder if you can add this information. Thank you so much!
Let me check, I may have removed them in the process of adding the gene symbols. I'll add them back in and update.
Updated on s3.
I added all columns (Kids_First_Biospecimen_ID, chr, start, end, copy number, status, genotype, uncertainty, WilcoxonRankSumTestPvalue, KolmogorovSmirnovPvalue) that are present in the OpenPedCan file except for tumor_ploidy, not sure how it was added in the OpenPedCan file. There is also an additional column called gene_symbol that I mapped to the coordinates for my downstream analyses, so you can ignore that.
Please let me know if this will work for the classifier.
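A quick sanity check along these lines might catch dropped columns before the next upload. This is just a sketch: the column names are copied as written in this thread (the exact spelling in the files may differ, e.g. for "copy number"), the function name is hypothetical, and the header list is assumed to come from whatever tool reads the RDS file.

```python
# Columns listed in this thread as now present in the Hope CNV files.
EXPECTED_CNV_COLUMNS = (
    "Kids_First_Biospecimen_ID",
    "chr",
    "start",
    "end",
    "copy number",
    "status",
    "genotype",
    "uncertainty",
    "WilcoxonRankSumTestPvalue",
    "KolmogorovSmirnovPvalue",
)


def missing_columns(header, expected=EXPECTED_CNV_COLUMNS):
    """Return the expected column names absent from header, in expected order."""
    present = set(header)
    return [name for name in expected if name not in present]


# The four-column layout reported earlier in this thread.
previous_header = ["Kids_First_Biospecimen_ID", "copy number", "status", "gene_symbol"]
previous_missing = missing_columns(previous_header)
```

Extra columns such as gene_symbol are ignored, so only columns the classifier might need are reported.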
Everything looks good, classifier ran and results were reproducible