migrate from cgdsr to cBioPortalData
kmezhoud opened this issue · 4 comments
Dear all,
I suppose that all packages depending on cgdsr will move to cBioPortalData.
Concretely:
1- Is there an equivalent to these 6 commands?
2- Does the structure of the data in cgdsr remain the same in cBioPortalData, with:
- Studies, Cases with clinical data, and Genetic profiles
- The references of the cases and genetic profiles tables, like "gbm_tcga_pub"
At a quick glance, I saw that getCancerStudies is replaced by getStudies...
Thanks
Karim
library(cgdsr)
cgds <- CGDS("http://cbioportal.org/public-portal/")
Studies <- getCancerStudies(cgds)
GenProf <- getGeneticProfiles(cgds, "gbm_tcga_pub")
Cases <- getCaseLists(cgds, "gbm_tcga_pub")
ClinData <- getClinicalData(cgds, "gbm_tcga_pub_all")
ProfData <- getProfileData(cgds, "NF1", "gbm_tcga_pub_mrna", "gbm_tcga_pub_all")
Hi Karim! @kmezhoud
I hope you're well.
cBioPortalData is not a direct migration from cgdsr.
It is mainly an implementation to facilitate data download from the bulk tarballs and the API via cBioDataPack and cBioPortalData, respectively.
Please see the vignette for developers here:
https://waldronlab.io/cBioPortalData/articles/cBioPortalRClient.html
Feel free to post any further questions.
Best regards,
Marcel
Dear Ramos,
Thanks!
Here I will try to compare these two packages and understand their different approaches.
To summarize, the main notes are:
- cgdsr and cBioPortalData use the same hostname, http://www.cbioportal.org/
- In the selected example, cBioPortalData returns empty data where cgdsr returns results.
- In cgdsr, querying clinical data is tied to a sampleListId. cBioPortalData does not use sampleListId to query clinical data.
- Is genePanelId associated with studyId, sampleListId, or molecularProfileId?
I tried to get mutation data for some genes using Entrez IDs or symbols, but without success.
How can we get molecularData if we know the sampleListId, the molecularProfileId, and the gene list?
studyId could remain optional, since sampleListId and molecularProfileId are unique.
Thanks,
Say hello to Levis :-).
Karim
get Studies
library(cgdsr)
library(dplyr) # provides %>%, pull() and first() used below
cgds <- CGDS("http://www.cbioportal.org/")
getCancerStudies.CGDS(cgds) %>%
pull(cancer_study_id) %>%
sort() %>%
head()
[1] "acbc_mskcc_2015" "acc_2019" "acc_tcga" "acc_tcga_pan_can_atlas_2018"
[5] "acyc_fmi_2014" "acyc_jhu_2016"
library(dplyr)
library(cBioPortalData)
cbio <- cBioPortal(
hostname = "www.cbioportal.org",
protocol = "https",
api. = "/api/api-docs"
)
getStudies(cbio) %>%
pull(studyId) %>%
sort() %>%
head()
[1] "acbc_mskcc_2015" "acc_2019" "acc_tcga" "acc_tcga_pan_can_atlas_2018"
[5] "acyc_fmi_2014" "acyc_jhu_2016"
As you can see, the two packages use the same hostname but different protocols (http for cgdsr, https for cBioPortalData).
They return the same list of studies, as a data frame or a tibble.
get Cases & Clinical Data
mycase <- getCaseLists.CGDS(cgds,cancerStudy = "gbm_tcga_pub") %>%
pull(case_list_id) %>%
first()
chr [1:15] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal" "gbm_tcga_pub_expr_neural" ...
## get Clinical Data, we need to specify the case ID or Sample list ID
getClinicalData.CGDS(x = cgds, caseList = mycase) %>%
str()
[1] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal"
[4] "gbm_tcga_pub_expr_neural" "gbm_tcga_pub_expr_proneural" "gbm_tcga_pub_cna"
[7] "gbm_tcga_pub_methylation_all" "gbm_tcga_pub_methylation_hm27" "gbm_tcga_pub_microrna"
[10] "gbm_tcga_pub_mrna" "gbm_tcga_pub_cnaseq" "gbm_tcga_pub_sequenced"
[13] "gbm_tcga_pub_sequenced_nohyper" "gbm_tcga_pub_sequenced_nottreated" "gbm_tcga_pub_sequenced_treated"
In cgdsr, the user has to specify a casesId or sampleListId to get clinical data.
#getSampleInfo(api = cbio, studyId = "gbm_tcga_pub", projection = c("SUMMARY", "ID", "DETAILED", "META"))
# get Cases or Sample list ID
# (assign first, then inspect: piping into str() would store NULL in myCase_cbio)
myCase_cbio <- sampleLists(api = cbio, studyId = "gbm_tcga_pub") %>%
  pull(sampleListId)
str(myCase_cbio)
chr [1:15] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal" "gbm_tcga_pub_expr_neural" ...
## get Clinical data
clinicalData(api = cbio, studyId = "gbm_tcga_pub") %>%
str()
tibble [206 × 24] (S3: tbl_df/tbl/data.frame)
$ patientId : chr [1:206] "TCGA-02-0001" "TCGA-02-0003" "TCGA-02-0004" "TCGA-02-0006" ...
$ DFS_MONTHS : chr [1:206] "4.504109589" "1.315068493" "10.32328767" "9.928767123" ...
$ DFS_STATUS : chr [1:206] "1:Recurred" "1:Recurred" "1:Recurred" "1:Recurred" ...
$ KARNOFSKY_PERFORMANCE_SCORE: chr [1:206] "80.0" "100.0" "80.0" "80.0" ...
$ OS_MONTHS : chr [1:206] "11.60547945" "4.734246575" "11.34246575" "18.34520548" ...
$ OS_STATUS : chr [1:206] "1:DECEASED" "1:DECEASED" "1:DECEASED" "1:DECEASED" ...
$ PRETREATMENT_HISTORY : chr [1:206] "YES" "NO" "NO" "NO" ...
$ PRIOR_GLIOMA : chr [1:206] "NO" "NO" "NO" "NO" ...
$ SAMPLE_COUNT : chr [1:206] "1" "1" "1" "1" ...
$ SEX : chr [1:206] "Female" "Male" "Male" "Female" ...
$ sampleId : chr [1:206] "TCGA-02-0001-01" "TCGA-02-0003-01" "TCGA-02-0004-01" "TCGA-02-0006-01" ...
$ ACGH_DATA : chr [1:206] "YES" "YES" "NO" "YES" ...
$ CANCER_TYPE : chr [1:206] "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" ...
$ CANCER_TYPE_DETAILED : chr [1:206] "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" ...
$ COMPLETE_DATA : chr [1:206] "YES" "YES" "NO" "YES" ...
$ FRACTION_GENOME_ALTERED : chr [1:206] "0.2459" "0.1480" NA "0.2391" ...
$ MRNA_DATA : chr [1:206] "YES" "YES" "YES" "YES" ...
$ MUTATION_COUNT : chr [1:206] "3" "5" NA NA ...
$ ONCOTREE_CODE : chr [1:206] "GBM" "GBM" "GBM" "GBM" ...
$ SAMPLE_TYPE : chr [1:206] "Primary" "Primary" "Primary" "Primary" ...
$ SEQUENCED : chr [1:206] "YES" "YES" "YES" "YES" ...
$ SOMATIC_STATUS : chr [1:206] "Matched" "Matched" "Matched" "Matched" ...
$ TMB_NONSYNONYMOUS : chr [1:206] "2.36904510899" "3.94840851498" NA "0.0" ...
$ TREATMENT_STATUS : chr [1:206] "Untreated" "Untreated" "Untreated" "Untreated" ...
In cBioPortalData, we can get clinical data without specifying a sampleListId. In this case, we get all clinical data for all molecularProfileIds.
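The str() output above shows every clinicalData() column as character, including numeric fields such as OS_MONTHS. A minimal sketch of one way to recover column types, assuming base R's type.convert() is acceptable for this; the small data frame below copies three rows from the output above and is only a stand-in for the real tibble:

```r
# Three rows copied from the clinicalData() output above (all columns are character).
clin <- data.frame(
  patientId = c("TCGA-02-0001", "TCGA-02-0003", "TCGA-02-0004"),
  OS_MONTHS = c("11.60547945", "4.734246575", "11.34246575"),
  MUTATION_COUNT = c("3", "5", NA),
  SEX = c("Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# type.convert() guesses a type per column; as.is = TRUE keeps text as character
# instead of converting it to factor.
clin_typed <- type.convert(clin, as.is = TRUE)
str(clin_typed)
```

Columns with coded values such as "1:DECEASED" stay character under this approach and would need separate handling.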
get Genetic Profiles or Molecular Profiles
getGeneticProfiles.CGDS(cgds,cancerStudy = "gbm_tcga_pub" ) %>%
select(genetic_profile_id, genetic_profile_name, everything()) %>%
str()
'data.frame': 10 obs. of 6 variables:
$ genetic_profile_id : chr "gbm_tcga_pub_cna_rae" "gbm_tcga_pub_cna_consensus" "gbm_tcga_pub_mutations" "gbm_tcga_pub_methylation_hm27" ...
$ genetic_profile_name : chr "Putative copy-number alterations (RAE)" "Putative copy-number alterations (Consensus)" "Mutations" "Methylation (HM27)" ...
$ genetic_profile_description : chr "Putative copy-number calls for all genes in 203 GBM cases. Copy number calls were determined from the Agilent 2"| __truncated__ "Putative copy-number calls for genes implicated in glioblastoma (206 cases). These calls were used for the path"| __truncated__ "Mutation data for targeted sequencing in 91 primary glioblastoma tumor/normal pairs (Phases I/II of the TCGA gl"| __truncated__ "Methylation beta-values (Infinium HumanMethylation27 platform). For genes with multiple methylation probes, the"| __truncated__ ...
$ cancer_study_id : int 100 100 100 100 100 100 100 100 100 100
$ genetic_alteration_type : chr "COPY_NUMBER_ALTERATION" "COPY_NUMBER_ALTERATION" "MUTATION_EXTENDED" "METHYLATION" ...
$ show_profile_in_analysis_tab: chr "true" "true" "true" "false" ...
molecularProfiles(api = cbio, studyId = "gbm_tcga_pub") %>%
select(molecularProfileId, name, everything()) %>%
str()
tibble [10 × 8] (S3: tbl_df/tbl/data.frame)
$ molecularAlterationType : chr [1:10] "COPY_NUMBER_ALTERATION" "COPY_NUMBER_ALTERATION" "MUTATION_EXTENDED" "METHYLATION" ...
$ datatype : chr [1:10] "DISCRETE" "DISCRETE" "MAF" "CONTINUOUS" ...
$ name : chr [1:10] "Putative copy-number alterations (RAE)" "Putative copy-number alterations (Consensus)" "Mutations" "Methylation (HM27)" ...
$ description : chr [1:10] "Putative copy-number calls for all genes in 203 GBM cases. Copy number calls were determined from the Agilent 2"| __truncated__ "Putative copy-number calls for genes implicated in glioblastoma (206 cases). These calls were used for the path"| __truncated__ "Mutation data for targeted sequencing in 91 primary glioblastoma tumor/normal pairs (Phases I/II of the TCGA gl"| __truncated__ "Methylation beta-values (Infinium HumanMethylation27 platform). For genes with multiple methylation probes, the"| __truncated__ ...
$ showProfileInAnalysisTab: logi [1:10] TRUE TRUE TRUE FALSE FALSE TRUE ...
$ patientLevel : logi [1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ molecularProfileId : chr [1:10] "gbm_tcga_pub_cna_rae" "gbm_tcga_pub_cna_consensus" "gbm_tcga_pub_mutations" "gbm_tcga_pub_methylation_hm27" ...
$ studyId : chr [1:10] "gbm_tcga_pub" "gbm_tcga_pub" "gbm_tcga_pub" "gbm_tcga_pub" ...
get Profile Data or molecular Data (mRNA expression) for specific gene list Entrez/Hugo Symbol
library(tictoc)
tic("cgdsr:")
getProfileData.CGDS(x = cgds,
genes = c("NF1", "TP53", "ABL1"),
geneticProfiles = "gbm_tcga_pub_mrna",
caseList = "gbm_tcga_pub_all") %>%
head()
toc()
cgdsr:: 0.515 sec elapsed
get gene Entrez ID from gene Hugo Symbol
# get all genePanelId
all_genePanelId <- genePanels(api = cbio) %>% pull(genePanelId)
## get all genes (Entrez/symbol) from all genePanelId, remove duplicates
all_genes_tbl <- lapply(X = all_genePanelId, function(x) getGenePanel(api = cbio, genePanelId = x)) %>%
  bind_rows() %>%
  distinct()
# group_by(entrezGeneId, hugoGeneSymbol) %>%
# filter(n()>1) %>%
# summarize(n=n(), .groups = "rowwise")
Our_gene_entrez <- all_genes_tbl %>%
filter(hugoGeneSymbol %in% c("NF1", "TP53", "ABL1")) %>%
pull(entrezGeneId)
## [1] 7157 25 4763
tic("cBioPortalData")
molecularData(api = cbio,
molecularProfileIds = "gbm_tcga_pub_mrna",
entrezGeneIds = Our_gene_entrez,
sampleIds = "gbm_tcga_pub_all")
toc()
named list()
cBioPortalData: 0.178 sec elapsed
The output is empty.
Try cBioPortalData, as mentioned in issue #30.
## with Entrez
cBioPortalData(
api = cbio,
studyId = "gbm_tcga",
#genePanelId = "AmpliSeq",
genes = Our_gene_entrez, #c("NF1", "P53", "BRCA1", "BRCA2"),
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all",
by = "entrezGeneId" #, "hugoGeneSymbol"
)
Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
group length is 0 but data length > 0
## with Symbol
cBioPortalData(
api = cbio,
studyId = "gbm_tcga",
#genePanelId = "AmpliSeq",
genes = c("NF1", "P53", "BRCA1", "BRCA2"),
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all",
by = "hugoGeneSymbol"
)
Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
group length is 0 but data length > 0
get mRNA expression with the cBioPortalData function and an existing genePanelId
gbm <- cBioPortalData(
  api = cbio,
  by = "hugoGeneSymbol",
  studyId = "gbm_tcga",
  genePanelId = "IMPACT341",
  molecularProfileIds = "gbm_tcga_pub_mrna" # c("gbm_tcga_rppa", "gbm_tcga_mrna")
)
gbm@ExperimentList@listData$gbm_tcga_pub_mrna@assays@data@listData[[1]] %>%
as.data.frame() %>%
head
gbm@ExperimentList@listData$gbm_tcga_pub_mrna@assays@data@listData[[1]] %>%
rownames() %>%
grepl(.,c("NF1", "TP53","ABL1"))
[1] FALSE FALSE TRUE
ABL1 exists, but NF1 and TP53 do not exist.
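A note on the check above: grepl(pattern, x) uses only the first element of a multi-element pattern, so piping the row names in as the pattern tests just the first row name against the three symbols. The FALSE FALSE TRUE result may therefore say more about the first row name than about whether NF1 and TP53 are present. A minimal sketch of the difference, using hypothetical row names (not the real assay content):

```r
# Hypothetical row names standing in for the assay's rownames(); not real output.
assay_genes <- c("ABL1", "AKT1", "NF1", "TP53")
wanted <- c("NF1", "TP53", "ABL1")

# What grepl(assay_genes, wanted) effectively computes: only the first
# pattern element ("ABL1") is matched against each wanted symbol.
first_only <- grepl(assay_genes[1], wanted)
first_only

# %in% tests each wanted symbol for exact membership in the row names.
present <- wanted %in% assay_genes
names(present) <- wanted
present
```

With these made-up row names, first_only reproduces the FALSE FALSE TRUE pattern even though all three genes are present, which is why an exact membership test is the safer check.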
Try with getDataByGenes
## with Symbol
getDataByGenes(
api = cbio,
studyId = "gbm_tcga",
genes =c("NF1", "P53", "ABL1"),
#genePanelId = NA_character_,
by = "hugoGeneSymbol",
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all"
)
# named list()
## With Entrez
getDataByGenes(
api = cbio,
studyId = "gbm_tcga",
genes = Our_gene_entrez,
#genePanelId = NA_character_,
by = "entrezGeneId",
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all"
)
named list()
ABL1 is not returned!
get mutation
getMutationData.CGDS(x = cgds,
  caseList = "gbm_tcga_pub_all", # case list ID (was "getMutationData", the function name)
  geneticProfile = "gbm_tcga_pub_mutations",
  genes = c("NF1", "TP53", "ABL1")) %>%
  select(entrez_gene_id, gene_symbol, amino_acid_change, everything()) %>%
  head()
getDataByGenes(
api = cbio,
studyId = "gbm_tcga",
genes = Our_gene_entrez,
#genePanelId = NA_character_,
by = "entrezGeneId",
molecularProfileIds = "gbm_tcga_pub_mutations",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all"
)
Error in byGeneList[mutation] <- mutationData(api, molecularProfileIds[mutation], :
replacement has length zero
cBioPortalData(
api = cbio,
studyId = "gbm_tcga",
#genePanelId = "AmpliSeq",
genes = c("NF1", "P53", "BRCA1", "ABL1"),
molecularProfileIds = "gbm_tcga_pub_mutations",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all",
by = "hugoGeneSymbol"
)
Error in byGeneList[mutation] <- mutationData(api, molecularProfileIds[mutation], :
replacement has length zero
Hi Karim, @kmezhoud
Thank you for this comprehensive comparison!
I can add this to the package as a vignette (with attribution ofc) for those looking to migrate their code from cgds to cBioPortalData.
The examples you provided mixed the use of gbm_tcga and gbm_tcga_pub, and that's why you were seeing empty responses.
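One way to catch this kind of mix-up before querying is to check the requested profile IDs against the studyId column that molecularProfiles() returns. The sketch below is an illustration only: profiles_match_study is a hypothetical helper, not part of cBioPortalData, and the data frame copies a few rows from the molecularProfiles() output shown earlier in this thread.

```r
# Rows copied from the molecularProfiles() output shown earlier in this thread.
profiles <- data.frame(
  molecularProfileId = c(
    "gbm_tcga_pub_cna_rae", "gbm_tcga_pub_cna_consensus",
    "gbm_tcga_pub_mutations", "gbm_tcga_pub_methylation_hm27"
  ),
  studyId = "gbm_tcga_pub",
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE only if every requested profile is listed for the study.
profiles_match_study <- function(studyId, molecularProfileIds, profiles) {
  in_study <- profiles$molecularProfileId[profiles$studyId == studyId]
  all(molecularProfileIds %in% in_study)
}

ok_pub <- profiles_match_study("gbm_tcga_pub", "gbm_tcga_pub_mutations", profiles)
ok_mixed <- profiles_match_study("gbm_tcga", "gbm_tcga_pub_mutations", profiles)
c(pub = ok_pub, mixed = ok_mixed)
```

A simple prefix test would not be enough here, because "gbm_tcga_pub_mutations" also starts with "gbm_tcga"; comparing against the studyId column avoids that trap.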
The molecularData operation could use a bit more flexibility in terms of inputs. I will work on a hugoGeneSymbol input.
These are lower-level functions and are not very user friendly. If you're looking to get to the data straightaway, you can simply do:
cbio <- cBioPortal()
gbm_pub <- cBioPortalData(
  cbio, "gbm_tcga_pub",
  genes = c("NF1", "TP53", "ABL1"),
  by = "hugoGeneSymbol",
  molecularProfileIds = "gbm_tcga_pub_mrna"
)
assay(gbm_pub[["gbm_tcga_pub_mrna"]])
Best regards,
Marcel
Update: I've added the ability to query the API for a table of gene symbols:
cbio <- cBioPortal()
queryGeneTable(cbio,
by = "hugoGeneSymbol",
genes = c("NF1", "TP53", "ABL1")
)
and a vignette to allow developers to migrate from cgds to cBioPortalData at https://github.com/waldronlab/cBioPortalData/blob/devel/vignettes/cgdsMigration.Rmd
Your feedback is welcome.
Thanks!
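For quick reference, the function pairs compared in this thread can be collected into a small lookup table. This is a sketch, not an official mapping: the pairs are rough counterparts with different arguments and return types, the getMutationData/mutationData pairing is an assumption based on the mutationData call visible in the error message above, and cbio_equivalent is a hypothetical helper.

```r
# cgdsr calls used in this thread and the cBioPortalData functions they were
# compared against. Rough counterparts only; arguments and outputs differ.
migration_map <- data.frame(
  cgdsr = c(
    "getCancerStudies", "getCaseLists", "getClinicalData",
    "getGeneticProfiles", "getProfileData", "getMutationData"
  ),
  cBioPortalData = c(
    "getStudies", "sampleLists", "clinicalData",
    "molecularProfiles", "molecularData", "mutationData"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the counterpart of one or more cgdsr function names.
cbio_equivalent <- function(cgdsr_fun, map = migration_map) {
  hit <- map$cBioPortalData[match(cgdsr_fun, map$cgdsr)]
  if (anyNA(hit)) {
    stop("no counterpart recorded for: ",
         paste(cgdsr_fun[is.na(hit)], collapse = ", "))
  }
  hit
}

cbio_equivalent(c("getCancerStudies", "getCaseLists"))
```

For most uses, the higher-level cBioPortalData() call shown in the reply above replaces the getProfileData step directly.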