Tissue RNA expression from XML file
Closed this issue · 5 comments
From an email request
Hope you are having a great week and thank you for making this amazing tool!
We had a query about the functions to get and parse XML data.
Specifically, we wanted to extract the Tissue RNA Expression data from the XML file, and would like to know if any built-in function can do that?
The XML tag for that section is:
<rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
If there is no built-in function, could you please suggest how to achieve this with xml2?
The section that you are looking for in the xml file looks like this:
<rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
<data>
<tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
<level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="3.9"/>
<level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="5.4"/>
<level type="RNAExpression" unitRNA="TPM" expRNA="4.4"/>
<RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
<RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
<RNASample sampleId="137" unitRNA="nTPM" expRNA="4.7" sex="Female" age="57"/>
<RNASample sampleId="329" unitRNA="nTPM" expRNA="4.2" sex="Female" age="74"/>
<RNASample sampleId="331" unitRNA="nTPM" expRNA="2.4" sex="Female" age="59"/>
</data>
<data>
<tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
<level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="4.0"/>
<level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="6.6"/>
<level type="RNAExpression" unitRNA="TPM" expRNA="5.2"/>
<RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
<RNASample sampleId="88" unitRNA="nTPM" expRNA="3.8" sex="Female" age="36"/>
<RNASample sampleId="89" unitRNA="nTPM" expRNA="3.6" sex="Female" age="63"/>
</data>
...
With xml2, we just need to construct the right XPath for xml_find_all to get to the desired location. Something like this would help:
library(xml2)

# Read the XML file
xml <- read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")

# Extract the desired information
rna_tissue_exp <- xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data') |>
  as_list()
From there you can choose to extract what you want from the resulting list.
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xml2_1.3.5
loaded via a namespace (and not attached):
[1] compiler_4.3.1 magrittr_2.0.3 cli_3.6.1 tools_4.3.1
[5] pillar_1.9.0 glue_1.6.2 rstudioapi_0.15.0 curl_5.0.2
[9] utf8_1.2.3 fansi_1.0.4 vctrs_0.6.3 lifecycle_1.0.3
[13] rlang_1.1.1 purrr_1.0.2
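For anyone wondering what "extract what you want from the resulting list" could look like, here is a minimal sketch. It assumes that as_list() keeps XML attributes as R attributes and element text as the first list item, and it uses a small inline XML string (values copied from the excerpt above) in place of the real download. The variable names are only illustrative.

```r
library(xml2)

# Inline stand-in for the downloaded file; the <entry> wrapper is an assumption
doc <- read_xml('<entry>
  <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
    <data>
      <tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
      <level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="4.0"/>
      <level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="6.6"/>
      <RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
      <RNASample sampleId="88" unitRNA="nTPM" expRNA="3.8" sex="Female" age="36"/>
    </data>
  </rnaExpression>
</entry>')

rna_tissue_exp <- doc |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data') |>
  as_list()

# Each list element corresponds to one <data> block
first <- rna_tissue_exp[[1]]

# Element text is the first item; XML attributes are R attributes
tissue_name <- first$tissue[[1]]
organ <- attr(first$tissue, "organ")

# Several children share the name "RNASample", so select them by name
samples <- first[names(first) == "RNASample"]
sample_expr <- as.numeric(
  vapply(samples, function(s) attr(s, "expRNA"), character(1))
)
```

Note that `first$level` would only return the first of the `<level>` children, which is why the repeated elements are selected by name here.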
Thank you again for providing a solution to reach the XML tags!
We would greatly appreciate guidance on how to extract the data from each gene's XML rnaExpression section as a data frame, for example:
tissue | sampleId | expRNA | sex | age |
---|---|---|---|---|
Adipose tissue | 86 | 6 | Female | 80 |
... | ... | ... | ... | ... |
Adrenal gland | 87 | 4.7 | Female | 62 |
... | ... | ... | ... | ... |
I think something like this may work for your case. It's not a pretty pipe but it gets the work done.
library(xml2)
# library(dplyr)

# Read the XML file
xml <- read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")

# Extract the desired information
rna_tissue_exp <- xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data')

# Initialize empty lists to store data
tissue_list <- list()
sampleId_list <- list()
expRNA_list <- list()
sex_list <- list()
age_list <- list()

# Loop through each <data> element
for (data_node in rna_tissue_exp) {
  # Extract tissue
  tissue <- xml_text(xml_find_first(data_node, ".//tissue"))

  # Extract sample information (look up the <RNASample> nodes once)
  samples <- xml_find_all(data_node, ".//RNASample")
  sampleId <- xml_attr(samples, "sampleId")
  expRNA <- xml_attr(samples, "expRNA")
  sex <- xml_attr(samples, "sex")
  age <- xml_attr(samples, "age")

  # Append to lists, repeating the tissue name once per sample
  tissue_list <- c(tissue_list, rep(tissue, length(sampleId)))
  sampleId_list <- c(sampleId_list, sampleId)
  expRNA_list <- c(expRNA_list, expRNA)
  sex_list <- c(sex_list, sex)
  age_list <- c(age_list, age)
}

# Create data frame (all columns are character, as returned by xml_attr)
df <- data.frame(
  tissue = unlist(tissue_list),
  sampleId = unlist(sampleId_list),
  expRNA = unlist(expRNA_list),
  sex = unlist(sex_list),
  age = unlist(age_list)
)

# Print the data frame
print(df)
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.2 xml2_1.3.5
loaded via a namespace (and not attached):
[1] utf8_1.2.3 R6_2.5.1 tidyselect_1.2.0 magrittr_2.0.3
[5] glue_1.6.2 tibble_3.2.1 pkgconfig_2.0.3 generics_0.1.3
[9] lifecycle_1.0.3 cli_3.6.1 fansi_1.0.4 vctrs_0.6.3
[13] compiler_4.3.1 rstudioapi_0.15.0 tools_4.3.1 curl_5.0.2
[17] pillar_1.9.0 rlang_1.1.1
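A more compact variant of the loop above, as a rough sketch only: it builds one small data frame per `<data>` node and row-binds them, and it converts expRNA and age to numbers since xml_attr returns character strings. The function name `tissue_rna_samples` is made up for this example and is not part of any package. The inline XML stands in for the real file so the snippet runs without a download.

```r
library(xml2)

# Hypothetical helper: one row per <RNASample> in the tissue RNA section
tissue_rna_samples <- function(doc) {
  data_nodes <- xml_find_all(
    doc,
    '//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]/data'
  )
  rows <- lapply(data_nodes, function(data_node) {
    samples <- xml_find_all(data_node, "./RNASample")
    tissue <- xml_text(xml_find_first(data_node, "./tissue"))
    data.frame(
      tissue = rep(tissue, length(samples)),
      sampleId = xml_attr(samples, "sampleId"),
      expRNA = as.numeric(xml_attr(samples, "expRNA")),
      sex = xml_attr(samples, "sex"),
      age = as.integer(xml_attr(samples, "age")),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Inline stand-in for the downloaded file; the <entry> wrapper is an assumption
doc <- read_xml('<entry>
  <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
    <data>
      <tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
      <RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
      <RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
    </data>
    <data>
      <tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
      <RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
      <RNASample sampleId="88" unitRNA="nTPM" expRNA="3.8" sex="Female" age="36"/>
    </data>
  </rnaExpression>
</entry>')

df <- tissue_rna_samples(doc)
print(df)
```

To cover several genes, the same function could probably be applied to each gene's XML in turn, adding a gene column before binding the results together.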
Thank you so much! Works great and is elegant enough for our use case :)
I believe dplyr is not needed for this:
rna_tissue_exp = xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data')
Thank you. That's what I get for copy-pasting partial code.