The question about metadata
Hello, I have the content of several organic acids as phenotype data. How should I prepare my "metadata"?
$ lccZT0.sorted.bam : num 3610 21900 19400 10900 59600 2330000 49600 18400 1700000 6530 ...
$ lccZT4.sorted.bam : num 13000 21100 29600 4150 30600 2500000 12300 9710 2200000 674 ...
$ lccZT8.sorted.bam : num 9290 28000 20900 8920 26500 2370000 12000 9350 1220000 2270 ...
$ lccZT12.sorted.bam: num 8590 31200 32000 8400 37000 2340000 16200 6350 2540000 7740 ...
$ lccZT16.sorted.bam: num 13800 17000 18700 6720 27200 2860000 14000 9960 2050000 5060 ...
Hi, @409199
Thank you for using BioNERO. ;-)
Sample metadata must be formatted as a data frame with sample names in row names and any relevant variables in columns. Sample names must match the colnames() of your expression matrix.
In your case, it seems like transposing the data frame would do the job (with the t() function).
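For illustration, here is a minimal base R sketch of that transposition. It uses the first three samples and compounds posted in this thread; the expression matrix and its gene IDs (`gene1` to `gene4`) are simulated and purely hypothetical.

```r
# Phenotype table shaped as in the question: compounds in rows, samples in columns
acids <- data.frame(
    lccZT0.sorted.bam = c(3610, 21900, 19400),
    lccZT4.sorted.bam = c(13000, 21100, 29600),
    lccZT8.sorted.bam = c(9290, 28000, 20900),
    row.names = c(
        "Pyruvic_acid", "e_Hydroxypropanoic_acid", "e_Aminobutyric_acid"
    )
)

# t() returns a matrix, so convert the result back to a data frame
metadata <- as.data.frame(t(acids))

# Simulated expression matrix (genes x samples); gene IDs are made up
set.seed(1)
exp <- matrix(
    rpois(12, lambda = 50), nrow = 4,
    dimnames = list(paste0("gene", 1:4), colnames(acids))
)

# Reorder metadata rows to follow the expression columns, then verify the match
metadata <- metadata[colnames(exp), , drop = FALSE]
stopifnot(identical(rownames(metadata), colnames(exp)))
```

After this step, each row of `metadata` is a sample and each column is a numeric variable (one per organic acid).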
However, instead of storing the expression matrix and sample metadata data frame in separate objects, I'd strongly recommend working with SummarizedExperiment objects (see docs here), because it's way easier to have all data in a single object.
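As a rough sketch of what that could look like, assuming the SummarizedExperiment package is installed: the expression values and gene IDs below are simulated, and only the sample names and organic acid values come from this thread.

```r
library(SummarizedExperiment)

samples <- c("lccZT0.sorted.bam", "lccZT4.sorted.bam", "lccZT8.sorted.bam")

# Simulated expression matrix (genes x samples); gene IDs are made up
set.seed(1)
exp <- matrix(
    rpois(12, lambda = 50), nrow = 4,
    dimnames = list(paste0("gene", 1:4), samples)
)

# Sample metadata: one row per sample, one numeric column per compound
metadata <- DataFrame(
    Pyruvic_acid = c(3610, 13000, 9290),
    row.names = samples
)

# Store expression data and sample metadata in a single object
se <- SummarizedExperiment(
    assays = list(counts = exp),
    colData = metadata
)

# Further variables can be added later as new colData columns
se$e_Aminobutyric_acid <- c(19400, 29600, 20900)
colData(se)
```

The assay name `counts` is an arbitrary choice for this example.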
Best,
Fabricio
First of all, thank you very much for your kind reply.
While applying BioNERO, I found that my phenotypic data differ from the data in your tutorial. Your phenotypic data describe the organ each sample comes from, whereas ours are the contents of several organic acids, so I don't know how to prepare the metadata and correlate the modules with organic acid content by following your tutorial.
My exp
                    Pyruvic_acid  e_Hydroxypropanoic_acid  e_Aminobutyric_acid
lccZT0.sorted.bam           3610                    21900                19400
lccZT4.sorted.bam          13000                    21100                29600
lccZT8.sorted.bam           9290                    28000                20900
lccZT12.sorted.bam          8590                    31200                32000
Hi,
Could you please store your expression data and sample metadata in a SummarizedExperiment object? The way it is, it's very hard to figure out what your data looks like, especially because you're pasting screenshots, and not a reprex.
Right now, I don't understand what you have as gene IDs, and if sample IDs match between your expression matrix and your sample metadata data frame.
Best,
Fabricio
First of all, thank you very much for your quick reply.
I've stored the expression data and sample metadata in a SummarizedExperiment object. After carefully reading the other questions you answered, I found that the core problem is that my variables (organic acid contents) are continuous, while the variables in your examples are categorical (roots, stems, and leaves). Can BioNERO correlate modules with continuous variables? If so, how?
Thank you for your kind help.
Hi,
Continuous variables can also be handled by BioNERO in the same way categorical or ordinal variables are; you add them as columns in the colData slot of your SummarizedExperiment object.
The code below was extracted from the vignette, but I edited the code to create a simulated continuous variable named compound_content:
set.seed(123)
suppressPackageStartupMessages({
    library(BioNERO)
    library(SummarizedExperiment)
})
data(zma.se)
final_exp <- exp_preprocess(
    zma.se, min_exp = 10, variance_filter = TRUE, n = 2000
)
#> Number of removed samples: 1
sft <- SFT_fit(final_exp, net_type = "signed hybrid", cor_method = "pearson")
#> Warning: executing %dopar% sequentially: no parallel backend registered
#> Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
#> 1 3 0.293000 0.27100 0.1180 384.0 386.0 689
#> 2 4 0.000141 -0.00465 -0.2750 290.0 272.0 584
#> 3 5 0.210000 -0.20100 0.0542 227.0 202.0 509
#> 4 6 0.427000 -0.35900 0.2990 184.0 155.0 452
#> 5 7 0.583000 -0.48400 0.4780 153.0 121.0 407
#> 6 8 0.665000 -0.58300 0.5720 129.0 96.0 370
#> 7 9 0.697000 -0.66500 0.6110 111.0 77.8 339
#> 8 10 0.786000 -0.71800 0.7260 95.8 64.1 313
#> 9 11 0.787000 -0.77600 0.7310 83.8 53.4 290
#> 10 12 0.821000 -0.82800 0.7810 73.9 44.7 270
#> 11 13 0.857000 -0.86700 0.8290 65.6 37.5 252
#> 12 14 0.884000 -0.89500 0.8660 58.6 31.5 236
#> 13 15 0.890000 -0.91400 0.8710 52.7 26.7 221
#> 14 16 0.884000 -0.93900 0.8630 47.6 22.9 208
#> 15 17 0.886000 -0.96300 0.8630 43.1 19.7 196
#> 16 18 0.896000 -0.97500 0.8740 39.2 17.0 185
#> 17 19 0.905000 -0.98400 0.8840 35.8 14.8 175
#> 18 20 0.914000 -0.99300 0.8930 32.8 12.8 166
net <- exp2gcn(
    final_exp, net_type = "signed hybrid", SFTpower = sft$power,
    cor_method = "pearson"
)
#> ..connectivity..
#> ..matrix multiplication (system BLAS)..
#> ..normalization..
#> ..done.
# Add a fake continuous variable in the colData slot
final_exp$compound_content <- rnorm(ncol(final_exp), 20, 2)
colData(final_exp)
#> DataFrame with 27 rows and 2 columns
#> Tissue compound_content
#> <character> <numeric>
#> SRX339756 endosperm 22.3282
#> SRX339757 endosperm 19.6960
#> SRX339758 endosperm 25.0386
#> SRX339762 endosperm 18.5401
#> SRX339764 endosperm 24.2687
#> ... ... ...
#> SRX2792107 whole_seedling 20.3020
#> SRX2792108 whole_seedling 15.3818
#> SRX2792102 whole_seedling 18.0599
#> SRX2792103 whole_seedling 18.7434
#> SRX2792104 whole_seedling 20.6909
me_trait <- module_trait_cor(exp = final_exp, MEs = net$MEs)
me_trait
#> ME trait cor pvalue group
#> 1 MEblue endosperm 0.48007104 0.0112681646 Tissue
#> 2 MEblue pollen 0.30284517 0.1246651473 Tissue
#> 3 MEblue whole_seedling -0.66012782 0.0001791887 Tissue
#> 4 MEcyan endosperm 0.17353097 0.3866980847 Tissue
#> 5 MEcyan pollen 0.24563232 0.2168395572 Tissue
#> 6 MEcyan whole_seedling -0.34505954 0.0779446760 Tissue
#> 7 MEgrey endosperm 0.47614330 0.0120525998 Tissue
#> 8 MEgrey pollen 0.22461961 0.2599972506 Tissue
#> 9 MEgrey whole_seedling -0.59551215 0.0010486148 Tissue
#> 10 MEmidnightblue endosperm -0.15025972 0.4544057353 Tissue
#> 11 MEmidnightblue pollen 0.01734256 0.9315803304 Tissue
#> 12 MEmidnightblue whole_seedling 0.11895932 0.5545294876 Tissue
#> 13 MEpurple endosperm 0.34624222 0.0768633212 Tissue
#> 14 MEpurple pollen 0.15124540 0.4514169837 Tissue
#> 15 MEpurple whole_seedling -0.42359091 0.0276840445 Tissue
#> 16 MEred endosperm 0.04884068 0.8088429045 Tissue
#> 17 MEred pollen 0.11272883 0.5755964071 Tissue
#> 18 MEred whole_seedling -0.13119761 0.5142119592 Tissue
#> 19 MEsalmon endosperm 0.26568876 0.1804225913 Tissue
#> 20 MEsalmon pollen 0.24020439 0.2274917146 Tissue
#> 21 MEsalmon whole_seedling -0.42209187 0.0282986267 Tissue
#> 22 MEblue compound_content 0.07975065 0.6925339219 compound_content
#> 23 MEcyan compound_content 0.20683423 0.3006063556 compound_content
#> 24 MEgrey compound_content 0.04589558 0.8201785431 compound_content
#> 25 MEmidnightblue compound_content -0.26328857 0.1845399040 compound_content
#> 26 MEpurple compound_content 0.33186280 0.0908131856 compound_content
#> 27 MEred compound_content 0.32037601 0.1032658856 compound_content
#> 28 MEsalmon compound_content 0.11437710 0.5699887999 compound_content
As you can see, BioNERO automatically recognizes that the variable compound_content is continuous, so it calculates ME-variable correlations accordingly.
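Conceptually, correlating a module eigengene with a continuous variable is an ordinary correlation between two numeric vectors. The base R sketch below illustrates the idea with simulated data; it is not BioNERO's internal code, and `module_eigengene` is a hypothetical stand-in for one column of net$MEs.

```r
set.seed(123)
n_samples <- 27

# Hypothetical stand-ins: one module eigengene and one continuous trait
module_eigengene <- rnorm(n_samples)
compound_content <- rnorm(n_samples, mean = 20, sd = 2)

# Pearson correlation and its p-value for the eigengene-trait pair
res <- cor.test(module_eigengene, compound_content, method = "pearson")
res$estimate
res$p.value
```

With real data, each row of the module_trait_cor() output for a continuous variable can be read the same way: one correlation coefficient and one p-value per module.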
Does this solve your issue?
Created on 2024-03-14 with reprex v2.1.0
I successfully solved my problem under your guidance, thank you very much!
Great to know it worked for you! I'll close the issue, then.
Thank you for using BioNERO. ;-)
Best,
Fabricio