JenniNiku/gllvm

Data Preparation for Fourth Corner Analysis

Closed this issue · 4 comments

Hi,

Thank you so much for this amazing package. I am a microbiologist with very minimal knowledge of coding :0

It's been a while since I tried to run the fourth corner analysis, but I cannot. Hopefully, you can help me:

Problem: My data x, y, and TR are not properly created and I received different errors from error in cbind to having an empty TR in the fourth corner.

What I have done so far: I tried to create the X: 90X5, y: 90X90 and TR: 90X5. Transformed and rotated them so many times. Made sure the row are matched and it did not help.

My data: I am working with a very large amount of data. I have about 27000 traits (metabolic pathways). I picked only 5 pathways to just check the data but it didn't work. For my environmental data I only choose the numeric columns and used as.numeric function to make sure they are.

My data without any reduction:
This is my x:
data.env.new <- read_tsv("./samdf_complete.tsv",
col_types = cols_only(pH = col_double(),
Conductivity = col_double(), Temp = col_double(),
ORP =col_double(), SO4 = col_double()))
You can find the samdf_complete.tsv here without any manipulation (write it into csv to be compatible with Git):
samdf_complete.csv

This is my y: ( my relative abundance data. First column is the sample names, the rest are my ASVs)
data_for_gllvm.csv

This is my TR: (my metabolic pathways. First column is the metabolic pathways nemae. THe rest are my ASVs)
pathways_wider.csv

These are the pathways I wanted to test:
pathways_subset <- pathways_wider[ , c("PWY-5304" , "PWY-5345" ,
"PWY-6581", "METH-ACETATE-PWY" , "PWY-181"), ]

Also to choose the same ASVs for Y and TR I get the common cols by using this code:

Get the common column names

common_cols <- intersect(colnames(data_frame_for_gllvm), colnames(pathways_wider))

The code I used:

library(tidyverse)
data.env.new <- read_tsv("./samdf_complete.tsv",
                     col_types = cols_only(pH = col_double(),
                                           Conductivity = col_double(), Temp = col_double(),
                                           ORP =col_double(), SO4 = col_double()))


data.env.new <- as.data.frame(data.env.new)
data.env.new$pH <- as.numeric(data.env.new$pH)
data.env.new$Conductivity <- as.numeric(data.env.new$Conductivity)
data.env.new$Temp <- as.numeric(data.env.new$Temp)
data.env.new$ORP <- as.numeric(data.env.new$ORP)
data.env.new$SO4 <- as.numeric(data.env.new$SO4)


### Read in the pathway file
library(readr)

### read only the columns "function" and "taxon"
pathways <- read_tsv("./picrust2_out_pipeline/pathways_out/path_abun_contrib.tsv", 
                     col_types = cols_only("path" = col_character(), "taxon" = col_character(), "taxon_function_abun" = col_character()))

### Pivot the table 
pathways_wider <- data.frame(stringsAsFactors = FALSE)
for (row in 1:nrow(pathways)) {
  pathways_wider[pathways[row, ]$path, pathways[row, ]$taxon] <- pathways[row, ]$taxon_function_abun
}


pathways_wider <- as.data.frame(pathways_wider)


### subset the table to keep only specific row by their names
pathways_subset <- pathways_wider_subset[ , c("PWY-5304" , "PWY-5345" , 
                                "PWY-6581", "METH-ACETATE-PWY" , "PWY-181"), ]

pathways_subset <- t(pathways_subset)
pathways_subset <- as.data.frame(pathways_subset)

### so many combinations of subsetting and transformation (I tried so many times that I lost track of which came first: 
pathways_wider_1 <- pathways_wider_subset[, 1:5]
pathways_wider_1 <- t(pathways_wider_1)
pathways_wider_1 <- pathways_wider_1[,1:90]
pathways_wider_1 <- as.data.frame(pathways_wider_1)
data_frame_for_gllvm_1 <- t(data_frame_for_gllvm_1)
data_frame_for_gllvm_1 <- data_frame_for_gllvm_1 [,1:90]

### Get the common column names
common_cols <- intersect(colnames(data_frame_for_gllvm), colnames(pathways_wider))


data_frame_for_gllvm_1 <- data_frame_for_gllvm[, c(common_cols)]
data_frame_for_gllvm_1 <- t(data_frame_for_gllvm_1)


### Create a list containing your data frames
my_data.new <- list(rel = data_frame_for_gllvm_1, env.new = data.env.new, TR = pathways_wider_1)


fit_4th <- gllvm(y = my_data.new$rel, X = my_data.new$env.new, TR = my_data.new$TR,
                 formula = y ~ (pH + Conductivity + SO4 + Temp + ORP),
                   family = "negative.binomial", num.lv = 4, seed = 123, row.eff = "random", 
                 control.start =list(n.init = 3, jitter.var = 0.01))

coefplot(fit_4th, cex.ylab = 0.5, y.label = TRUE, mar = c(4, 2, 2, 1))

I hope I have provided enough information; if not, please let me know. I really appreciate your help.

Thank you,
Z.

Thanks for the code! I was unable to reproduce your issues though; you provide a file "pathways_wider", but it is not clear from the script where that should come in.

Generally, the easiest format for your data is "wide" so that the number of columns in y is the same as the number of rows in TR, and the number of rows in y is the same as the number of rows in X. The number of columns in X and TR are then "free".

If you need more help, please incorporate the pathway file in your code above so I that can reproduce your issues.

Thank you so much Bert. @BertvanderVeen

I ended up making a subset of pathway_wider called pathway_wider_1:

pathways_wider_1.csv

And this is the final code:

# Create a list containing your data frames
my_data.new <- list(rel = data_frame_for_gllvm_1, env.new = data.env.new, TR = pathways_wider_1)


fit_4th <- gllvm(y = my_data.new$rel, X = my_data.new$env.new, TR = my_data.new$TR,
                 formula = y ~ (pH + Conductivity + SO4 + Temp + ORP),
                 family = "negative.binomial", num.lv = 4, seed = 123, row.eff = "random", 
                 control.start =list(n.init = 3, jitter.var = 0.01))

I hope this help reproduce my issue.
Thank you,
z.

Not really - but your matrix for X looks fine. One potential issue with "pathways_wider_1.csv" from above is that it contains NAs, which is not allowed.

Thank you so much @BertvanderVeen . That was the problem. I removed the NAs and it worked.