load_rcc doesn't recognize subfolder
ChadAHighfill opened this issue · 7 comments
I am trying to use NACHO on our NanoString data, and I am having extreme trouble reading in the RCC files. Our data structure is as follows: I have a master folder, RCC, and a subfolder for each seq run. Each subfolder has 12 samples, the number on each cartridge flow cell. I want load_rcc to loop through each subfolder. However, list.files() does not work with load_rcc: it returns warnings saying that load_rcc cannot find the file paths. When I put all RCC files into one folder, load_rcc works fine. Any help would be appreciated.
NACHO::load_rcc requires a (single) directory path to the folder in which RCC files can be found.
Following the NACHO vignettes (e.g., https://m.canouil.fr/NACHO/articles/NACHO-analysis.html), here is an example with subfolders:
library(dplyr)
library(tidyr)
library(tibble)
library(NACHO)
library(GEOquery)
gse <- getGEO("GSE70970")
targets <- pData(phenoData(gse[[1]]))
getGEOSuppFiles(GEO = "GSE70970", baseDir = tempdir())
data_directory1 <- file.path(tempdir(), "GSE70970", "data", "dir1")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory1
)
data_directory2 <- file.path(tempdir(), "GSE70970", "data", "dir2")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory2
)
targets <- rbind(targets, targets)
data_directory <- file.path(tempdir(), "GSE70970", "data")
targets$IDFILE <- list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
GSE70970 <- load_rcc(data_directory, targets, id_colname = "IDFILE")
#> [NACHO] Importing RCC files.
#>
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#> $ access : character
#> $ housekeeping_genes : character
#> $ housekeeping_predict: logical
#> $ housekeeping_norm : logical
#> $ normalisation_method: character
#> $ remove_outliers : logical
#> $ n_comp : numeric
#> $ data_directory : character
#> $ pc_sum : data.frame
#> $ nacho : data.frame
#> $ outliers_thresholds : list
I still can't get this to work:
# Get all relevant files
the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)
[1] "20211130_209524441022_RCC/20211130_209524441022_SC718373_01.RCC"
[2] "20211130_209524441022_RCC/20211130_209524441022_SC718374_04.RCC"
[3] "20211130_209524441022_RCC/20211130_209524441022_SC718375_07.RCC"
[4] "20211130_209524441022_RCC/20211130_209524441022_SC794160_10.RCC"
[5] "20211130_209524441022_RCC/20211130_209524441022_SC794164_02.RCC"
list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/Desktop/IDv1.csv",
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
  housekeeping_predict = TRUE,
)
[NACHO] Importing RCC files.
Error: Must extract column with a single valid subscript.
x Subscript `id_colname` has size 0 but must be size 1.
Run `rlang::last_error()` to see where the error occurred.
I simply want to utilize this useful package and loop through all the subfolders and read into RCC.
Currently, your code has no chance of working because it does not follow any of the load_rcc requirements. Please have a look at the documentation (https://m.canouil.fr/NACHO/reference/load_rcc.html) and its example.
rcc <- load_rcc(
  data_directory = the_files, # this should be a directory, not a list of files
  ssheet_csv = "PATH/Desktop/IDv1.csv", # this should contain a column with RCC filenames (and possibly the subdirectory, as in my examples)
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE), # this should be a column name of "ssheet_csv", not a list of files
  housekeeping_predict = TRUE
)
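The three annotations above can be checked before load_rcc is ever called. The following is a minimal sketch in base R, not part of NACHO: the helper name check_load_rcc_inputs is hypothetical, and it assumes the sample sheet is either a data frame or a path to a CSV file.

```r
# Hypothetical helper (not part of NACHO): validates the inputs described in
# the annotations above before they are handed to load_rcc().
check_load_rcc_inputs <- function(data_directory, ssheet, id_colname) {
  # data_directory must be ONE path pointing at an existing directory.
  stopifnot(
    is.character(data_directory),
    length(data_directory) == 1L,
    dir.exists(data_directory)
  )
  # Accept a data frame or a path to a CSV file (assumption for this sketch).
  if (is.character(ssheet)) {
    ssheet <- utils::read.csv(ssheet, stringsAsFactors = FALSE)
  }
  stopifnot(is.data.frame(ssheet))
  # id_colname must be ONE column name present in the sample sheet.
  stopifnot(
    is.character(id_colname),
    length(id_colname) == 1L,
    id_colname %in% colnames(ssheet)
  )
  # Every entry of that column must resolve to a file under data_directory.
  rcc_paths <- file.path(data_directory, ssheet[[id_colname]])
  missing_files <- ssheet[[id_colname]][!file.exists(rcc_paths)]
  if (length(missing_files) > 0L) {
    stop("RCC files not found: ", paste(missing_files, collapse = ", "))
  }
  invisible(TRUE)
}
```

Called with the_files as data_directory, this sketch stops at the length check; called with a file listing as id_colname, it stops at the column-name check.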
Hi,
When all the data is in a single directory, my code works. However, it is difficult to parse this back out, so I will be dropping this. Thanks so much for the input.
install.packages("NACHO")
library("NACHO")
setwd("PATH" )
# keep for now
rcc <- load_rcc(
  data_directory = "PATH",
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE,
)
nacho_norm <- normalize(
  nacho_object = rcc,
  remove_outliers = TRUE
)
I was trying to back this out using the limited documentation....
# Define from and to dirs, and the file pattern
from_dir <- "PATH"
to_dir <- "PATH1"
pattern <- ".RCC"
# Get all relevant files
the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)
rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE,
)
The issue is that, for some reason, the IDFILE column, which has all the names, finds the RCC files when they are in a single folder, but not when they are given in list.files format. I don't understand why.
The way you are using NACHO is not at all the intended way, so it can lead to unexpected results, even if you managed to make it work like that (it only works because of a "lucky" side effect).
The documentation is quite clear (I think) about the expected value and type of each argument.
data_directory is the parent directory, which can include (as in my example before) multiple directories with RCC files.
/GSE70970/data
+-- dir1
|   +-- GSM1824143_NPC-T-1.RCC.gz
|   +-- GSM1824144_NPC-T-10.RCC.gz
|   +-- GSM1824145_NPC-T-100.RCC.gz
|   +-- ...
|   \-- GSM1824405_NP-V-N9.RCC.gz
\-- dir2
    +-- GSM1824143_NPC-T-1.RCC.gz
    +-- GSM1824144_NPC-T-10.RCC.gz
    +-- GSM1824145_NPC-T-100.RCC.gz
    +-- ...
    \-- GSM1824405_NP-V-N9.RCC.gz
Then, build the sample sheet with the "IDFILE" column, whose name is what gets provided to the "id_colname" argument.
Here, the IDFILE includes the sub-folders as well.
targets[c(1:5, 264:269), c(1:2, ncol(targets))]
#> title geo_accession IDFILE
#> GSM1824143 NPC-Training Set-1 GSM1824143 dir1/GSM1824143_NPC-T-1.RCC.gz
#> GSM1824144 NPC-Training Set-10 GSM1824144 dir1/GSM1824144_NPC-T-10.RCC.gz
#> GSM1824145 NPC-Training Set-100 GSM1824145 dir1/GSM1824145_NPC-T-100.RCC.gz
#> GSM1824146 NPC-Training Set-101 GSM1824146 dir1/GSM1824146_NPC-T-101.RCC.gz
#> GSM1824147 NPC-Training Set-102 GSM1824147 dir1/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241431 NPC-Training Set-1 GSM1824143 dir2/GSM1824143_NPC-T-1.RCC.gz
#> GSM18241441 NPC-Training Set-10 GSM1824144 dir2/GSM1824144_NPC-T-10.RCC.gz
#> GSM18241451 NPC-Training Set-100 GSM1824145 dir2/GSM1824145_NPC-T-100.RCC.gz
#> GSM18241461 NPC-Training Set-101 GSM1824146 dir2/GSM1824146_NPC-T-101.RCC.gz
#> GSM18241471 NPC-Training Set-102 GSM1824147 dir2/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241481 NPC-Training Set-103 GSM1824148 dir2/GSM1824148_NPC-T-103.RCC.gz
It will work exactly the same way in a "for" loop to go through directories.
for (idir in c("dir1", "dir2")) {
  targets_subdir <- targets[dirname(targets[["IDFILE"]]) %in% idir, ]
  targets_subdir[["IDFILE_nodir"]] <- basename(targets_subdir[["IDFILE"]])
  assign(
    x = idir,
    value = load_rcc(file.path(data_directory, idir), targets_subdir, id_colname = "IDFILE_nodir")
  )
}
#> [NACHO] Importing RCC files.
#>
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#> $ access : character
#> $ housekeeping_genes : character
#> $ housekeeping_predict: logical
#> $ housekeeping_norm : logical
#> $ normalisation_method: character
#> $ remove_outliers : logical
#> $ n_comp : numeric
#> $ data_directory : character
#> $ pc_sum : data.frame
#> $ nacho : data.frame
#> $ outliers_thresholds : list
#> [NACHO] Importing RCC files.
#>
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#> $ access : character
#> $ housekeeping_genes : character
#> $ housekeeping_predict: logical
#> $ housekeeping_norm : logical
#> $ normalisation_method: character
#> $ remove_outliers : logical
#> $ n_comp : numeric
#> $ data_directory : character
#> $ pc_sum : data.frame
#> $ nacho : data.frame
#> $ outliers_thresholds : list
dir1
#> List of 11
#> $ access : chr "IDFILE_nodir"
#> $ housekeeping_genes : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#> $ housekeeping_predict: logi FALSE
#> $ housekeeping_norm : logi TRUE
#> $ normalisation_method: chr "GEO"
#> $ remove_outliers : logi FALSE
#> $ n_comp : num 10
#> $ data_directory : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir1"
#> $ pc_sum :'data.frame': 10 obs. of 4 variables:
#> $ nacho :'data.frame': 198170 obs. of 119 variables:
#> $ outliers_thresholds :List of 6
#> - attr(*, "RCC_type")= chr "n1"
#> - attr(*, "class")= chr "nacho"
dir2
#> List of 11
#> $ access : chr "IDFILE_nodir"
#> $ housekeeping_genes : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#> $ housekeeping_predict: logi FALSE
#> $ housekeeping_norm : logi TRUE
#> $ normalisation_method: chr "GEO"
#> $ remove_outliers : logi FALSE
#> $ n_comp : num 10
#> $ data_directory : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir2"
#> $ pc_sum :'data.frame': 10 obs. of 4 variables:
#> $ nacho :'data.frame': 198170 obs. of 119 variables:
#> $ outliers_thresholds :List of 6
#> - attr(*, "RCC_type")= chr "n1"
#> - attr(*, "class")= chr "nacho"
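As an alternative to assign(), the per-directory sample sheets can be collected in a named list and loaded with lapply(). This is only a sketch under assumptions: split_by_subdir is a hypothetical helper, the toy sample sheet is made up, and the load_rcc call is shown as a comment that mirrors the loop above rather than executed.

```r
# Hypothetical helper: split a sample sheet by the sub-directory found in the
# id column, and add a column holding the bare file name.
split_by_subdir <- function(ssheet, id_colname = "IDFILE") {
  subdirs <- dirname(ssheet[[id_colname]])
  ssheet[[paste0(id_colname, "_nodir")]] <- basename(ssheet[[id_colname]])
  split(ssheet, subdirs)
}

# Toy sample sheet with the same layout as `targets` above (made-up entries).
toy_targets <- data.frame(
  IDFILE = c("dir1/a.RCC", "dir1/b.RCC", "dir2/a.RCC"),
  stringsAsFactors = FALSE
)
per_dir <- split_by_subdir(toy_targets)
names(per_dir)

# With NACHO installed and real RCC files, each element could be loaded with:
# results <- lapply(names(per_dir), function(idir) {
#   NACHO::load_rcc(
#     data_directory = file.path(data_directory, idir),
#     ssheet_csv = per_dir[[idir]],
#     id_colname = "IDFILE_nodir"
#   )
# })
# names(results) <- names(per_dir)
```

Keeping the results in one list avoids creating one global variable per folder, which may be easier to manage when there are many seq runs.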
To summarise, I recommend using the documented approach; otherwise I cannot guarantee that the behaviour NACHO exhibits is the intended (and correct) one.
I am not at all confident that the results you get using file paths instead of a directory path are correct.
Hi Mcanouil,
I think there might be some confusion between us. The initial way is the way the documentation states. Regardless, our group likes the plots coming off autoplot! I will try to loop this as suggested.
Hmm, I do not see where in the documentation load_rcc uses files instead of a directory for the data_directory parameter.
Can you tell me where you saw that? Are you using the latest version?
In your first code examples (and after), we can see that you used files, not a directory.
So, I do not see where is the confusion on my side.
data_directory is a character vector of length one giving the path to a directory.
the_files in your case is a character vector of length strictly greater than one giving the paths to RCC files, and thus an incorrect input for load_rcc.
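One way to reconcile the two, sketched under assumptions: keep the parent folder as the single data_directory, and rewrite the id column of the sample sheet so that each bare file name becomes a path relative to that folder. The helper add_subdir_to_ids is hypothetical (not part of NACHO), and it assumes that RCC file names are unique across sub-folders.

```r
# Hypothetical helper: prefix each bare file name in the id column with the
# sub-folder in which the file is found under data_directory.
add_subdir_to_ids <- function(ssheet, data_directory, id_colname = "IDFILE") {
  # Paths relative to data_directory, e.g. "run1/sample_01.RCC".
  rel_paths <- list.files(
    data_directory,
    pattern = "\\.RCC$|\\.RCC\\.gz$",
    recursive = TRUE
  )
  # The lookup below is only safe if bare file names are unique (assumption).
  if (anyDuplicated(basename(rel_paths)) > 0L) {
    stop("Duplicated RCC file names across sub-folders.")
  }
  idx <- match(basename(ssheet[[id_colname]]), basename(rel_paths))
  if (anyNA(idx)) {
    stop("Some sample sheet entries have no matching RCC file.")
  }
  ssheet[[id_colname]] <- rel_paths[idx]
  ssheet
}
```

The returned data frame could then be passed as ssheet_csv, with the parent folder as data_directory and the column name (for instance "IDFILE") as id_colname.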
For more help, try to make a small reproducible example using, for instance, the {reprex} R package, and/or show your directory tree structure, maybe with fs::dir_tree.
Based on your different inputs, my best guess at simple working code is:
library("NACHO")
data_directory <- "PATH"
ssheet_df <- data.frame(
  sample_label = list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
)
load_rcc(
  data_directory = data_directory,
  ssheet_csv = ssheet_df,
  id_colname = "sample_label"
)
The code above will import all RCC files found within data_directory recursively.
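To preview which files such a call would pick up, the sample sheet can be built and inspected on its own before NACHO is involved. Here is a small sketch using a throwaway directory; the folder and file names are made up for illustration.

```r
# Build a throwaway directory tree with two run folders (made-up names).
demo_dir <- file.path(tempdir(), "rcc_demo")
for (run in c("run1", "run2")) {
  dir.create(file.path(demo_dir, run), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(demo_dir, run, paste0(run, "_sample_01.RCC")))
}
# A non-RCC file that the pattern should ignore.
file.create(file.path(demo_dir, "run1", "notes.txt"))

# Same construction as above: the listed paths are relative to demo_dir and
# keep the sub-folder prefix.
ssheet_df <- data.frame(
  sample_label = list.files(demo_dir, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
  stringsAsFactors = FALSE
)
print(ssheet_df)

# Each entry resolves to an existing file once joined with the parent directory.
all(file.exists(file.path(demo_dir, ssheet_df$sample_label)))
```

Because list.files() with recursive = TRUE returns paths relative to the directory it is given, the sub-folder is already part of each sample_label, which is the form the sample sheet needs.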