mcanouil/NACHO

load_rcc doesn't recognize subfolder

ChadAHighfill opened this issue · 7 comments


I am trying to use NACHO on our NanoString data, and I am having a lot of trouble reading in the RCC files. Our data are structured as follows: a master folder, RCC, with one subfolder per seq run. Each subfolder has 12 samples, the number on each cartridge flow cell. I want load_rcc to loop through each subfolder. However, passing the output of list.files() to load_rcc does not work: it returns warnings saying that load_rcc cannot find the file paths. When I put all the RCC files into one folder, load_rcc works fine. Any help would be appreciated.

NACHO::load_rcc requires a (single) directory path to the folder in which RCC files can be found.

Following the NACHO vignettes (e.g., https://m.canouil.fr/NACHO/articles/NACHO-analysis.html), here is an example with subfolders:

library(dplyr)
library(tidyr)
library(tibble)
library(NACHO)
library(GEOquery)

gse <- getGEO("GSE70970")
targets <- pData(phenoData(gse[[1]]))
getGEOSuppFiles(GEO = "GSE70970", baseDir = tempdir())

data_directory1 <- file.path(tempdir(), "GSE70970", "data", "dir1")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory1
)
data_directory2 <- file.path(tempdir(), "GSE70970", "data", "dir2")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory2
)

targets <- rbind(targets, targets)

data_directory <- file.path(tempdir(), "GSE70970", "data")
targets$IDFILE <- list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)

GSE70970 <- load_rcc(data_directory, targets, id_colname = "IDFILE")
#> [NACHO] Importing RCC files.
#> 
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#>   $ access              : character
#>   $ housekeeping_genes  : character
#>   $ housekeeping_predict: logical
#>   $ housekeeping_norm   : logical
#>   $ normalisation_method: character
#>   $ remove_outliers     : logical
#>   $ n_comp              : numeric
#>   $ data_directory      : character
#>   $ pc_sum              : data.frame
#>   $ nacho               : data.frame
#>   $ outliers_thresholds : list
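
As a small base R illustration of why the IDFILE column above lines up with data_directory: list.files() with recursive = TRUE returns paths relative to the directory it was given, subfolder included. The folder and file names below are made up for the sketch and are empty placeholder files, not real RCC data.

```r
# Throwaway layout: <root>/dir1/a.RCC and <root>/dir2/b.RCC.gz (hypothetical names)
root <- tempfile("rcc_demo")
dir.create(file.path(root, "dir1"), recursive = TRUE)
dir.create(file.path(root, "dir2"), recursive = TRUE)
file.create(file.path(root, "dir1", "a.RCC"))
file.create(file.path(root, "dir2", "b.RCC.gz"))

# recursive = TRUE gives paths relative to `root`, with the subfolder as prefix
ids <- list.files(root, pattern = "\\.RCC$|\\.RCC\\.gz$", recursive = TRUE)
ids

# Joined back onto the parent directory, every entry points at a real file
resolved <- file.exists(file.path(root, ids))
resolved
```

So a sample sheet column filled this way only makes sense together with the same parent directory passed as data_directory.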

Still can't get this to work:

# Get all relevant files

the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)

[1] "20211130_209524441022_RCC/20211130_209524441022_SC718373_01.RCC"
[2] "20211130_209524441022_RCC/20211130_209524441022_SC718374_04.RCC"
[3] "20211130_209524441022_RCC/20211130_209524441022_SC718375_07.RCC"
[4] "20211130_209524441022_RCC/20211130_209524441022_SC794160_10.RCC"
[5] "20211130_209524441022_RCC/20211130_209524441022_SC794164_02.RCC"

list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)

rcc <- load_rcc(
data_directory = the_files,
ssheet_csv = "PATH/Desktop/IDv1.csv",
id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
housekeeping_predict = TRUE,
)
[NACHO] Importing RCC files.
Error: Must extract column with a single valid subscript.
x Subscript id_colname has size 0 but must be size 1.
Run rlang::last_error() to see where the error occurred.

I simply want to use this useful package to loop through all the subfolders and read in the RCC files.

Currently, your code cannot work because it does not follow any of the load_rcc requirements. Please have a look at the documentation (https://m.canouil.fr/NACHO/reference/load_rcc.html) and its example.

rcc <- load_rcc(
  data_directory = the_files, # this should be a directory, not a list of files
  ssheet_csv = "PATH/Desktop/IDv1.csv", # this should contain a column with RCC filenames (and possibly the subdirectory, as in my examples)
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE), # this should be a column name of "ssheet_csv", not a list of files
  housekeeping_predict = TRUE
)
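
To make the three comments above concrete, here is a minimal sketch of a pre-flight check in base R. check_load_rcc_inputs is a hypothetical helper, not part of NACHO, and it assumes the sample sheet has already been read into a data frame.

```r
# Hypothetical helper (not a NACHO function): returns a character vector
# describing what is wrong with the inputs, or character(0) if all looks fine.
check_load_rcc_inputs <- function(data_directory, ssheet, id_colname) {
  problems <- character(0)
  if (!is.character(data_directory) || length(data_directory) != 1L ||
      !dir.exists(data_directory)) {
    problems <- c(problems, "data_directory must be one existing directory path")
  }
  if (!is.character(id_colname) || length(id_colname) != 1L ||
      !id_colname %in% names(ssheet)) {
    problems <- c(problems, "id_colname must be one column name of the sample sheet")
  }
  # Only look for the files once both of the inputs above are usable
  if (length(problems) == 0L) {
    not_found <- !file.exists(file.path(data_directory, ssheet[[id_colname]]))
    if (any(not_found)) {
      problems <- c(
        problems,
        sprintf("%d file(s) listed in '%s' were not found", sum(not_found), id_colname)
      )
    }
  }
  problems
}

# Made-up layout with one empty placeholder file
root <- tempfile("rcc_check")
dir.create(file.path(root, "run1"), recursive = TRUE)
file.create(file.path(root, "run1", "s1.RCC"))
sheet <- data.frame(IDFILE = "run1/s1.RCC", stringsAsFactors = FALSE)

ok <- check_load_rcc_inputs(root, sheet, "IDFILE")
bad_dir <- check_load_rcc_inputs(c("run1/s1.RCC", "run1/s2.RCC"), sheet, "IDFILE")
bad_id <- check_load_rcc_inputs(root, sheet, character(0))
```

The last two calls mirror the mistakes in the code above: a vector of files given as data_directory, and an id_colname that is not a single column name.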

Hi,

When all the data are in a single directory, my code works. However, it is difficult to work this back out for subfolders, so I will be dropping this. Thanks so much for the input.

install.packages("NACHO")
library("NACHO")

setwd("PATH")

# keep for now

rcc <- load_rcc(
  data_directory = "PATH",
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE
)

nacho_norm <- normalize(
  nacho_object = rcc,
  remove_outliers = TRUE
)

I was trying to back this out using the limited documentation....

# Define from and to dirs, and the file pattern

from_dir <- "PATH"
to_dir <- "PATH1"
pattern <- ".RCC"

# Get all relevant files

the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)

rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE
)

The issue is that, for some reason, the IDFILE column, which holds all the names, matches the RCC files when they are in a single folder, but not when the paths come from list.files(). I don't understand why.

The way you are using NACHO is not at all the intended way, so it can lead to unexpected results. It only works like that because of a "lucky" side effect.

The documentation is quite clear (I think) about the expected value and type of each argument.

data_directory is the parent directory, which can include (as in my example before) multiple directories with RCC files.

/GSE70970/data
+-- dir1
|   +-- GSM1824143_NPC-T-1.RCC.gz
|   +-- GSM1824144_NPC-T-10.RCC.gz
|   +-- GSM1824145_NPC-T-100.RCC.gz
|   +-- ...
|   \-- GSM1824405_NP-V-N9.RCC.gz
\-- dir2
    +-- GSM1824143_NPC-T-1.RCC.gz
    +-- GSM1824144_NPC-T-10.RCC.gz
    +-- GSM1824145_NPC-T-100.RCC.gz
    +-- ...
    \-- GSM1824405_NP-V-N9.RCC.gz

Then, build the sample sheet with the "IDFILE" column, whose name is passed to the "id_colname" argument.
Here, IDFILE includes the subfolders as well.

targets[c(1:5, 264:269), c(1:2, ncol(targets))]
#>                            title geo_accession                           IDFILE
#> GSM1824143    NPC-Training Set-1    GSM1824143   dir1/GSM1824143_NPC-T-1.RCC.gz
#> GSM1824144   NPC-Training Set-10    GSM1824144  dir1/GSM1824144_NPC-T-10.RCC.gz
#> GSM1824145  NPC-Training Set-100    GSM1824145 dir1/GSM1824145_NPC-T-100.RCC.gz
#> GSM1824146  NPC-Training Set-101    GSM1824146 dir1/GSM1824146_NPC-T-101.RCC.gz
#> GSM1824147  NPC-Training Set-102    GSM1824147 dir1/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241431   NPC-Training Set-1    GSM1824143   dir2/GSM1824143_NPC-T-1.RCC.gz
#> GSM18241441  NPC-Training Set-10    GSM1824144  dir2/GSM1824144_NPC-T-10.RCC.gz
#> GSM18241451 NPC-Training Set-100    GSM1824145 dir2/GSM1824145_NPC-T-100.RCC.gz
#> GSM18241461 NPC-Training Set-101    GSM1824146 dir2/GSM1824146_NPC-T-101.RCC.gz
#> GSM18241471 NPC-Training Set-102    GSM1824147 dir2/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241481 NPC-Training Set-103    GSM1824148 dir2/GSM1824148_NPC-T-103.RCC.gz

It will work exactly the same way in a "for" loop to go through directories.

for (idir in c("dir1", "dir2")) {
  targets_subdir <- targets[dirname(targets[["IDFILE"]]) %in% idir, ]
  targets_subdir[["IDFILE_nodir"]] <- basename(targets_subdir[["IDFILE"]])
  assign(x = idir, value = load_rcc(file.path(data_directory, idir), targets_subdir, id_colname = "IDFILE_nodir"))
}
#> [NACHO] Importing RCC files.
#> 
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#>   $ access              : character
#>   $ housekeeping_genes  : character
#>   $ housekeeping_predict: logical
#>   $ housekeeping_norm   : logical
#>   $ normalisation_method: character
#>   $ remove_outliers     : logical
#>   $ n_comp              : numeric
#>   $ data_directory      : character
#>   $ pc_sum              : data.frame
#>   $ nacho               : data.frame
#>   $ outliers_thresholds : list
#> [NACHO] Importing RCC files.
#> 
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#>   $ access              : character
#>   $ housekeeping_genes  : character
#>   $ housekeeping_predict: logical
#>   $ housekeeping_norm   : logical
#>   $ normalisation_method: character
#>   $ remove_outliers     : logical
#>   $ n_comp              : numeric
#>   $ data_directory      : character
#>   $ pc_sum              : data.frame
#>   $ nacho               : data.frame
#>   $ outliers_thresholds : list
dir1
#> List of 11
#>  $ access              : chr "IDFILE_nodir"
#>  $ housekeeping_genes  : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#>  $ housekeeping_predict: logi FALSE
#>  $ housekeeping_norm   : logi TRUE
#>  $ normalisation_method: chr "GEO"
#>  $ remove_outliers     : logi FALSE
#>  $ n_comp              : num 10
#>  $ data_directory      : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir1"
#>  $ pc_sum              :'data.frame':   10 obs. of  4 variables:
#>  $ nacho               :'data.frame':   198170 obs. of  119 variables:
#>  $ outliers_thresholds :List of 6
#>  - attr(*, "RCC_type")= chr "n1"
#>  - attr(*, "class")= chr "nacho"
dir2
#> List of 11
#>  $ access              : chr "IDFILE_nodir"
#>  $ housekeeping_genes  : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#>  $ housekeeping_predict: logi FALSE
#>  $ housekeeping_norm   : logi TRUE
#>  $ normalisation_method: chr "GEO"
#>  $ remove_outliers     : logi FALSE
#>  $ n_comp              : num 10
#>  $ data_directory      : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir2"
#>  $ pc_sum              :'data.frame':   10 obs. of  4 variables:
#>  $ nacho               :'data.frame':   198170 obs. of  119 variables:
#>  $ outliers_thresholds :List of 6
#>  - attr(*, "RCC_type")= chr "n1"
#>  - attr(*, "class")= chr "nacho"
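
As a variation on the loop above, the per-directory sample sheets can also be kept in a named list instead of being created with assign(). This is only a sketch of the splitting step in base R; targets_demo and its file names are made up, and the load_rcc call is left commented out because it needs NACHO and real RCC files on disk.

```r
# Toy sample sheet standing in for `targets` (hypothetical file names)
targets_demo <- data.frame(
  IDFILE = c("dir1/s1.RCC", "dir1/s2.RCC", "dir2/s1.RCC"),
  stringsAsFactors = FALSE
)

# One data frame per subfolder, each with an ID column free of the subfolder
by_dir <- lapply(
  split(targets_demo, dirname(targets_demo[["IDFILE"]])),
  function(sheet) {
    sheet[["IDFILE_nodir"]] <- basename(sheet[["IDFILE"]])
    sheet
  }
)
names(by_dir)

# With NACHO attached and `data_directory` defined as above, each element
# could then be loaded in turn, giving one result per subfolder:
# nacho_list <- Map(
#   function(idir, sheet) {
#     load_rcc(file.path(data_directory, idir), sheet, id_colname = "IDFILE_nodir")
#   },
#   names(by_dir), by_dir
# )
```

A list keeps the results together and avoids creating one global variable per folder, which matters more as the number of runs grows.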

To summarise, I recommend using the documented approach; otherwise, I cannot guarantee that the behaviour NACHO exhibits is the intended (and correct) one.
I am not at all confident that the results you get using file paths instead of a directory path are correct.

Hi Mcanouil,

I think there might be some confusion between us. The initial way is the way the documentation states. Regardless, our group likes the plots coming from autoplot! I will try to loop this as suggested.

Hum, I do not see in the documentation where load_rcc uses files instead of a directory for the data_directory parameter.
Can you tell me where you saw that? Are you using the latest version?

In your first code examples (and after), we can see that you used files, not a directory.
So, I do not see where the confusion is on my side.
data_directory is a character vector of length one giving the path to a directory.
the_files in your case is a character vector of length strictly greater than one giving the paths to RCC files, and thus an incorrect input for load_rcc.

For more help, try to make a small reproducible example using, for instance, the {reprex} R package.
And/or show your directory tree structure, maybe with fs::dir_tree.

Based on your different inputs, if I try to guess and write simple working code, it should be:

library("NACHO")
data_directory <- "PATH"
ssheet_df <- data.frame(
  sample_label = list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
)
load_rcc(
  data_directory = data_directory, 
  ssheet_csv = ssheet_df, 
  id_colname = "sample_label"
)

The code above will import all RCC files found within data_directory recursively.
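
If a sample sheet already exists with bare file names (as when everything sat in one folder), one possible way to reuse it with subfolders is to look up each name among the files found recursively and store the relative path in a new column. This is a sketch under a loud assumption: file names are unique across subfolders. The folder names, file names, and the group and IDFILE_rel columns are made up for illustration.

```r
# Made-up layout: two run folders with one empty placeholder file each
root <- tempfile("rcc_runs")
dir.create(file.path(root, "runA"), recursive = TRUE)
dir.create(file.path(root, "runB"), recursive = TRUE)
file.create(file.path(root, "runA", "s1.RCC"), file.path(root, "runB", "s2.RCC"))

# Existing sheet listing bare file names, in any order
sheet <- data.frame(
  IDFILE = c("s2.RCC", "s1.RCC"),
  group = c("case", "control"),
  stringsAsFactors = FALSE
)

found <- list.files(root, pattern = "\\.RCC$|\\.RCC\\.gz$", recursive = TRUE)

# Assumption: no two subfolders contain a file with the same name
stopifnot(!anyDuplicated(basename(found)))

# Relative path (subfolder included) for each row; NA if a file is missing
sheet[["IDFILE_rel"]] <- found[match(sheet[["IDFILE"]], basename(found))]
sheet

# The sheet could then be passed on in the same way as ssheet_df above:
# load_rcc(data_directory = root, ssheet_csv = sheet, id_colname = "IDFILE_rel")
```

Rows left with NA in IDFILE_rel point at files that are in the sheet but not on disk, which is worth checking before loading.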