ncats/cellranger-snakemake

Add filtration of various 'noisy' genes

Closed this issue · 2 comments

There is a list of housekeeping genes, pseudogenes, etc. that should be removed from these kinds of analyses. Import that list and remove those genes from the analysis, probably after the conversion from ensg.genes to gene symbols.
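As a rough illustration of the filtering step (not the repo's actual R code), here is a minimal Python sketch. It assumes the list is a plain-text file with one gene symbol per line and that each record already carries a gene symbol; the names `load_gene_list`, `filter_noisy_genes`, and the `gene_symbol` key are hypothetical.

```python
import io


def load_gene_list(handle):
    """Read one gene symbol per line, skipping blank lines and '#' comments."""
    symbols = set()
    for line in handle:
        symbol = line.strip()
        if symbol and not symbol.startswith("#"):
            symbols.add(symbol)
    return symbols


def filter_noisy_genes(rows, noisy, symbol_key="gene_symbol"):
    """Keep only the rows whose gene symbol is not in the noisy set."""
    return [row for row in rows if row[symbol_key] not in noisy]


# Hypothetical inputs: a two-gene filter list and three already-mapped records.
noisy = load_gene_list(io.StringIO("# genes to drop\nACTB\n\nGAPDH\n"))
rows = [
    {"gene_symbol": "ACTB", "count": 120},
    {"gene_symbol": "CD3E", "count": 14},
    {"gene_symbol": "GAPDH", "count": 98},
]
kept = filter_noisy_genes(rows, noisy)
```

Filtering on symbols only works once the Ensembl IDs have been mapped, which is why the step belongs after the conversion.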

Need to find out how to refer to a file in a script's BinDir... or even how to find a script's BinDir in R. In my searches I'm seeing a plethora of circumstance-specific hacks (for example, "from within RStudio except on Windows" or "not in RStudio but not on Linux") but no definitive method.
I would rather not have to ask the user (or wrapper script, as in the pipeline case) to:

  • Copy a file into the workdir
  • Always invoke with a parameter pointing to the default list
  • Edit the path to the default list within the Rscript the first time this repo is cloned

But so far I don't see the ideal solution, which would be a native-language solution for referring to a script's location on the file system from within the script. The first two options could be ok for runs from within a pipeline, but the default location of the list of genes to filter would still need to be hand-edited into the pipeline script(s) prior to the first invocation.

I might have to turn this into a package in order to do that in R... I think I saw somewhere that a script within a package could refer to locations within that package. That is, an R script could reference relative paths from within a package directory.

Of course, if this isn't widely distributed it's probably ok to hard-code a path... except I don't know of a path that would be accessible from everywhere... other than making a path relative to the bindir... sigh...

Solved by having the pipeline wrapper copy the file into the workdir during execution. A side benefit is that each run keeps a copy of the exact files it used (the EnsemblID->Gene Symbol mapping file is copied as well). The files aren't large, so this isn't a burden on either processing time or resources.
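A minimal sketch of what that copy step could look like on the wrapper side, assuming a Python wrapper; this is not the repo's actual code. The function name `stage_reference_files` and the file names `genes_to_filter.txt` and `ensembl_to_symbol.tsv` are hypothetical placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def stage_reference_files(source_dir, workdir, filenames):
    """Copy each named reference file from source_dir into workdir.

    Returns the list of destination paths. Raises FileNotFoundError if a
    source file is missing, so a run never starts without its inputs.
    """
    source_dir = Path(source_dir)
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    staged = []
    for name in filenames:
        src = source_dir / name
        if not src.is_file():
            raise FileNotFoundError(f"reference file not found: {src}")
        dst = workdir / name
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        staged.append(dst)
    return staged


# Demo with temporary directories standing in for the repo and the workdir.
# The file names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    repo.mkdir()
    (repo / "genes_to_filter.txt").write_text("ACTB\nGAPDH\n")
    (repo / "ensembl_to_symbol.tsv").write_text("ENSG00000075624\tACTB\n")
    staged = stage_reference_files(
        repo,
        Path(tmp) / "workdir",
        ["genes_to_filter.txt", "ensembl_to_symbol.tsv"],
    )
    staged_names = [path.name for path in staged]
    staged_text = staged[0].read_text()
```

In a real wrapper, `source_dir` could be derived from the wrapper's own location, for example `Path(__file__).resolve().parent`, so the R script only ever has to look in its working directory.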