R dataframe errors

Question

anadkarn opened this issue 4 years ago · 1 comment

I'm having an issue running GSECA on a dataset.
I’ve double-checked the input requirements and confirmed I should have no duplicate Ensembl IDs.

Additionally, I’ve attempted to take a subset of our input and swap the Ensembl IDs out with some from PRAD.ptenloss.M.tsv, the working example provided in the GSECA Git repo. This yields a different error.

In both scenarios, I'm using the corresponding gene expression matrix, case-control list and gmt file, with the same settings as presented in the README. While I'm unsure about the correctness of the gene expression matrix and the case-control list file, the gmt file has been used successfully with another dataset, so I believe it is formatted correctly.
For each case, I’ve created a zip file containing the inputs that produced that error message.
I’ve copied the error messages below. I would really appreciate any help you could provide in understanding how I need to modify the inputs so that I can run this example.
Thank you!

#######################################################################
Error Messages:
Case 1 (bigger gene expression matrix, with original ensembl ids):
[*] loading datasets ...
[*] GSECA running ...
=== FMM and DD ===
[*] NE threshold: 0.01
[*] Number of Expression Classes: 7
-- Parallelizing 2 cores ...
[*] FMM Getting thresholds ...
[] DD Setting expression classes ...
|======================================================================| 100%
Error in $<-.data.frame(*tmp*, "type", value = "CNTR") :
replacement has 1 row, data has 0
Calls: applyGSECA ... gene_class_representation -> mapply -> <Anonymous> -> $<- -> $<-.data.frame
Execution halted

case1inputs.zip

Case 2 (using Ensembl IDs from the dataset on GitHub, reduced in size as I was just attempting to get some portion of this example to run):
[*] GSECA running ...
=== FMM and DD ===
[*] NE threshold: 0.01
[*] Number of Expression Classes: 7
-- Parallelizing 2 cores ...
[*] FMM Getting thresholds ...
[*] DD Setting expression classes ...
|======================================================================| 100%
Error in [.default(t, , , "CASE") : subscript out of bounds
Calls: applyGSECA ... gene_class_representation -> as.data.frame -> [ -> [.table -> NextMethod
In addition: Warning message:
In get_mixture_expr_class(expr, ne_value = 0.01, nc = N.CORES) :
-- Excluding 2 samples not fitted by the mixture
Execution halted

case2inputs.zip

Answer 1 · 2020-08-13T09:18:23.000Z

Hi Annie,
Thanks for posting. The error comes from the format of the Ensembl gene IDs (ENSGs) in your data frame: they include a version suffix (i.e. "ensgXXX.version"). You should remove the version from each ENSG (i.e. "ensgXXX"). You can simply add this line to the code:

...
# Gene expression matrix & sample type
M = read.delim("HSV1_gene_expression_matrix.tsv")
L = read.delim("HSV_case_ctrl_list.tsv")[,1]

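# Strip the ".version" suffix from each Ensembl ID (as.character() guards against factor columns)
M$ensembl_gene_id = sapply(strsplit(as.character(M$ensembl_gene_id), '\\.'), '[[', 1)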

# Gene set list ....

Once the version suffixes are removed, GSECA will run properly on your inputs.
BW,
Matteo
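
The fix described in the answer can be tried on a small, self-contained example before touching the real inputs. The sketch below is not part of GSECA: the toy matrix, its IDs and values, and the variable names `has_version` and `dup_ids` are made up for illustration, and only the column name `ensembl_gene_id` is taken from the answer. It flags versioned IDs, strips the suffix, and then re-checks for duplicates, since two rows that differed only in their suffix would collapse to the same ID.

```r
# Toy gene expression matrix; IDs and values are illustrative only.
M <- data.frame(
  ensembl_gene_id = c("ENSG00000000003.14", "ENSG00000000005.5", "ENSG00000000419.12"),
  sample_1 = c(10.2, 0.0, 35.1),
  sample_2 = c(8.7, 0.3, 40.9),
  stringsAsFactors = FALSE
)

# Flag IDs that carry a ".version" suffix (a dot followed by digits at the end).
has_version <- grepl("\\.[0-9]+$", M$ensembl_gene_id)

# Strip everything from the first dot onward.
M$ensembl_gene_id <- sub("\\..*$", "", as.character(M$ensembl_gene_id))

# Stripping can turn previously distinct IDs into duplicates, so check again.
dup_ids <- unique(M$ensembl_gene_id[duplicated(M$ensembl_gene_id)])
if (length(dup_ids) > 0) {
  warning("Duplicate Ensembl IDs after stripping versions: ",
          paste(dup_ids, collapse = ", "))
}
```

If `dup_ids` is non-empty on real data, the duplicated rows would need to be resolved (for example by keeping one row per gene) before running the tool; how best to do that depends on the dataset.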