Error when Using ROC test for Marker Genes

Question

Error when Using ROC test for Marker Genes

Opened this issue 5 years ago · 1 comments

Thanks again for the great app! It's truly useful.

I'm having a issue when trying to use ROC test to get marker genes:

Error: Can't subset columns that don't exist.
x The column `p_val` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.

WIth rlang::last_error():


<error/vctrs_error_subscript_oob>
Can't subset columns that don't exist.
x The column `p_val` doesn't exist.
Backtrace:
  1. cerebroApp::getMarkerGenes(...)
 32. vctrs:::stop_subscript_oob(...)
 33. vctrs:::stop_subscript(...)
Run `rlang::last_trace()` to see the full context.

With rlang::last_trace():

1. ├─cerebroApp::getMarkerGenes(...)
 2. │ └─`%>%`(...)
 3. │   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 4. │   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
 5. │     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
 6. │       └─cerebroApp:::`_fseq`(`_lhs`)
 7. │         └─magrittr::freduce(value, `_function_list`)
 8. │           └─function_list[[i]](value)
 9. │             ├─dplyr::select(...)
10. │             └─dplyr:::select.data.frame(...)
11. │               └─tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
12. │                 └─tidyselect:::eval_select_impl(...)
13. │                   ├─tidyselect:::with_subscript_errors(...)
14. │                   │ ├─base::tryCatch(...)
15. │                   │ │ └─base:::tryCatchList(expr, classes, parentenv, handlers)
16. │                   │ │   └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
17. │                   │ │     └─base:::doTryCatch(return(expr), name, parentenv, handler)
18. │                   │ └─tidyselect:::instrument_base_errors(expr)
19. │                   │   └─base::withCallingHandlers(...)
20. │                   └─tidyselect:::vars_select_eval(...)
21. │                     └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
22. │                       └─tidyselect:::eval_c(expr, data_mask, context_mask)
23. │                         └─tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
24. │                           └─tidyselect:::walk_data_tree(new, data_mask, context_mask)
25. │                             └─tidyselect:::eval_c(expr, data_mask, context_mask)
26. │                               └─tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
27. │                                 └─tidyselect:::walk_data_tree(new, data_mask, context_mask)
28. │                                   └─tidyselect:::as_indices_sel_impl(...)
29. │                                     └─tidyselect:::as_indices_impl(x, vars, strict = strict)
30. │                                       └─tidyselect:::chr_as_locations(x, vars)
31. │                                         └─vctrs::vec_as_location(x, n = length(vars), names = vars)
32. └─vctrs:::stop_subscript_oob(...)
33.   └─vctrs:::stop_subscript(...)

I think this arises as new names are given for the p_value column, and getMarkerGenes still tries to make a table with the original names Seurat outputs when using Wilcox.

I want to use the ROC test because I'm actually looking for specific and exclusive cell markers, instead of differentially expressed genes, and ROC performs better for that purpose when compared to Wilcox. This would really make a difference in our analysis.

Any further insights? How can we fix this?

Answer 1 · 2020-03-12T15:48:19.000Z

Hi @davisidarta!

You are bringing up another (current) limitation of Cerebro :) And you suspect correctly that a very rigid structure for tables (e.g. marker genes) is expected. I have been thinking about a generalised structure for these types of results for a while, and in fact also sent out an email to the Bioconductor developer mailing list to get their feedback. It looks like I'll have to implement my own method for processing the output of different DGE tools for visualization in Cerebro. Not sure when I'll be able to put it into releasable form though.

A short term workaround would be to rename columns so that they meet the current format (and add fake columns if there aren't enough. At the moment, the format looks like this:

# A tibble: 1,181 x 8
   cluster gene          p_val avg_logFC pct.1 pct.2 p_val_adj on_cell_surface
   <fct>   <chr>         <dbl>     <dbl> <dbl> <dbl>     <dbl> <lgl>          
 1 1       S100A12    1.22e-88      2.87 0.993 0.106  1.34e-84 FALSE          
 2 1       AC020656.1 2.04e-86      1.79 0.961 0.075  2.24e-82 FALSE          
 3 1       CD14       1.25e-85      1.94 0.948 0.06   1.37e-81 TRUE           
 4 1       VCAN       5.27e-85      2.47 0.987 0.121  5.77e-81 FALSE          
 5 1       FCN1       8.92e-83      2.27 1     0.161  9.77e-79 FALSE          
 6 1       CSTA       2.05e-82      1.64 0.974 0.089  2.25e-78 FALSE          
 7 1       MS4A6A     4.16e-82      1.86 0.954 0.089  4.56e-78 FALSE          
 8 1       MNDA       2.31e-81      2.15 1     0.19   2.53e-77 FALSE          
 9 1       FGL2       6.20e-76      1.69 0.974 0.141  6.78e-72 FALSE          
10 1       CSF3R      2.87e-74      1.38 0.908 0.095  3.15e-70 FALSE          
# … with 1,171 more rows

The on_cell_surface column is optional.

Do you think that could work?

Sorry that I don't have an easier solution this time.

Best,
Roman