
Keys are shared for 2 rows


Brief description of the problem

When trying to import many samples from different runs, I get the following error message:

 nacho_data <- load_rcc(data_directory = params$path_rccs, 
+                        ssheet_csv = params$path_rcc_samplesheet,
+                        id_colname = "FILENAME")
[NACHO] Importing RCC files.
|=======================================================================================================================================================================================================================================================================================|100% ~0 s remaining     
[NACHO] Performing QC and formatting data.
[NACHO] Computing normalisation factors using "GEO" method.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 6, 10

I'd love to debug this, but I have no idea where I should be looking. Any hints?

You can use the following to get a more accurate idea of where the error occurred:

    library(rlang)
    nacho_data <- with_abort(load_rcc(
      data_directory = params$path_rccs,
      ssheet_csv = params$path_rcc_samplesheet,
      id_colname = "FILENAME"
    ))

My guess is the error involves a call to the internal function format_counts().

Also, are you sure the "FILENAME" values are unique?
Are all your RCC files multiplexed, or is there only one sample per file?

I only have one sample per file. I also checked whether my metasheet contains duplicates, which it doesn't. I'll try your suggestion now, thanks for the hint!

> length(unique(test$FILENAME))
[1] 245
> length(test$FILENAME)
[1] 245

I tried your suggestion, but it revealed nothing more :-(

 nacho_data <- with_abort(load_rcc(data_directory = params$path_rccs, 
+                        ssheet_csv = params$path_rcc_samplesheet,
+                        id_colname = "FILENAME")
+               )
[NACHO] Importing RCC files.
|=======================================================================================================================================================================================================================================================================================|100% ~0 s remaining     
[NACHO] Performing QC and formatting data.
[NACHO] Computing normalisation factors using "GEO" method.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 6, 10

I fixed a typo in the code, i.e., the library is rlang.
And what does rlang::last_trace() show?

Could you try to build a small reproducible example? For example, by only using two samples in your sample sheet. Maybe the error is related to the files.
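A minimal sketch of how such a reduced sample sheet could be built, assuming the sample sheet is a plain CSV with one row per RCC file. The helper name `subset_samplesheet` and its `rows` argument are hypothetical and not part of NACHO; only the `load_rcc()` arguments in the commented usage are taken from the call shown in this thread.

```r
# Hypothetical helper (not part of NACHO): write a copy of the sample sheet
# that keeps only the selected rows, and return the path of the new CSV.
subset_samplesheet <- function(ssheet_csv, rows = 1:2,
                               out_csv = tempfile(fileext = ".csv")) {
  ssheet <- read.csv(ssheet_csv, stringsAsFactors = FALSE)
  # Refuse row indices that do not exist in the sample sheet.
  stopifnot(all(rows %in% seq_len(nrow(ssheet))))
  write.csv(ssheet[rows, , drop = FALSE], out_csv, row.names = FALSE)
  out_csv
}

# Usage, assuming the same `params` as in the original call:
# small_csv <- subset_samplesheet(params$path_rcc_samplesheet, rows = 1:2)
# nacho_small <- load_rcc(data_directory = params$path_rccs,
#                         ssheet_csv = small_csv,
#                         id_colname = "FILENAME")
```

Picking one row from the older run and one from a newer run would show quickly whether mixing runs is what triggers the error.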

Yeah, I found that too - library(rlang) then resolved it 👍

I'm already going over the files, but there are rather a lot of them (~500). Some of the samples don't show any housekeeping genes in the gene description. Could that be the problem?

I also get this if I restrict the import to these RCC files plus some of my newer data:

[NACHO] Performing QC and formatting data.
non-unique values when setting 'row.names': ‘ABCF1’, ‘CLTC’, ‘EEF1G’, ‘PGK1’, ‘QRICH1’, ‘RAF1’, ‘RPL19’, ‘TUBA1C’
Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed

If I run only newer data that has properly set housekeeping genes in the RCC files, it runs perfectly fine, as before. The error only appears when I mix in samples from a run two years ago whose RCC files do not list housekeeping genes specifically (these genes are listed as "Endogenous" in the older run).

When I manually select a list of housekeeping genes, I still get the error message above. If I modify e.g. one RCC file and put all of the above genes into the same category as Housekeeping, the "Keys are shared ...." error pops up instead.

I designed/coded NACHO to read a homogeneous set of RCC files.
I tested NACHO using RCC file format versions 1.6 and above (in particular, I tested 1.6 and 1.7).


In that case, I would suggest running QC on the two sets separately and accounting for batch effects when merging the datasets.
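One way to prepare the separate QC runs is to split the sample sheet into one CSV per run and import each with its own `load_rcc()` call. This is only a sketch: the helper `split_samplesheet` and the batch column name (`"RUN"` in the usage comment) are assumptions, since the thread does not say how runs are recorded in the sample sheet.

```r
# Hypothetical helper (not part of NACHO): write one sample sheet per batch
# and return a named character vector of the resulting CSV paths.
split_samplesheet <- function(ssheet_csv, batch_colname, out_dir = tempdir()) {
  ssheet <- read.csv(ssheet_csv, stringsAsFactors = FALSE)
  stopifnot(batch_colname %in% names(ssheet))
  batches <- split(ssheet, ssheet[[batch_colname]])
  vapply(names(batches), function(b) {
    path <- file.path(out_dir, paste0("samplesheet_", b, ".csv"))
    write.csv(batches[[b]], path, row.names = FALSE)
    path
  }, character(1))
}

# Usage, assuming a column "RUN" identifies the run of each sample:
# paths <- split_samplesheet(params$path_rcc_samplesheet, batch_colname = "RUN")
# nacho_by_run <- lapply(paths, function(p) {
#   load_rcc(data_directory = params$path_rccs,
#            ssheet_csv = p,
#            id_colname = "FILENAME")
# })
```

The batch correction itself is outside this sketch and would be applied after the per-run objects are merged.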

All data were created using FileVersion 1.7.


I'll try to find out what the problem with the files is, then. I now also think that some detail in them is probably broken. The genes are identical in all RCC files, except for the definition of the categories. I guess I'll try to modify this in a way that makes things work properly and apply a batch correction, as you suggested.

Thanks for the insights!

I'll probably add a sanity check to ensure the files are homogeneous.
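Until such a check exists, something similar can be done by hand. The sketch below is not NACHO's implementation; the function names are hypothetical. It assumes each RCC file contains a `<Code_Summary>` ... `</Code_Summary>` section holding a CSV table with `CodeClass` and `Name` columns, and it groups files by their set of (CodeClass, Name) pairs.

```r
# Read the Code_Summary table of one RCC file into a data frame.
# Assumes exactly one <Code_Summary> section with a CSV header line.
read_code_summary <- function(rcc_file) {
  lines <- readLines(rcc_file, warn = FALSE)
  start <- which(trimws(lines) == "<Code_Summary>")
  end <- which(trimws(lines) == "</Code_Summary>")
  stopifnot(length(start) == 1, length(end) == 1, end > start + 1)
  read.csv(text = lines[(start + 1):(end - 1)], stringsAsFactors = FALSE)
}

# Group files that share the same set of (CodeClass, Name) pairs.
# A result of length 1 means all files have the same layout.
group_by_layout <- function(rcc_files) {
  signatures <- vapply(rcc_files, function(f) {
    cs <- read_code_summary(f)
    paste(sort(paste(cs$CodeClass, cs$Name, sep = ":")), collapse = "|")
  }, character(1))
  unname(split(rcc_files, match(signatures, unique(signatures))))
}

# Usage, assuming the RCC files sit directly in params$path_rccs:
# rcc_files <- list.files(params$path_rccs, pattern = "\\.RCC$",
#                         full.names = TRUE)
# groups <- group_by_layout(rcc_files)
# length(groups)  # more than 1 means the set is heterogeneous
```

In the situation described in this thread, files from the older run (genes listed as "Endogenous") and the newer runs (genes listed as "Housekeeping") would be expected to land in different groups.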

I found the culprit: we have a duplicated gene in each table, e.g. EGFR appears twice (although with different target sequences). This causes the R method to complain about the keys not being unique, which is in fact not a problem of your tool but a (bad) design choice.
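A quick way to spot this kind of duplicate is to look for repeated gene names in the code summary table of a single file. The sketch below uses a small made-up table; the column names follow the RCC `Code_Summary` layout as an assumption, and the accession and count values are placeholders, not real data.

```r
# Return all rows whose gene name occurs more than once in the table.
find_duplicated_names <- function(code_summary) {
  dup <- unique(code_summary$Name[duplicated(code_summary$Name)])
  code_summary[code_summary$Name %in% dup, , drop = FALSE]
}

# Illustrative table only: EGFR is listed twice with different
# (placeholder) accessions, mimicking two probes for the same gene.
example_summary <- data.frame(
  CodeClass = c("Endogenous", "Endogenous", "Endogenous", "Housekeeping"),
  Name      = c("EGFR", "EGFR", "RAF1", "ABCF1"),
  Accession = c("ACC_1", "ACC_2", "ACC_3", "ACC_4"),
  Count     = c(10, 12, 30, 50),
  stringsAsFactors = FALSE
)

find_duplicated_names(example_summary)
```

Any rows returned here share a name, so the name alone cannot serve as a unique key or as a row name.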

A sanity check sounds great. It would give future users a short hint about what is probably wrong with their data.

Otherwise, thanks again for this really cool method/tool. It has worked very well for quite a lot of data so far!

This is solved by #36 and #20