mcanouil/NACHO

Keys are shared for 2 rows


Brief description of the problem

Importing many samples from different runs results in the following error message:

 nacho_data <- load_rcc(data_directory = params$path_rccs, 
+                        ssheet_csv = params$path_rcc_samplesheet,
+                        id_colname = "FILENAME")
[NACHO] Importing RCC files.
|=======================================================================================================================================================================================================================================================================================|100% ~0 s remaining     
[NACHO] Performing QC and formatting data.
[NACHO] Computing normalisation factors using "GEO" method.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 6, 10

I'd love to debug this, but I have no idea where to look. Any hints?

You can use the following to get a more accurate idea of where the error occurred:

rlang::with_abort(
  load_rcc(
    data_directory = params$path_rccs, 
    ssheet_csv = params$path_rcc_samplesheet,
    id_colname = "FILENAME"
  )
)

My guess is the error involves a call to the internal function format_counts().

Also, are you sure the "FILENAME" values are unique?
Are all your RCC files multiplexed, or is there only one sample per file?

I only have one sample per file. I also checked whether my metasheet contains duplicates, which it doesn't. I'll try your suggestion now, thanks for the hint!

> length(unique(test$FILENAME))
[1] 245
> length(test$FILENAME)
[1] 245
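Comparing the two lengths only shows whether duplicates exist. As a minimal base-R sketch, the offending values can be listed directly; the sample sheet and file names below are made up for illustration and stand in for the `test` data frame above:

```r
# Hypothetical sample sheet standing in for the `test` data frame.
test <- data.frame(
  FILENAME = c("run1_01.RCC", "run1_02.RCC", "run2_01.RCC", "run1_02.RCC"),
  stringsAsFactors = FALSE
)

# TRUE only when every identifier occurs exactly once.
all_unique <- !anyDuplicated(test$FILENAME)

# Identifiers that occur more than once (empty when all are unique).
duplicated_ids <- unique(test$FILENAME[duplicated(test$FILENAME)])
print(duplicated_ids)
```

With 245 unique values out of 245, as reported here, `duplicated_ids` would be empty, which rules out the sample sheet as the source of the duplicated keys.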

I tried your suggestion, but it revealed nothing more :-(

 nacho_data <- with_abort(load_rcc(data_directory = params$path_rccs, 
+                        ssheet_csv = params$path_rcc_samplesheet,
+                        id_colname = "FILENAME")
+               )
[NACHO] Importing RCC files.
|=======================================================================================================================================================================================================================================================================================|100% ~0 s remaining     
[NACHO] Performing QC and formatting data.
[NACHO] Computing normalisation factors using "GEO" method.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 6, 10

I fixed a typo in the code, i.e., the library is rlang.
What about rlang::last_trace()?

Could you try to build a small reproducible example? For example, by using only two samples in your sample sheet. Maybe the error is related to the files.
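A rough sketch of how such a reduced sample sheet could be built in base R. The `RUN` column and the file names are assumptions made for illustration and are not taken from the actual sample sheet; the `load_rcc()` call is left commented out because it needs NACHO and the real RCC files:

```r
# Hypothetical sample sheet; only FILENAME is known from the thread,
# the RUN column is assumed here to tell old and new runs apart.
ssheet <- data.frame(
  FILENAME = c("old_01.RCC", "old_02.RCC", "new_01.RCC", "new_02.RCC"),
  RUN = c("old", "old", "new", "new"),
  stringsAsFactors = FALSE
)

# Keep the first sample of each run: two samples in total.
small <- ssheet[!duplicated(ssheet$RUN), ]

# Write the reduced sample sheet to a temporary CSV file.
small_csv <- tempfile(fileext = ".csv")
write.csv(small, small_csv, row.names = FALSE)

# Then, with NACHO installed and the RCC files in place:
# nacho_small <- NACHO::load_rcc(
#   data_directory = params$path_rccs,
#   ssheet_csv = small_csv,
#   id_colname = "FILENAME"
# )
```

If the error persists with one old and one new sample but disappears with two new ones, the problem is likely in the files rather than in the sample sheet.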

Yeah, I found that too - library(rlang) then resolved it 👍

I'm already going through the files, but there are rather a lot (~500). Some of the samples don't show any housekeeping genes in the gene description. Is that maybe the problem?

I also get this if I restrict the import to these RCC files plus some of my newer data:

[NACHO] Performing QC and formatting data.
non-unique values when setting 'row.names': ‘ABCF1’, ‘CLTC’, ‘EEF1G’, ‘PGK1’, ‘QRICH1’, ‘RAF1’, ‘RPL19’, ‘TUBA1C’
Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed

If I run only the newer data, which has housekeeping genes properly set in the RCC files, everything runs perfectly fine as before. The error only appears when I mix in samples from a run two years ago whose RCC files don't list housekeeping genes specifically (these are listed as "Endogenous" in the older run).

When I manually select a list of housekeeping genes, I still get the error message above. If I modify, e.g., one RCC file and put all of the above genes into the same category as Housekeeping, the "Keys are shared ...." error pops up.

I designed/coded NACHO to read a homogeneous set of RCC files.
I tested NACHO with RCC file format versions 1.6 and above (in particular, 1.6 and 1.7).

<Header>
FileVersion,1.7
SoftwareVersion,4.0.0.3
</Header>

In that case, I would suggest running QC on the two sets separately and accounting for batch effects when merging the datasets.

All data was created using FileVersion 1.7

<Header>
FileVersion,1.7
SoftwareVersion,3.0.1.4
</Header>

I'll try to find out what's wrong with the files then. I now also think that something in them is probably broken in some detail. The genes are identical in all RCC files, except for the definition of the categories. I guess I'll try to modify the files so that things work properly, and apply a batch correction as you suggested.

Thanks for the insights!

I'll probably add some sanity checks to ensure files are homogeneous.
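As an illustration only (this is not NACHO's implementation), such a check could compare the CodeClass/Name pairs across files. The sketch below assumes each RCC file carries a `<Code_Summary>` section with a `CodeClass,Name,Accession,Count` header; the function names, file names, accessions and counts are all hypothetical:

```r
# Extract the <Code_Summary> table from the lines of one RCC file.
# Assumes the section holds CSV rows with a CodeClass,Name,Accession,Count header.
read_code_summary <- function(rcc_lines) {
  rcc_lines <- trimws(rcc_lines)
  start <- which(rcc_lines == "<Code_Summary>")
  end <- which(rcc_lines == "</Code_Summary>")
  read.csv(text = rcc_lines[(start + 1):(end - 1)], stringsAsFactors = FALSE)
}

# Return the names of the files whose CodeClass/Name pairs differ from the first file.
find_heterogeneous <- function(rcc_list) {
  keys <- lapply(rcc_list, function(rcc_lines) {
    code_summary <- read_code_summary(rcc_lines)
    sort(paste(code_summary$CodeClass, code_summary$Name, sep = ":"))
  })
  differs <- !vapply(keys, identical, logical(1), keys[[1]])
  names(keys)[differs]
}

# Mock file contents: the old run labels ABCF1 as Endogenous instead of Housekeeping.
new_run <- c(
  "<Code_Summary>", "CodeClass,Name,Accession,Count",
  "Endogenous,EGFR,ACC_1,120", "Housekeeping,ABCF1,ACC_2,300",
  "</Code_Summary>"
)
old_run <- c(
  "<Code_Summary>", "CodeClass,Name,Accession,Count",
  "Endogenous,EGFR,ACC_1,98", "Endogenous,ABCF1,ACC_2,250",
  "</Code_Summary>"
)

rcc_list <- list("new_run.RCC" = new_run, "old_run.RCC" = old_run)
heterogeneous <- find_heterogeneous(rcc_list)
print(heterogeneous)
```

On real data, `rcc_list` could be built with `setNames(lapply(files, readLines), basename(files))`, where `files` is a vector of RCC file paths.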

I found the culprit: we have a duplicated gene in each table, e.g. 2x EGFR (although with different target sequences). This causes R to complain about the names not being unique, which is in fact not a problem with your tool but a (bad) design choice.
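A small base-R sketch of this situation, using a made-up table (the counts are invented): it lists the duplicated gene names and reproduces the `row.names` error seen earlier. The `make.unique()` step is only one possible workaround, not something NACHO is known to do.

```r
# Hypothetical gene table with EGFR listed twice, as in the affected RCC files.
code_summary <- data.frame(
  CodeClass = c("Endogenous", "Endogenous", "Housekeeping"),
  Name = c("EGFR", "EGFR", "ABCF1"),
  Count = c(120L, 95L, 300L),
  stringsAsFactors = FALSE
)

# Gene names that appear more than once.
dup_names <- unique(code_summary$Name[duplicated(code_summary$Name)])
print(dup_names)

# Using the names as row names fails; capture the error message instead of stopping.
row_names_error <- suppressWarnings(tryCatch(
  {
    rownames(code_summary) <- code_summary$Name
    "no error"
  },
  error = function(e) conditionMessage(e)
))
print(row_names_error)

# One possible workaround: disambiguate repeated names with a numeric suffix.
unique_names <- make.unique(code_summary$Name)
print(unique_names)
```

Whether renaming is appropriate depends on whether the two probes should be analysed separately; dropping or merging one of them would be a different design decision.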

A sanity check sounds great. It would give future users a small hint about what's probably wrong with their data.

Otherwise, again thanks for this really cool method/tool, it works very well for quite some data now!

This is solved by #36 and #20