bedapub/besca

Spaces in barcode index result in export inconsistencies and MongoDB errors

Closed this issue · 2 comments

Besca uses the CELL column (the concatenation of barcode and sample ID) from metadata.tsv as the index in adata.
Important: CELL should not contain spaces. This can happen if the sample ID contains a space, as in 48301_CD20 TCB_6h_sc, e.g. when retrieving metadata from study registration. The resulting name in CELL is BC_604659.48301_CD20 TCB_6h_sc

The Besca workflow does not complain about the space; however, in all exports the name is truncated at the space. This results in inconsistent barcode names across the different files, which in turn raises MongoDB upload errors.

Possible solution: raise an error early on, during read mtx, if CELL contains spaces. Should this check also be implemented when registering samples in study registration?
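
For illustration, the early check could look roughly like the sketch below. It is a standalone example, not Besca code: the function name `check_cell_names` is hypothetical, and it assumes the CELL values are available as a plain list of strings before they are set as the adata index.

```python
import re

# Matches any whitespace character (space, tab, newline, ...).
_WHITESPACE = re.compile(r"\s")


def check_cell_names(cell_names):
    """Raise ValueError if any CELL value contains whitespace.

    Hypothetical helper; `cell_names` is assumed to be an iterable of
    strings taken from the CELL column of metadata.tsv.
    """
    offending = [name for name in cell_names if _WHITESPACE.search(name)]
    if offending:
        # Show only the first few offenders to keep the message readable.
        preview = ", ".join(repr(name) for name in offending[:5])
        raise ValueError(
            f"{len(offending)} CELL value(s) contain whitespace, e.g. {preview}"
        )


# A clean name passes silently.
check_cell_names(["BC_604659.48301_CD20_TCB_6h_sc"])

# The name from this issue is rejected.
try:
    check_cell_names(["BC_604659.48301_CD20 TCB_6h_sc"])
except ValueError as err:
    print(err)
```

Failing early like this would stop the workflow before any export is written, so the truncated names never reach MongoDB.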

@llumdi please have a look at the commit 14b233a. It silently replaces all spaces with a _. We could add a warning.
If such a replacement is needed somewhere else as well, please comment here.
One could also add some replacements on the export side to ensure that everything which is exported can be ingested into MongoDB.
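
A minimal sketch of the replacement combined with a warning, for discussion. This is not the code from the commit: `sanitize_cell_names` is a hypothetical name, and the example assumes the CELL values are handled as a plain list of strings.

```python
import warnings


def sanitize_cell_names(cell_names, replacement="_"):
    """Replace spaces in CELL values and warn if anything was changed.

    Hypothetical helper, not part of Besca.
    """
    cell_names = list(cell_names)
    cleaned = [name.replace(" ", replacement) for name in cell_names]
    n_changed = sum(1 for old, new in zip(cell_names, cleaned) if old != new)
    if n_changed:
        warnings.warn(
            f"{n_changed} CELL value(s) contained spaces; "
            f"replaced with {replacement!r}. Please fix the sample names at the source.",
            UserWarning,
            stacklevel=2,
        )
    return cleaned


print(sanitize_cell_names(["BC_604659.48301_CD20 TCB_6h_sc"]))
```

Emitting a single summary warning rather than one per cell keeps the log readable for large datasets.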

Thanks for fixing this @kohleman. Adding a warning is good, so that the user 1) is aware that spaces should not be used and 2) can correct the names in case a space was overlooked in the input file or at study registration (aka MOOSE).

And yes, I agree that adding some checks to the export functions to make sure the export is compliant with MongoDB is also a good suggestion. Also, MOOSE should clearly state that spaces must not be used in sample names.
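
One possible shape for such an export-side check, as a rough sketch: compare the names in the adata index with the names that actually ended up in an exported file. The function name `find_missing_barcodes` is hypothetical, and both inputs are assumed to be plain lists of strings.

```python
def find_missing_barcodes(index_names, exported_names):
    """Return index names that do not appear in the exported names.

    Hypothetical helper, not part of Besca. An empty result means the
    export is consistent with the index.
    """
    exported = set(exported_names)
    return [name for name in index_names if name not in exported]


# The case from this issue: the exported name was cut off at the space.
missing = find_missing_barcodes(
    ["BC_604659.48301_CD20 TCB_6h_sc"],
    ["BC_604659.48301_CD20"],
)
print(missing)
```

An export function could raise an error when the returned list is non-empty, before anything is uploaded to MongoDB.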