Path to converted per-cell matrices should reflect complete path relative to `geo` dir
hannes-ucsc opened this issue · 7 comments
(.venv) hannes@ip-172-31-39-22:~$ cd ~/load-project/projects/b0f40b69-943f-5959-9457-c8e53c2d480e/matrices
(.venv) hannes@ip-172-31-39-22:~/load-project/projects/b0f40b69-943f-5959-9457-c8e53c2d480e/matrices$ ls -l
total 24
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2510616_P4-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2510617_P7-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759554_5wk-1-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759555_5wk-2-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759556_P7D-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759557_P7E-matrix.mtx.gz
(.venv) hannes@ip-172-31-39-22:~/load-project/projects/b0f40b69-943f-5959-9457-c8e53c2d480e/matrices$ cd ~/load-project/projects/b26137d3-a709-5492-aa74-0d783e6b628b/matrices
(.venv) hannes@ip-172-31-39-22:~/load-project/projects/b26137d3-a709-5492-aa74-0d783e6b628b/matrices$ ls -l
total 4
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 17 07:02 cell_files
That directory name should be `GSE110154_RAW__cell_files`.
Correction: I wasn't aware that `cell_files` is made up. We should avoid making up path elements, so the directory name should be just `GSE110154_RAW`.
What about cases where there are multiple mutually incompatible cohorts of cell files mixed together in the RAW directory? Consider GSE75659. I decided on this approach to avoid potential name collisions between cohorts. Note the assertion I added to `Converter._convert_matrices`.
An alternative, I suppose, would be:
- Remove the unique names assertion
- Merge the `name` and `directory` parameters to `IndividualCellFiles`
- Have `Converter._convert_matrices` append an incrementor to non-unique names
But I don't think that really solves the problem since the matrix names still wouldn't be correct paths due to the incrementor. In short, I don't think we can use paths here because of incompatible cohorts being placed in the same folder.
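To make the objection concrete, here is a minimal sketch of what appending an incrementor to non-unique names could look like. The function name `dedupe_names` and the `_<n>` suffix format are hypothetical and not taken from the load-project code; the point is only that the resulting names no longer correspond to real paths under `geo`.

```python
from collections import Counter
from typing import Iterable, List


def dedupe_names(names: Iterable[str]) -> List[str]:
    """Append a 1-based counter to every name that occurs more than once.

    Hypothetical helper for illustration only. It does not guard against a
    suffixed name colliding with a name that already ends in `_<n>`.
    """
    names = list(names)
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how often each name has been emitted so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f'{name}_{seen[name]}')
        else:
            result.append(name)
    return result


# Two cohorts from the same RAW directory end up with names that are
# unique but are no longer valid paths relative to the geo directory.
unique = dedupe_names(['GSE75659_RAW', 'GSE75659_RAW', 'other'])
```

With this input, `unique` contains `GSE75659_RAW_1` and `GSE75659_RAW_2`, neither of which exists as a directory under `geo`.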
If the per-cell files reside directly in the `geo` directory (as opposed to in a subdirectory of the `geo` directory), the MTX triplet of files should be placed into the `matrices` folder directly. You may have to update DPSS to account for that, but it's the most logical mapping, in my view.
When you split the set of files up into cohorts (subsets of that set), give the cohort a name and append the cohort name to the input path. So
Cohort X:
geo/a/1.csv
geo/a/2.csv
becomes
matrices/a__X/matrix.mtx.gz
matrices/a__X/barcodes.gz
matrices/a__X/genes.gz
Cohort Y:
geo/a/3.csv
geo/a/4.csv
becomes
matrices/a__Y/matrix.mtx.gz
matrices/a__Y/barcodes.gz
matrices/a__Y/genes.gz
If there was no `a` subdir,
Cohort X:
geo/1.csv
geo/2.csv
becomes
matrices/X/matrix.mtx.gz
matrices/X/barcodes.gz
matrices/X/genes.gz
Cohort Y:
geo/3.csv
geo/4.csv
becomes
matrices/Y/matrix.mtx.gz
matrices/Y/barcodes.gz
matrices/Y/genes.gz
Does that make sense?
@hannes-ucsc Shouldn't cohort Y yield:
matrices/a__Y/matrix.mtx.gz
matrices/a__Y/barcodes.gz
matrices/a__Y/genes.gz
?
`a` is the full path to the directory containing the per-cell csv files (relative to the `geo` dir), `Y` is the name of the cohort, and `__` is the separator?
Yes, that's correct. I made a copy-paste error.
To avoid confusion, I'll edit my original comment.
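For reference, the naming scheme agreed on in this thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual converter code: the names `matrix_dir` and `mtx_triplet` are hypothetical, and joining nested path elements with `__` is inferred from the directory listing at the top of the issue.

```python
from pathlib import PurePosixPath
from typing import List, Optional

SEPARATOR = '__'


def matrix_dir(cell_file_dir: str, cohort: Optional[str] = None) -> PurePosixPath:
    """Map the directory holding per-cell files to its output directory.

    `cell_file_dir` is the path relative to the geo dir; pass '' when the
    files sit directly in the geo dir. `cohort` is the optional cohort name.
    (Hypothetical helper; joining nested elements with `__` is an assumption.)
    """
    parts = list(PurePosixPath(cell_file_dir).parts)
    if cohort is not None:
        parts.append(cohort)
    if not parts:
        # No subdirectory and no cohort: the triplet goes directly into matrices/
        return PurePosixPath('matrices')
    return PurePosixPath('matrices', SEPARATOR.join(parts))


def mtx_triplet(cell_file_dir: str, cohort: Optional[str] = None) -> List[str]:
    """Return the three output file paths, using the file names from this thread."""
    base = matrix_dir(cell_file_dir, cohort)
    return [str(base / name) for name in ('matrix.mtx.gz', 'barcodes.gz', 'genes.gz')]


# Cohort Y of the per-cell files in geo/a
triplet = mtx_triplet('a', 'Y')
```

Here `mtx_triplet('a', 'Y')` yields the three `matrices/a__Y/...` paths from the example above, `mtx_triplet('', 'Y')` yields `matrices/Y/...`, and a directory without cohorts such as `GSE110154_RAW` maps to `matrices/GSE110154_RAW/...`.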