DailyDreaming/load-project

Path to converted per-cell matrices should reflect complete path relative to `geo` dir

hannes-ucsc opened this issue · 7 comments

(.venv) hannes@ip-172-31-39-22:~$ cd ~/load-project/projects/b0f40b69-943f-5959-9457-c8e53c2d480e/matrices
(.venv) hannes@ip-172-31-39-22:~/load-project/projects/b0f40b69-943f-5959-9457-c8e53c2d480e/matrices$ ls -l
total 24
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2510616_P4-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2510617_P7-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759554_5wk-1-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759555_5wk-2-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759556_P7D-matrix.mtx.gz
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 14 22:57 GSE103275_RAW__GSM2759557_P7E-matrix.mtx.gz
(.venv) hannes@ip-172-31-39-22:~/load-project/projects/b0f40b69-943f-5959-9457-c8e53c2d480e/matrices$ cd ~/load-project/projects/b26137d3-a709-5492-aa74-0d783e6b628b/matrices
(.venv) hannes@ip-172-31-39-22:~/load-project/projects/b26137d3-a709-5492-aa74-0d783e6b628b/matrices$ ls -l
total 4
drwxrwsr-x 2 ubuntu ucsc 4096 Jan 17 07:02 cell_files

That directory name should be GSE110154_RAW__cell_files.

Correction: I wasn't aware that cell_files is made up. We should avoid making up path elements. So the directory name should be just GSE110154_RAW.

What about cases where there are multiple mutually incompatible cohorts of cell files mixed together in the RAW directory? Consider GSE75659. I decided on this approach to avoid potential name collisions between cohorts. Note the assertion I added to Converter._convert_matrices.

An alternative, I suppose, would be:

  1. Remove the unique names assertion
  2. Merge the name and directory parameters to IndividualCellFiles
  3. Have Converter._convert_matrices append an incrementor to non-unique names

But I don't think that really solves the problem since the matrix names still wouldn't be correct paths due to the incrementor. In short, I don't think we can use paths here because of incompatible cohorts being placed in the same folder.
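To illustrate the objection, here is a minimal sketch of what an incrementor scheme might look like. The function name `dedupe_names` and the `_2`, `_3` suffix format are hypothetical and not taken from `Converter._convert_matrices`; the sketch also does not guard against a suffixed name colliding with a name that already exists.

```python
from collections import Counter


def dedupe_names(names):
    """Append a numeric suffix to every repeated name (hypothetical scheme)."""
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        count = seen[name]
        # The first occurrence keeps its name; later ones get _2, _3, ...
        result.append(name if count == 1 else f"{name}_{count}")
    return result


# Two cohorts that both live in the same RAW directory
print(dedupe_names(["GSE75659_RAW", "GSE75659_RAW"]))
```

The second name comes out as `GSE75659_RAW_2`, which no longer corresponds to any directory under `geo`, so the matrix name stops being a real path.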

If the per-cell files reside directly in the geo directory (as opposed to in a subdirectory of the geo directory), the MTX triplet of files should be placed into the matrices folder directly. You may have to update DPSS to account for that but it's the most logical mapping, in my view.

When you split the set of files up into cohorts (subsets of that set), give each cohort a name and append the cohort name to the input path. For example:

Cohort X:

geo/a/1.csv
geo/a/2.csv

becomes

matrices/a__X/matrix.mtx.gz
matrices/a__X/barcodes.gz
matrices/a__X/genes.gz

Cohort Y:

geo/a/3.csv
geo/a/4.csv

becomes

matrices/a__Y/matrix.mtx.gz
matrices/a__Y/barcodes.gz
matrices/a__Y/genes.gz

If there were no `a` subdirectory:

Cohort X:

geo/1.csv
geo/2.csv

becomes

matrices/X/matrix.mtx.gz
matrices/X/barcodes.gz
matrices/X/genes.gz

Cohort Y:

geo/3.csv
geo/4.csv

becomes

matrices/Y/matrix.mtx.gz
matrices/Y/barcodes.gz
matrices/Y/genes.gz

Does that make sense?

@hannes-ucsc Shouldn't cohort Y yield:

matrices/a__Y/matrix.mtx.gz
matrices/a__Y/barcodes.gz
matrices/a__Y/genes.gz

? a is the full path to the directory containing the per-cell csv files (relative to the geo dir), Y is the name of the cohort, and __ is the separator?

Yes, that's correct. I made a copy paste error.

To avoid confusion, I'll edit my original comment.
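For reference, the naming rule agreed on above could be sketched roughly as follows. The function `matrix_dir` and the constant `SEPARATOR` are hypothetical names, not part of the existing code, and joining nested directory components with `__` is an assumption based on the directory listing at the top of this issue.

```python
from pathlib import PurePosixPath
from typing import Optional

SEPARATOR = "__"  # separator between path elements and the cohort name


def matrix_dir(geo_relative_dir: str, cohort: Optional[str] = None) -> PurePosixPath:
    """Map a directory (relative to geo) and an optional cohort name
    to the directory under matrices that holds the MTX triplet."""
    # An empty string or "." means the files sit directly in the geo dir
    parts = list(PurePosixPath(geo_relative_dir).parts)
    if cohort is not None:
        parts.append(cohort)
    base = PurePosixPath("matrices")
    # No subdirectory and no cohort: the triplet goes straight into matrices
    return base / SEPARATOR.join(parts) if parts else base


print(matrix_dir("a", "X"))          # matrices/a__X
print(matrix_dir("", "Y"))           # matrices/Y
print(matrix_dir("GSE110154_RAW"))   # matrices/GSE110154_RAW
```

Since no path element is made up, a name like `cell_files` could only appear if it were an actual directory under `geo` or an explicitly chosen cohort name.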