Is matrixValidity printing row and column wrong?
I'm trying out a dataset with 333 cells and 60329 genes.
> head input/matrix.mtx
%%MatrixMarket matrix coordinate real general
333 60329 2995863
40 1 1.064
72 1 1.036
152 1 15
1 2 2
128 2 16
258 2 1
40 9 116.936
72 9 12.964
When running:
too-many-cells make-tree \
--matrix-path input \
--output out
"Warning: mismatch in number of (features, cells) (60329,333) with matrix (rows, columns) (333,60329), will probably result in error."..................................] 0%
too-many-cells: matMat : incompatible matrix sizes((333,60329),(88,1))
CallStack (from HasCallStack):
error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-8Sr5Y9guRLx7MwdljauHcO:Data.Sparse.SpMatrix
I wasn't sure what the error meant and checked the code. It looks like it's printing the matrix's cols,rows instead of rows,cols as the message says. However, the matrix does have 333 rows and 60329 columns, so I'm not sure how to interpret it or know if my matrix is set up wrong. I'm also not sure what the (88, 1) means.
    -- | Check validity of matrix.
    matrixValidity :: (MatrixLike a) => a -> Maybe String
    matrixValidity mat
      | rows /= numCells || cols /= numFeatures =
          Just $ "Warning: mismatch in number of (features, cells) ("
              <> show numFeatures
              <> ","
              <> show numCells
              <> ") with matrix (rows, columns) ("
              <> show cols
              <> ","
              <> show rows
              <> "), will probably result in error."
      | otherwise = Nothing
      where
        (rows, cols) = S.dimSM . getMatrix $ mat
        numCells     = V.length . getRowNames $ mat
        numFeatures  = V.length . getColNames $ mat
I'm having issues where not all 333 cells are in the final clusters.csv output: fewer than 100 make it, depending on the filter-thresholds used. Since the documentation said this was optional, I took it out, but found this error.
I've made a draft PR swapping the two prints if it is actually wrong.
Thanks for catching the error in the error!
You are right, the actual error is in the (88, 1) portion. Usually this happens when filtering or input is unexpected in some way. For instance, you say that you have 333 cells, but they are rows in the input matrix. Cell Ranger's matrix market output has cells as columns, so that is what we used for input. Could you check that your barcode and features files match that? If not, you can also transpose the matrix with the --matrix-transpose argument, but you need to make sure those other files match.
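To make concrete what transposing does to a coordinate-format mtx file, here is a minimal Python sketch. It is only an illustration and not how too-many-cells implements --matrix-transpose. The function name `transpose_mtx_lines` is hypothetical, the toy input reuses the first few entries from the file above with a shortened entry count, and the sketch assumes a `general` (non-symmetric) matrix.

```python
def transpose_mtx_lines(lines):
    """Transpose a MatrixMarket coordinate file given as a list of text lines.

    Hypothetical helper for illustration: swaps the row and column counts on
    the size line and the row and column index of every entry. Comment lines
    (starting with '%') are passed through unchanged. Entries are not
    re-sorted.
    """
    out = []
    size_line_seen = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.startswith("%"):
            out.append(stripped)  # banner or comment
            continue
        fields = stripped.split()
        if not size_line_seen:
            n_rows, n_cols, n_entries = fields
            out.append(f"{n_cols} {n_rows} {n_entries}")
            size_line_seen = True
        else:
            row, col, value = fields
            out.append(f"{col} {row} {value}")
    return out


# Toy input: same layout as the file in this thread, but with only 3 entries.
toy = [
    "%%MatrixMarket matrix coordinate real general",
    "333 60329 3",
    "40 1 1.064",
    "72 1 1.036",
    "1 2 2",
]
for transposed_line in transpose_mtx_lines(toy):
    print(transposed_line)
```

After a transpose like this, the barcodes file has to describe the new columns and the features file the new rows, which is why the other files need to match.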
I had initially tried cells as columns, but only got one cell in the output with the default filter thresholds:
clusters.csv:
cell,cluster,path
SE6052_SA56912_S1_L001_R1_001.merged_quant,0,0
I tried different threshold combinations, but still only had this one cell in the output.
Not including filter thresholds gives:
too-many-cells: matMat : incompatible matrix sizes((60329,333),(1,1))...........................] 0%
CallStack (from HasCallStack):
error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-8Sr5Y9guRLx7MwdljauHcO:Data.Sparse.SpMatrix
However, using the previous cells as rows and genes as columns mtx format, I ran
too-many-cells make-tree \
--matrix-path input \
--matrix-transpose \
--output out
This listed the genes instead of the barcodes as cells in its output, so I swapped the genes and barcodes filenames so that genes.tsv contains the cell ids and barcodes.tsv contains the genes. Then when running the previous command, I got all 333 cells in the output and a dendrogram.
clusters.csv:
cell,cluster,path
SE6052_SA57031_S120_L001_R1_001.merged_quant,3,3/2/1/0
SE6054_SA56854_S135_L002_R1_001_quant,3,3/2/1/0
SE6054_SA56903_S184_L002_R1_001_quant,3,3/2/1/0
SE6052_SA56932_S21_L001_R1_001.merged_quant,5,5/4/2/1/0
SE6052_SA56952_S41_L001_R1_001.merged_quant,5,5/4/2/1/0
...
I'm not sure where I've gone wrong, but doing it the opposite (and correct) way doesn't work: a matrix with cells as columns and genes as rows, genes.tsv with gene ids, and barcodes.tsv with cell ids gives the incompatible matrix size error.
Can I see a head of each file (matrix, features, barcodes) with each file name?
For the actual case (non-transposed).
==> barcodes.tsv <==
SE6052_SA56912_S1_L001_R1_001.merged_quant
SE6052_SA56914_S3_L001_R1_001.merged_quant
SE6052_SA56915_S4_L001_R1_001.merged_quant
SE6052_SA56916_S5_L001_R1_001.merged_quant
SE6052_SA56917_S6_L001_R1_001.merged_quant
SE6052_SA56918_S7_L001_R1_001.merged_quant
SE6052_SA56919_S8_L001_R1_001.merged_quant
SE6052_SA56920_S9_L001_R1_001.merged_quant
SE6052_SA56921_S10_L001_R1_001.merged_quant
SE6052_SA56922_S11_L001_R1_001.merged_quant
==> genes.tsv <==
ENSG00000223972
ENSG00000243485
ENSG00000284332
ENSG00000268020
ENSG00000240361
ENSG00000186092
ENSG00000233750
ENSG00000241599
ENSG00000279928
ENSG00000286448
==> matrix.mtx <==
%%MatrixMarket matrix coordinate real general
60329 333 2995863
2 1 2
15 1 138.588
16 1 155.831
17 1 61.541
18 1 1
20 1 854.483
21 1 2.003
25 1 35.331
> wc -l barcodes.tsv
333 barcodes.tsv
> wc -l genes.tsv
60329 genes.tsv
And can I see the command you ran with those files along with the error? Is it the first comment of this thread?
Yes, with this data it was
too-many-cells make-tree \
--matrix-path input \
--output out
Adding this
too-many-cells make-tree \
--filter-thresholds "(250, 1)" \
--matrix-path input \
--output out
removes the error, giving one cell in the output.
I'm starting to confuse myself. I've reverted the pull request as I realized I was referring to the rows as features and cells as columns like with Cell Ranger (even though for the program our convention is cells as rows). Hence the swap.
For your problem, this means that the original error is that the feature file had 60329 rows but the matrix had 333 rows (and vice versa for columns). Based on what you sent me (the wc -l for each), I don't know why this would happen. As you can see in
Silly question, what is in the input folder?
What happens if you do not change your features and barcode files but use -T?
Also, you should not use that "default" filtering threshold, as your values are definitely not 10x scRNA-seq. Just leave them at (0,0) for now.
Could you also send the tail of each file?
Silly question, what is in the input folder?
> ls input
barcodes.tsv
genes.tsv
matrix.mtx
Could you also send the tail of each file?
==> barcodes.tsv <==
SE6054_SA56902_S183_L002_R1_001_quant
SE6054_SA56903_S184_L002_R1_001_quant
SE6054_SA56904_S185_L002_R1_001_quant
SE6054_SA56905_S186_L002_R1_001_quant
SE6054_SA56906_S187_L002_R1_001_quant
SE6054_SA56907_S188_L002_R1_001_quant
SE6054_SA56908_S189_L002_R1_001_quant
SE6054_SA56909_S190_L002_R1_001_quant
SE6054_SA56910_S191_L002_R1_001_quant
SE6054_SA56911_S192_L002_R1_001_quant
==> genes.tsv <==
ENSG00000277761
ENSG00000277836
ENSG00000275869
ENSG00000273554
ENSG00000278633
ENSG00000278066
ENSG00000276017
ENSG00000278817
ENSG00000277196
ENSG00000278625
==> matrix.mtx <==
60270 333 2
60271 333 26
60273 333 68
60274 333 436
60275 333 3
60278 333 393
60279 333 258
60287 333 11
60288 333 1.999
60291 333 2
What happens if you do not change your features and barcode files but use -T?
too-many-cells make-tree \
--matrix-path input \
-T \
--output input
too-many-cells: matMat : incompatible matrix sizes((333,60329),(155,1)).......................] 0%
CallStack (from HasCallStack):
error, called at src/Data/Sparse/SpMatrix.hs:793:22 in sparse-linear-algebra-0.3.2-8Sr5Y9guRLx7MwdljauHcO:Data.Sparse.SpMatrix
If it's helpful, the data are from a new scRNA-seq protocol and I'm using the data from that paper. I got a SingleCellExperiment object, pulled the count matrix from it, and wrote it to an mtx file.
I'm a little confused: in the first comment in this thread you had cells as rows, but in the latest one you have them as columns. Which is the original, and which error goes with which matrix?
Just to be clear, in a perfect world where it works, the matrix should have 333 columns, the barcode file should have 333 lines, there should be no transposing, and the filters should be 0.
Try testing on the example from the TooManyCells workshop to see if it has the appropriate output, and see if the inputs match yours.
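The "perfect world" layout described above can be checked mechanically before running anything. The following Python sketch assumes the Cell Ranger-style convention mentioned in this thread (features as matrix rows, cells as matrix columns). The function name `check_mtx_dimensions` is hypothetical, and the counts are the ones reported by wc -l earlier.

```python
def check_mtx_dimensions(size_line, n_features, n_barcodes):
    """Compare the mtx size line with the feature and barcode line counts.

    Hypothetical helper: `size_line` is the first non-comment line of the
    mtx file ("rows columns entries"). Returns a list of problems, which is
    empty when rows match features and columns match barcodes.
    """
    n_rows, n_cols = (int(x) for x in size_line.split()[:2])
    problems = []
    if n_rows != n_features:
        problems.append(
            f"matrix has {n_rows} rows but the features file has {n_features} lines"
        )
    if n_cols != n_barcodes:
        problems.append(
            f"matrix has {n_cols} columns but the barcodes file has {n_barcodes} lines"
        )
    return problems


# Cells as columns (the layout described as expected in this thread).
print(check_mtx_dimensions("60329 333 2995863", n_features=60329, n_barcodes=333))

# Cells as rows (the layout from the first comment): both checks fail.
for problem in check_mtx_dimensions("333 60329 2995863", n_features=60329, n_barcodes=333):
    print(problem)
```

A clean result here only says the declared sizes agree. It does not say anything about whether the entries inside the file are well formed.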
The first comment had cells as rows and genes as columns and produced that error about mismatched dimensions (incompatible matrix sizes((333,60329),(88,1))). I had not created the matrix the way too-many-cells expects.
The last comment from me is using the same data, but, as expected, with cells as columns and genes as rows. The perfect world scenario also produced an error with different mismatched dimensions: incompatible matrix sizes((333,60329),(155,1)).
In both cases, using the --filter-thresholds flag removes the error. When the matrix is set up as too-many-cells expects, only the first cell is present in the output regardless of the filter threshold values.
Try testing on the example from the TooManyCells workshop to see if it has the appropriate output, and see if the inputs match yours.
I ran this and it worked perfectly. There is probably something up with my matrix - I'm having a hard time tracking down the source of the error. As far as I can tell the inputs match in mtx header and the counts of features and barcodes.
My comment with the dendrogram picture was the only time I did a transpose of the 'wrong' matrix format (like my first comment) and swapped genes/barcodes filenames. It's also the only one that produces results without filters.
Sorry for the confusion and thanks for taking the time to go through this!
The only difference I found between the workshop data and mine is the numeric type - real vs integer:
mine:
%%MatrixMarket matrix coordinate real general
60329 333 2995863
2 1 2
15 1 138.588
16 1 155.831
17 1 61.541
18 1 1
20 1 854.483
21 1 2.003
25 1 35.331
wc -l genes.tsv
60329
wc -l barcodes.tsv
333
workshop brain:
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.0.0"}
31053 1301 4220492
30976 1 73
30974 1 1
30973 1 56
30972 1 2
30971 1 11
30970 1 116
30969 1 123
zcat too-many-cells/workshop/data/brain/filtered_feature_bc_matrix/barcodes.tsv.gz | wc -l
1301
zcat too-many-cells/workshop/data/brain/filtered_feature_bc_matrix/features.tsv.gz | wc -l
31053
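The real vs integer difference noted above lives in the banner line, which in the MatrixMarket format has the shape `%%MatrixMarket object format field symmetry`. As a rough Python sketch (the names `parse_mtx_banner` and `has_fractional_values` are hypothetical), the field type can be read from the banner and compared with what the entries actually contain:

```python
def parse_mtx_banner(line):
    """Split a MatrixMarket banner line into its four descriptors."""
    parts = line.strip().split()
    if len(parts) != 5 or parts[0] != "%%MatrixMarket":
        raise ValueError(f"not a MatrixMarket banner: {line!r}")
    obj, fmt, field, symmetry = (p.lower() for p in parts[1:])
    return {"object": obj, "format": fmt, "field": field, "symmetry": symmetry}


def has_fractional_values(entry_lines):
    """Return True if any 'row col value' entry has a non-whole value."""
    return any(float(line.split()[2]) % 1 != 0 for line in entry_lines)


mine = parse_mtx_banner("%%MatrixMarket matrix coordinate real general")
workshop = parse_mtx_banner("%%MatrixMarket matrix coordinate integer general")
print(mine["field"], workshop["field"])

# A few entries copied from the two files shown above.
print(has_fractional_values(["2 1 2", "15 1 138.588", "16 1 155.831"]))
print(has_fractional_values(["30976 1 73", "30974 1 1", "30973 1 56"]))
```

This only confirms that the two files differ in value type. Whether that difference matters to too-many-cells is not established in this thread.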
Very weird. What if you use a csv? It will be slower but we can see if it has something to do with the mtx file which I think is the culprit.
That worked! The tree also makes more biological sense than the transposed approach. I'll try converting between csv/mtx and figure out where my mtx file is broken.
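For tracking down where an mtx file is broken, one possible starting point is a structural check of the body against the size line. The Python sketch below is an assumption about what is worth checking, not a diagnosis of this particular file. The name `validate_mtx_body` is hypothetical, and the sketch assumes the size line itself is well formed.

```python
from collections import Counter


def validate_mtx_body(lines):
    """Check coordinate entries of an mtx file against its size line.

    Hypothetical helper: reports a wrong entry count, malformed entries,
    non-integer or out-of-range (1-based) indices, and duplicate coordinates.
    Returns a list of problem descriptions, empty if none were found.
    """
    data = [
        ln.split()
        for ln in lines
        if ln.strip() and not ln.lstrip().startswith("%")
    ]
    if not data:
        return ["no size line found"]
    n_rows, n_cols, n_entries = (int(x) for x in data[0][:3])
    entries = data[1:]
    problems = []
    if len(entries) != n_entries:
        problems.append(
            f"size line declares {n_entries} entries but {len(entries)} were found"
        )
    seen = Counter()
    for fields in entries:
        if len(fields) != 3:
            problems.append(f"malformed entry: {' '.join(fields)}")
            continue
        try:
            row, col = int(fields[0]), int(fields[1])
        except ValueError:
            problems.append(f"non-integer index: {fields[0]} {fields[1]}")
            continue
        if not (1 <= row <= n_rows and 1 <= col <= n_cols):
            problems.append(f"index out of range: ({row}, {col})")
        seen[(row, col)] += 1
    problems.extend(
        f"duplicate coordinate: {coord}" for coord, count in seen.items() if count > 1
    )
    return problems


# Toy file with one out-of-range index and one duplicated coordinate.
toy = [
    "%%MatrixMarket matrix coordinate real general",
    "3 2 3",
    "1 1 2",
    "4 1 1",
    "1 1 5",
]
for problem in validate_mtx_body(toy):
    print(problem)
```

To run it on a real file, something like `validate_mtx_body(open("input/matrix.mtx"))` should work, since the function only iterates over lines.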