A problem for batch effect removed data
MenghuaZhang86 opened this issue · 11 comments
Hello, I have tried TooManyCells in a single case(four samples from the same case), and the result looks very nice. So now I want to try it for multiple cases.
However, when I ran make-tree
with the output of cellranger
for 17 cases, I got a strange result. I guess it might be caused by the batch effect and I want to run make-tree
with batch effect corrected data. I would like to know:
- The batch effect corrected data is generated by
Harmony
, which is integrated intoSeurat
pipeline. The data is in.rds
format and there is no expression data inharmony
result. So in this case, I would like to know how to prepare the input data formake-tree
. - It looks that TooManyCells is memory-consuming. In my first running (four samples from the same case, about 16,000 cells), I did not meet any problem; but in my second running (17 cases, about 190,000 cells), I found that 300GB memory was not enough,
make-tree
halted without warning nor error message. I do not know if my guessing is correct, so I would like to know your comment.
Thank you very much!
Sorry for the delay, was busy and forgot to reply!
-
That's a bit tricky, as I only generally only use positive values. However, you can try to use Harmony by making sure not to filter cells and not using any normalization (
--normalization NoneNorm
). You can also try--shift-positive
as a last resort to translate the data to positive numbers. No guarantee that it will work, though. Without Harmony, you can also try different normalization procedures which may be the issue. -
Yes, TooManyCells is memory-consuming as it uses all features. If you use feature selection beforehand to reduce the number of features, you can significantly decrease the time (i.e. I have been able to run ~1 million cells in < 1 hour on our setup with CyTOF and IMC data).
Hi, @MenghuaZhang86 , I also have problems as to how to prepare the input matrix for TooManyCells after using Harmony to remove the batch effect. Have you figured out a way? Also, Is it possible to use the Harmony embedding data as the input for TooManyCells?
Thank you for your suggestions.
I still have some unclear points. I would appreciate it if you can tell me how to deal with them.
- There are two tables in
Harmony
result: one is cell_barcode(row)×harmony_cluster(column), another is gene×harmony_cluster. Maybe I can use the table(cell_barcode×harmony_cluster) as acsv
input formake-tree
, and turn off--filter-thresholds
,--normalization
, and--lsa
because all these steps have been finished before or during the running ofHarmony
, right? But in this case, gene expression data is not included, how can I add the gene expression to the result? - I notice another way to generate the tree by using
cluster-tree
and plot it bybirch-beer
. In this case, I can generate my similarity matrix utilizing theHarmony
result table(cell_barcode×harmony_cluster), right? However, I can not find the documentation ofcluster-tree
and there are about 190,000 cells in my data, so the similarity matrix will be very large. Also, I would like to know if I can add the gene expression data to the result in this method.
Thank you very much!
@yinlisssss I still have several problems with it. Let us wait for the reply and try later :)
I think with Harmony
you should be using the embeddings, correct? Then you have the cells with their transformed values. You can use the original gene expression matrix to plot gene expression afterwards.
@GregorySchwartz
Yes, I can use the harmony embeddings as input: the cell_barcode×harmony_embeddings
matrix. Another matrix is gene×harmony_embeddings
(sorry for the previous mistake).
I have tried the following script, but I do not know if --shift-positive
is necessary because I can get the result.
docker run --rm -v /data:/data --memory=300g --name toomanycells gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
--matrix-path /data/TooManyCells/harmony.csv \
--labels-file /data/TooManyCells/labels.csv \
--draw-collection "PieChart" \
--output /data/TooManyCells/out_harmony \
--dendrogram-output out.pdf \
> /data/TooManyCells/out_clusters.csv
BTW, would you tell me how to plot the gene expression? I can check the result by checking the expression data.
Thank you for your great help!
Hopefully it looks good as I am not sure with the altered embeddings, but hey if it work then great!
For gene expression, you can use the gene expression matrix and it should match the cell barcodes with the prior tree structure (with --prior
).
I successfully plotted the gene expression by the following script:
docker run --rm -v /data:/data --memory=300g --name toomanycells gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
--prior /data/TooManyCells/out_harmony \
--matrix-path /data/TooManyCells/normalized_exp.csv \
--labels-file /data/TooManyCells/labels.csv \
--draw-leaf "DrawItem (DrawContinuous [\"CD4\"])" \
--output /data/TooManyCells/out_harmony_exp \
--dendrogram-output CD4.pdf \
> /data/TooManyCells/out_clusters.csv
However, I have to note that I failed to create a complete expression csv
file because it is too large to generate such a big matrix in R
, so I just put several genes for result check into the csv
file.
I also tried to create cellranger
output-like files, matrix.mtx
, features.tsv
, and barcodes.tsv
. Each file contains the information of all samples. But when I run the above script with --matrix-path /data/TooManyCells/harmony_exp_input
, the directory contains cellranger
output-like files, I got the following error:
too-many-cells: readMatrix : File parse error: Failed reading: takeWhile1
So I guess this method does not work here, and I can not apply the following analysis, such as differential gene in TooManyCells, because I can only input expression data of selected genes.
I do not think this is a big problem, because I can do differential expression gene analysis in other software such as Seurat
, if I can get the list of cell barcodes in each leaf. Therefore I would like to know:
- Would you tell me how to get the list of cell barcodes in each leaf? In other words, how to cut the tree?
- The expression of several cell markers shows that the result is generally OK, but the expression of some ambiguous genes makes me a bit confusing. I do not know if this was caused by the absence of
--shift-positive
in my first running to get the tree because there are negative values inHarmony embeddings
(please see my script posted in the previous discussion).
Thank you very much!
@yinlisssss I hope my description can help you.
The error you received tells me that there may be something weird in the /data/TooManyCells/harmony_exp_input
folder. Is it a matrix or the TooManyCells output? I assume the parse error may also be because you have features.tsv
instead of genes.tsv
, as newer versions of cellranger make a features.tsv.gz
file instead (so you mixed together output formats).
-
You can grep, for instance, the
out_clusters.csv
file (or thenode_info.csv
output file for any node) for the node of interest to get the list of barcodes from a node. -
I don't know what your input data is or what the output tree is, so I don't know what you mean by this.
Thank you very much for your patient explanation!
What I have done is: first, run TooManyCells with a CSV
file of the cell_barcodes(row)×harmony_embeddings(column)
matrix to make the tree. Second, plot gene expression by another CSV
file which contains expression data of only selected genes because it is too large to make the complete expression matrix in R
. (You can find my scripts in my previous posts in this issue)
In the second step, I also tried to make cellranger
output-like files as TooManyCells input. I have changed features.tsv
into genes.tsv
, but it still does not work. I am not sure if there is any mismatching among matrix.mtx
, genes.tsv
, and barcodes.tsv
because I just extracted them from a Seurat
object which contains the information of all samples. The normalized, batch effect removed expression data in matrix.mtx
is written in Matrix Market Exchange Formats
directly.
The genes.tsv
contains two columns, both are gene symbols in the same order with the row names of the Seurat
object.
Similarly, the barcodes.tsv
contains all cell barcodes in the same order as the column names of the Seurat
object.
In the second question I mentioned in previous post, I want to know if negative values affect the generation of the tree because the input CSV
file of the cell_barcode×harmony_embeddings
matrix in the first step contains negative values. Why I want to know this is because the usage of TooManyCells mentioned:
If you do upstream batch effect correction, LSA, normalization, or anything else, be sure to use --normalization NoneNorm (and --shift-positive for LSA) to avoid wrong filters and scalings
When you say that it "still does not work", do you mean you get the same error? Did you try using a CSV as input instead of the cellranger input? With regards to negative values, in theory it should work but I am unsure if there may be circumstances that would mess up the spectral clustering and / or modularity calculation.
Thank you very much for your patient explanation!
What I have done is: first, run TooManyCells with a
CSV
file of thecell_barcodes(row)×harmony_embeddings(column)
matrix to make the tree. Second, plot gene expression by anotherCSV
file which contains expression data of only selected genes because it is too large to make the complete expression matrix inR
. (You can find my scripts in my previous posts in this issue)In the second step, I also tried to make
cellranger
output-like files as TooManyCells input. I have changedfeatures.tsv
intogenes.tsv
, but it still does not work. I am not sure if there is any mismatching amongmatrix.mtx
,genes.tsv
, andbarcodes.tsv
because I just extracted them from aSeurat
object which contains the information of all samples. The normalized, batch effect removed expression data inmatrix.mtx
is written inMatrix Market Exchange Formats
directly. Thegenes.tsv
contains two columns, both are gene symbols in the same order with the row names of theSeurat
object. Similarly, thebarcodes.tsv
contains all cell barcodes in the same order as the column names of theSeurat
object.In the second question I mentioned in previous post, I want to know if negative values affect the generation of the tree because the input
CSV
file of thecell_barcode×harmony_embeddings
matrix in the first step contains negative values. Why I want to know this is because the usage of TooManyCells mentioned:If you do upstream batch effect correction, LSA, normalization, or anything else, be sure to use --normalization NoneNorm (and --shift-positive for LSA) to avoid wrong filters and scalings
Hi, Zhang
Can you send me the harmony.csv file? Are the harmony_embeddings(column) you referred to in harmony tuned PC or the embeddings in UMAP space?
Thanks a lot.