GregorySchwartz/too-many-cells

A problem for batch effect removed data

MenghuaZhang86 opened this issue · 11 comments

Hello, I have tried TooManyCells in a single case(four samples from the same case), and the result looks very nice. So now I want to try it for multiple cases.
However, when I ran make-tree with the output of cellranger for 17 cases, I got a strange result. I guess it might be caused by the batch effect and I want to run make-tree with batch effect corrected data. I would like to know:

  1. The batch effect corrected data is generated by Harmony, which is integrated into Seurat pipeline. The data is in .rds format and there is no expression data in harmony result. So in this case, I would like to know how to prepare the input data for make-tree.
  2. It looks that TooManyCells is memory-consuming. In my first running (four samples from the same case, about 16,000 cells), I did not meet any problem; but in my second running (17 cases, about 190,000 cells), I found that 300GB memory was not enough, make-tree halted without warning nor error message. I do not know if my guessing is correct, so I would like to know your comment.

Thank you very much!

Sorry for the delay, was busy and forgot to reply!

  1. That's a bit tricky, as I only generally only use positive values. However, you can try to use Harmony by making sure not to filter cells and not using any normalization (--normalization NoneNorm). You can also try --shift-positive as a last resort to translate the data to positive numbers. No guarantee that it will work, though. Without Harmony, you can also try different normalization procedures which may be the issue.

  2. Yes, TooManyCells is memory-consuming as it uses all features. If you use feature selection beforehand to reduce the number of features, you can significantly decrease the time (i.e. I have been able to run ~1 million cells in < 1 hour on our setup with CyTOF and IMC data).

Hi, @MenghuaZhang86 , I also have problems as to how to prepare the input matrix for TooManyCells after using Harmony to remove the batch effect. Have you figured out a way? Also, Is it possible to use the Harmony embedding data as the input for TooManyCells?

Hi @GregorySchwartz

Thank you for your suggestions.
I still have some unclear points. I would appreciate it if you can tell me how to deal with them.

  1. There are two tables in Harmony result: one is cell_barcode(row)×harmony_cluster(column), another is gene×harmony_cluster. Maybe I can use the table(cell_barcode×harmony_cluster) as a csv input for make-tree, and turn off --filter-thresholds, --normalization, and --lsa because all these steps have been finished before or during the running of Harmony, right? But in this case, gene expression data is not included, how can I add the gene expression to the result?
  2. I notice another way to generate the tree by using cluster-tree and plot it by birch-beer. In this case, I can generate my similarity matrix utilizing the Harmony result table(cell_barcode×harmony_cluster), right? However, I can not find the documentation of cluster-tree and there are about 190,000 cells in my data, so the similarity matrix will be very large. Also, I would like to know if I can add the gene expression data to the result in this method.

Thank you very much!

@yinlisssss I still have several problems with it. Let us wait for the reply and try later :)

I think with Harmony you should be using the embeddings, correct? Then you have the cells with their transformed values. You can use the original gene expression matrix to plot gene expression afterwards.

@GregorySchwartz
Yes, I can use the harmony embeddings as input: the cell_barcode×harmony_embeddings matrix. Another matrix is gene×harmony_embeddings (sorry for the previous mistake).

I have tried the following script, but I do not know if --shift-positive is necessary because I can get the result.

docker run --rm -v /data:/data --memory=300g --name toomanycells gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
   --matrix-path /data/TooManyCells/harmony.csv \
   --labels-file /data/TooManyCells/labels.csv \
   --draw-collection "PieChart" \
   --output /data/TooManyCells/out_harmony \
   --dendrogram-output out.pdf \
   > /data/TooManyCells/out_clusters.csv

BTW, would you tell me how to plot the gene expression? I can check the result by checking the expression data.

Thank you for your great help!

Hopefully it looks good as I am not sure with the altered embeddings, but hey if it work then great!

For gene expression, you can use the gene expression matrix and it should match the cell barcodes with the prior tree structure (with --prior).

Hi @GregorySchwartz

I successfully plotted the gene expression by the following script:

docker run --rm -v /data:/data --memory=300g --name toomanycells gregoryschwartz/too-many-cells:2.0.0.0 make-tree \
	--prior /data/TooManyCells/out_harmony \
	--matrix-path /data/TooManyCells/normalized_exp.csv \
	--labels-file /data/TooManyCells/labels.csv \
	--draw-leaf "DrawItem (DrawContinuous [\"CD4\"])" \
	--output /data/TooManyCells/out_harmony_exp \
	--dendrogram-output CD4.pdf \
	> /data/TooManyCells/out_clusters.csv

However, I have to note that I failed to create a complete expression csv file because it is too large to generate such a big matrix in R, so I just put several genes for result check into the csv file.
I also tried to create cellranger output-like files, matrix.mtx, features.tsv, and barcodes.tsv. Each file contains the information of all samples. But when I run the above script with --matrix-path /data/TooManyCells/harmony_exp_input, the directory contains cellranger output-like files, I got the following error:
too-many-cells: readMatrix : File parse error: Failed reading: takeWhile1
So I guess this method does not work here, and I can not apply the following analysis, such as differential gene in TooManyCells, because I can only input expression data of selected genes.

I do not think this is a big problem, because I can do differential expression gene analysis in other software such as Seurat, if I can get the list of cell barcodes in each leaf. Therefore I would like to know:

  1. Would you tell me how to get the list of cell barcodes in each leaf? In other words, how to cut the tree?
  2. The expression of several cell markers shows that the result is generally OK, but the expression of some ambiguous genes makes me a bit confusing. I do not know if this was caused by the absence of --shift-positive in my first running to get the tree because there are negative values in Harmony embeddings(please see my script posted in the previous discussion).

Thank you very much!

@yinlisssss I hope my description can help you.

The error you received tells me that there may be something weird in the /data/TooManyCells/harmony_exp_input folder. Is it a matrix or the TooManyCells output? I assume the parse error may also be because you have features.tsv instead of genes.tsv, as newer versions of cellranger make a features.tsv.gz file instead (so you mixed together output formats).

  1. You can grep, for instance, the out_clusters.csv file (or the node_info.csv output file for any node) for the node of interest to get the list of barcodes from a node.

  2. I don't know what your input data is or what the output tree is, so I don't know what you mean by this.

Thank you very much for your patient explanation!

What I have done is: first, run TooManyCells with a CSV file of the cell_barcodes(row)×harmony_embeddings(column) matrix to make the tree. Second, plot gene expression by another CSV file which contains expression data of only selected genes because it is too large to make the complete expression matrix in R. (You can find my scripts in my previous posts in this issue)

In the second step, I also tried to make cellranger output-like files as TooManyCells input. I have changed features.tsv into genes.tsv, but it still does not work. I am not sure if there is any mismatching among matrix.mtx, genes.tsv, and barcodes.tsv because I just extracted them from a Seurat object which contains the information of all samples. The normalized, batch effect removed expression data in matrix.mtx is written in Matrix Market Exchange Formats directly.
Screen Shot 0003-09-10 at 20 11 12
The genes.tsv contains two columns, both are gene symbols in the same order with the row names of the Seurat object.
genes
Similarly, the barcodes.tsv contains all cell barcodes in the same order as the column names of the Seurat object.
Screen Shot 0003-09-10 at 20 11 38

In the second question I mentioned in previous post, I want to know if negative values affect the generation of the tree because the input CSV file of the cell_barcode×harmony_embeddings matrix in the first step contains negative values. Why I want to know this is because the usage of TooManyCells mentioned:

If you do upstream batch effect correction, LSA, normalization, or anything else, be sure to use --normalization NoneNorm (and --shift-positive for LSA) to avoid wrong filters and scalings

When you say that it "still does not work", do you mean you get the same error? Did you try using a CSV as input instead of the cellranger input? With regards to negative values, in theory it should work but I am unsure if there may be circumstances that would mess up the spectral clustering and / or modularity calculation.

Thank you very much for your patient explanation!

What I have done is: first, run TooManyCells with a CSV file of the cell_barcodes(row)×harmony_embeddings(column) matrix to make the tree. Second, plot gene expression by another CSV file which contains expression data of only selected genes because it is too large to make the complete expression matrix in R. (You can find my scripts in my previous posts in this issue)

In the second step, I also tried to make cellranger output-like files as TooManyCells input. I have changed features.tsv into genes.tsv, but it still does not work. I am not sure if there is any mismatching among matrix.mtx, genes.tsv, and barcodes.tsv because I just extracted them from a Seurat object which contains the information of all samples. The normalized, batch effect removed expression data in matrix.mtx is written in Matrix Market Exchange Formats directly. Screen Shot 0003-09-10 at 20 11 12 The genes.tsv contains two columns, both are gene symbols in the same order with the row names of the Seurat object. genes Similarly, the barcodes.tsv contains all cell barcodes in the same order as the column names of the Seurat object. Screen Shot 0003-09-10 at 20 11 38

In the second question I mentioned in previous post, I want to know if negative values affect the generation of the tree because the input CSV file of the cell_barcode×harmony_embeddings matrix in the first step contains negative values. Why I want to know this is because the usage of TooManyCells mentioned:

If you do upstream batch effect correction, LSA, normalization, or anything else, be sure to use --normalization NoneNorm (and --shift-positive for LSA) to avoid wrong filters and scalings

Hi, Zhang
Can you send me the harmony.csv file? Are the harmony_embeddings(column) you referred to in harmony tuned PC or the embeddings in UMAP space?
Thanks a lot.