
Do we need to correct the batch effects of the given datasets?

Hi, thanks for your great work. I wonder if we need to correct the batch effects of these spatial transcriptomic data or not. Thanks a lot!

Hi, it depends on what you want to do with HEST data. What's your use case?

I am interested in the Visium data only. Thanks.

The Visium data integrated into HEST-1k are very diverse: two species (mouse and human), and multiple diseases and organs. Batch effect correction should be applied whenever there is some guarantee that it won't significantly alter the biological signal.

To give a better answer, I need a better understanding of your problem statement, e.g., multimodal representation learning, ST prediction from H&E, characterization of morphological correlates of expression changes, etc.

If you want to explore batch effect, we implemented 2 core functions:

  • Batch effect visualization, here, which produces a UMAP visualization of the expression of housekeeping genes (i.e., stable genes) in the stromal region. The function takes as input the set of Visium samples that you want to compare.
  • Batch effect correction, here, which can correct batch effects using MNN, Harmony, or ComBat. The output of each method is different, e.g., Harmony creates a new latent space, so its output can no longer be interpreted as gene counts (this may or may not be an issue for your problem statement).
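
To make the trade-off concrete, here is a minimal NumPy sketch of a per-batch location-scale adjustment. It is not the HEST implementation, and it is not ComBat, MNN, or Harmony (ComBat, for instance, adds empirical Bayes shrinkage on top of a similar idea). The function name `center_scale_per_batch` and the synthetic data are assumptions made up for illustration.

```python
import numpy as np

def center_scale_per_batch(log_expr, batch_labels, eps=1e-8):
    """Align each batch's per-gene mean and std to the pooled mean and std.

    log_expr: (n_spots, n_genes) log-normalized expression (hypothetical input).
    batch_labels: (n_spots,) batch identifier for every spot.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    batch_labels = np.asarray(batch_labels)
    pooled_mean = log_expr.mean(axis=0)
    pooled_std = log_expr.std(axis=0)
    corrected = np.empty_like(log_expr)
    for batch in np.unique(batch_labels):
        mask = batch_labels == batch
        batch_mean = log_expr[mask].mean(axis=0)
        batch_std = log_expr[mask].std(axis=0)
        # Standardize within the batch, then map back to the pooled scale.
        z = (log_expr[mask] - batch_mean) / (batch_std + eps)
        corrected[mask] = z * pooled_std + pooled_mean
    return corrected

# Synthetic example: two batches with different per-gene location and scale.
rng = np.random.default_rng(0)
batch_a = rng.normal(loc=1.0, scale=0.5, size=(100, 5))
batch_b = rng.normal(loc=3.0, scale=1.0, size=(120, 5))
log_expr = np.vstack([batch_a, batch_b])
batch_labels = np.array(["A"] * 100 + ["B"] * 120)

corrected = center_scale_per_batch(log_expr, batch_labels)
gap_before = np.abs(batch_a.mean(axis=0) - batch_b.mean(axis=0)).max()
gap_after = np.abs(corrected[:100].mean(axis=0) - corrected[100:].mean(axis=0)).max()
print(f"largest per-gene mean gap: {gap_before:.3f} -> {gap_after:.3g}")
```

Note that this forces every batch to the same per-gene mean, so any real difference between batches (different organ, disease, or species) is erased along with the technical one. That is exactly the risk to the biological signal mentioned above.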

Thanks! I will take a look at it!

@HelloWorldLTY, feel free to document any findings on this GitHub issue.

Related to this, I am noticing fairly strong batch effects by sample of origin in the H&E patch embeddings from Visium data, even within the same tissue and disease. Is this to be expected, or am I missing a key pre-processing step? I am loading the patches using an H5HESTDataset object and applying only the model-specific eval_transforms (which generally appear to be resizing and ImageNet normalization).
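
For context, ImageNet normalization is a fixed per-channel affine transform, so it cannot remove slide-to-slide colour differences on its own. A small NumPy sketch, assuming the standard ImageNet mean/std values and two made-up flat-colour "patches" (the function name `imagenet_normalize` is hypothetical, not part of HEST):

```python
import numpy as np

# Standard ImageNet channel statistics for RGB images scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def imagenet_normalize(patch_uint8):
    """Map an (H, W, 3) uint8 RGB patch to per-channel normalized floats."""
    scaled = patch_uint8.astype(np.float64) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# Two synthetic patches standing in for the same tissue stained on two slides.
slide_a = np.full((4, 4, 3), (200, 120, 180), dtype=np.uint8)
slide_b = np.full((4, 4, 3), (170, 100, 200), dtype=np.uint8)

# Mean colour difference before and after normalization.
diff_raw = (slide_a.astype(float) - slide_b.astype(float)).mean(axis=(0, 1)) / 255.0
diff_norm = (imagenet_normalize(slide_a) - imagenet_normalize(slide_b)).mean(axis=(0, 1))
print("raw difference:       ", diff_raw)
print("normalized difference:", diff_norm)
```

The difference between the two patches is only rescaled by the channel standard deviations, not removed, so the encoder still sees the staining shift.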

Batch effects in the H&E images exist. Which patch encoder are you using?

I see this with both the GigaPath and UNI encoders.

In my experience, CONCH is less sensitive to staining variations. Also, keep in mind that the image latent space can express staining variations while still encoding all the relevant biological signal. Depending on the downstream task, this may not be critical.

I see. Are there any ways to correct for the staining variations with preprocessing/normalization? It seems that Harmony can remove some of the image batch effects from the embeddings, but not all.

Many approaches exist for stain normalization in computational pathology, e.g., Macenko or Vahadane normalization. However, these can also alter the biological signal from the image. I'd need to better understand your problem statement to provide a more informed answer.
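
As a rough illustration of what Macenko-style normalization does, here is a simplified NumPy sketch that estimates a two-stain basis in optical-density space for a source and a target patch, then re-renders the source with the target's basis. This is a sketch under assumptions, not a reference implementation: the function names, the parameter defaults (`alpha`, `beta`, `io`), and the synthetic patches are all chosen here for illustration, and it assumes each patch contains enough stained pixels.

```python
import numpy as np

def estimate_stains(rgb, io=255.0, alpha=1.0, beta=0.15):
    """Estimate a 3x2 stain matrix, per-pixel concentrations, and robust maxima."""
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    od = -np.log(np.maximum(pixels, 1.0) / io)        # optical density
    od_tissue = od[~np.any(od < beta, axis=1)]        # drop near-transparent pixels
    # Plane spanned by the two leading eigenvectors of the OD covariance.
    _, eigvecs = np.linalg.eigh(np.cov(od_tissue.T))
    plane = eigvecs[:, [2, 1]]
    # Fix the eigenvector signs so the mean OD projects positively.
    plane = plane * np.where(od_tissue.mean(axis=0) @ plane < 0, -1.0, 1.0)
    proj = od_tissue @ plane
    phi = np.arctan2(proj[:, 1], proj[:, 0])
    lo, hi = np.percentile(phi, [alpha, 100 - alpha])  # robust angular extremes
    v_lo = plane @ np.array([np.cos(lo), np.sin(lo)])
    v_hi = plane @ np.array([np.cos(hi), np.sin(hi)])
    # Heuristic ordering: hematoxylin absorbs more red light than eosin.
    ordered = [v_lo, v_hi] if v_lo[0] > v_hi[0] else [v_hi, v_lo]
    stains = np.column_stack(ordered)
    conc = np.linalg.lstsq(stains, od.T, rcond=None)[0]  # shape (2, n_pixels)
    max_conc = np.percentile(conc, 99, axis=1)
    return stains, conc, max_conc

def macenko_normalize(source, target, io=255.0):
    """Re-render `source` using the stain basis and intensity range of `target`."""
    _, src_conc, src_max = estimate_stains(source, io)
    tgt_stains, _, tgt_max = estimate_stains(target, io)
    scaled = src_conc * (tgt_max / src_max)[:, None]
    out = io * np.exp(-(tgt_stains @ scaled))
    return np.clip(np.rint(out.T), 0, 255).reshape(source.shape).astype(np.uint8)

def synth_patch(h_vec, e_vec, rng, size=48):
    """Synthetic patch mixing two stains with random concentrations (toy data)."""
    h = np.asarray(h_vec, dtype=float)
    e = np.asarray(e_vec, dtype=float)
    stains = np.column_stack([h / np.linalg.norm(h), e / np.linalg.norm(e)])
    conc = rng.uniform(0.1, 1.0, size=(2, size * size))
    rgb = 255.0 * np.exp(-(stains @ conc))
    return np.rint(rgb).T.reshape(size, size, 3).astype(np.uint8)

rng = np.random.default_rng(0)
target = synth_patch([0.65, 0.70, 0.29], [0.07, 0.99, 0.11], rng)
source = synth_patch([0.55, 0.75, 0.36], [0.20, 0.93, 0.30], rng)
normalized = macenko_normalize(source, target)
print(normalized.shape, normalized.dtype)
```

Because every pixel is projected onto a two-stain model and rescaled, any colour information that does not fit that model is discarded, which is one way such methods can alter the biological signal.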

Got it! We were interested in predicting gene expression from the patch embeddings, but it seems from what you're saying that batch effect correction can hurt more than help for this task.

In HEST-Benchmark we didn't apply additional corrections. I'm sure that performance can be improved. But the big unknown becomes how to ensure good generalization.
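
One way to probe the generalization question is to hold out whole samples rather than random patches, so that any correction step is judged on a slide it has never seen. The sketch below uses closed-form ridge regression on synthetic embeddings with a per-sample offset. It is an assumption-laden illustration, not the HEST-Benchmark protocol; `ridge_fit`, `leave_one_sample_out`, and all data are hypothetical.

```python
import numpy as np

def ridge_fit(x, y, lam=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - x_mean, y - y_mean
    w = np.linalg.solve(xc.T @ xc + lam * np.eye(x.shape[1]), xc.T @ yc)
    return w, x_mean, y_mean

def leave_one_sample_out(embeddings, expression, sample_ids, lam=1.0):
    """Return the mean per-gene Pearson r on each held-out sample."""
    scores = {}
    for held_out in np.unique(sample_ids):
        test = sample_ids == held_out
        w, x_mean, y_mean = ridge_fit(embeddings[~test], expression[~test], lam)
        pred = (embeddings[test] - x_mean) @ w + y_mean
        truth = expression[test]
        r = [np.corrcoef(pred[:, g], truth[:, g])[0, 1] for g in range(truth.shape[1])]
        scores[str(held_out)] = float(np.mean(r))
    return scores

# Synthetic data: a shared linear signal plus a sample-level shift in the embeddings.
rng = np.random.default_rng(0)
n_per_sample, dim, n_genes = 200, 8, 4
true_w = rng.normal(size=(dim, n_genes))
emb, expr, ids = [], [], []
for name in ["s1", "s2", "s3"]:
    bio = rng.normal(size=(n_per_sample, dim))
    offset = rng.normal(scale=1.0, size=dim)  # stand-in for a staining batch effect
    emb.append(bio + offset)
    expr.append(bio @ true_w + rng.normal(scale=0.1, size=(n_per_sample, n_genes)))
    ids += [name] * n_per_sample
embeddings, expression, sample_ids = np.vstack(emb), np.vstack(expr), np.array(ids)

scores = leave_one_sample_out(embeddings, expression, sample_ids)
print(scores)
```

Running the same loop with and without a correction step (stain normalization, Harmony on the embeddings, etc.) gives a like-for-like comparison of whether the correction actually helps on unseen samples.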

Okay got it, thank you for the information!