Swarbricklab-code/BrCa_cell_atlas

Question about Pseudobulk


@dlroden
Hi,

In issue #5, you mentioned: "For the Pseudobulk, we just summed up all the reads for each gene across all cells".
I'm not sure whether you used the sum or the average across the cells of each sample. Wouldn't the difference in cell numbers between samples influence the result? For example, if sample A contains more cells than sample B, the pseudobulk for A might contain more UMIs per gene than B because of the number of cells rather than real expression differences.
If you used the sum, what did you do afterwards to remove this effect?

Thanks a lot

Hi,
Thanks for your query.

Our aim was to make the pseudobulk as close as possible to a true bulk sample. We therefore took all the sequenced reads from a sample, without adjusting those reads for cell populations. We have also found that using the raw R2 FASTQ files as input to a bulk RNA-seq pipeline (i.e., without any UMI/cell information) gives results comparable to the count summation method. We haven't compared either approach with averaging across the cells in each sample.
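
For illustration, the difference between summing and averaging can be sketched as below. This is a minimal sketch in Python with NumPy, not the code used for the atlas; the function name `pseudobulk`, the toy count matrix, and the sample labels are all made up for the example.

```python
import numpy as np

def pseudobulk(counts, sample_ids, how="sum"):
    """Collapse a genes x cells count matrix into a genes x samples matrix.

    `how="sum"` adds the counts of all cells in a sample (count summation);
    `how="mean"` averages them instead. Hypothetical helper for illustration.
    """
    counts = np.asarray(counts)
    sample_ids = np.asarray(sample_ids)
    samples = sorted(set(sample_ids.tolist()))
    columns = []
    for sample in samples:
        block = counts[:, sample_ids == sample]  # cells belonging to this sample
        columns.append(block.sum(axis=1) if how == "sum" else block.mean(axis=1))
    return samples, np.column_stack(columns)

# Toy data: 3 genes x 5 cells; sample A has 3 cells, sample B has 2.
counts = np.array([[1, 2, 3, 4, 0],
                   [0, 0, 1, 2, 2],
                   [5, 5, 5, 5, 5]])
sample_ids = ["A", "A", "A", "B", "B"]

samples, summed = pseudobulk(counts, sample_ids, how="sum")
_, averaged = pseudobulk(counts, sample_ids, how="mean")
```

In this toy example the third gene has the same count in every cell, so its summed value is larger in A (15) than in B (10) purely because A has more cells, while the averaged value is 5 in both. The summed matrix therefore behaves like a bulk library whose depth depends on the sample, which is the property a later normalisation step has to handle.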

When integrating with other bulk datasets (e.g., TCGA) we used quantile normalisation (see description here: #6 (comment)).
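
As a rough sketch of the general idea behind quantile normalisation (the exact procedure is the one described in the linked comment, which is not reproduced here), the textbook version forces every sample to share the same distribution of values. The function name `quantile_normalise` and the toy matrix are assumptions for illustration, and ties are broken by row order rather than averaged.

```python
import numpy as np

def quantile_normalise(x):
    """Textbook quantile normalisation of a genes x samples matrix.

    Each column is replaced by the mean sorted profile across columns,
    assigned back according to the rank of each value within its column.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")      # row indices in sorted order
    ranks = np.argsort(order, axis=0, kind="stable")  # rank of each entry in its column
    reference = np.sort(x, axis=0).mean(axis=1)       # mean value at each rank
    return reference[ranks]

# Toy matrix: 4 genes x 3 samples (e.g. pseudobulk and bulk columns side by side).
expr = np.array([[5, 4, 3],
                 [2, 1, 4],
                 [3, 5, 6],
                 [4, 2, 8]])
normalised = quantile_normalise(expr)
```

After this step every column contains the same set of values, so differences in overall depth between samples (for example, from different cell numbers in a summed pseudobulk) no longer shift one sample's distribution relative to another.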

Hope this helps.