sanderlab/scPerturb

JoungZhang2023 Controls

Closed this issue · 1 comments

Hello @stefanpeidli,

The JoungZhang2023 dataset, i.e. both the atlas and combinatorial files, does not seem to have its control samples identified.
You already pointed this out in cell 15 here: ../tree/master/dataset_processing/notebooks/JoungZhang2023.ipynb

For the combinatorial case, there are samples that have the GFP ORFs identified, but still no GFP-only or mCherry-only samples.

The paper's supplementary material, specifically Table 1, sheet 1, lists the 3548 transcription factor ORF IDs and their corresponding gene and transcript names/IDs, as well as the GFP and mCherry controls: https://www.cell.com/cell/fulltext/S0092-8674(22)01470-2?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867422014702%3Fshowall%3Dtrue#supplementaryMaterial

However, in the JoungZhang2023_atlas.h5ad file, the perturbation column ranges from '0' to '3368', i.e. at most 3369 distinct values, which is fewer than the 3548 ORFs in the table, so some TF ORFs must be missing.
Interestingly, when I count the samples by 'perturbation' in the atlas data, '3368' and '3367' are the ones with the most cells by a wide margin.

joung_atlas.obs['perturbation'].value_counts()
Out[5]: 
perturbation
3368    78274
3367    49539
591     13939
854      7559
260      7270
        ...  
2284        1
502         1
1644        1
1642        1
499         1

This hints that these two perturbations may actually correspond to the GFP and mCherry controls, and that the IDs were renumbered in a way that no longer allows mapping them back to the TF ORFs.

I also checked the study's own GitHub page, but didn't find anything useful: https://github.com/fengzhanglab/Joung_TFAtlas_Manuscript/tree/main

Any ideas on how to get the correct perturbation names and identify the controls? The .obs.index seems to hold the cell IDs, so perhaps there's another resource, e.g. on GEO, that we can use to map each cell to the TF identified in it, but I really don't know where to begin.

Thanks!

I think I figured this out. The subsample file at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE217460 has the correct annotations. To annotate the full atlas, I did the following:

  1. From the subsample file, extract the pairs of cell ID (.obs.index) and TF ID (.obs['TF']).
  2. Merge these pairs with the atlas data on cell ID.
  3. Each merged cell now has both its atlas perturbation value and its TF from step 1, so build a dict from perturbation to TF.
  4. Use the dict to propagate the TF labels to the remaining atlas cells, which are not in the subsample.
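
For anyone who wants to reproduce this, here is a minimal sketch of the four steps. It is not the exact code I ran: it works on plain dicts of cell ID to label, the function names and the toy values are made up for illustration, and the majority vote for a perturbation code that maps to more than one TF name is my own assumption about how to handle conflicts.

```python
from collections import Counter, defaultdict


def build_code_to_tf(atlas_codes, subsample_tfs):
    """Steps 1-3: derive a perturbation-code -> TF-name dict from shared cells.

    atlas_codes:   {cell_id: perturbation code} for the full atlas
    subsample_tfs: {cell_id: TF name} for the annotated subsample
    """
    votes = defaultdict(Counter)
    for cell_id, tf in subsample_tfs.items():
        code = atlas_codes.get(cell_id)
        if code is not None:  # cell is present in both files
            votes[code][tf] += 1
    # Assumption: if a code ever maps to several names, keep the most common one.
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}


def propagate(atlas_codes, code_to_tf):
    """Step 4: label every atlas cell; codes with no known name become None."""
    return {cell: code_to_tf.get(code) for cell, code in atlas_codes.items()}


# Toy data standing in for the real .obs columns (all values are made up).
atlas_codes = {"c1": "3368", "c2": "3368", "c3": "591", "c4": "591", "c5": "77"}
subsample_tfs = {"c1": "TF_A", "c3": "TF_B"}

code_to_tf = build_code_to_tf(atlas_codes, subsample_tfs)
labels = propagate(atlas_codes, code_to_tf)
unmatched = [cell for cell, tf in labels.items() if tf is None]
```

With the real files, the two input dicts could be taken from the AnnData objects with something like atlas.obs['perturbation'].to_dict() and subsample.obs['TF'].to_dict(), assuming both .obs indexes use the same cell IDs.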

With this method, only 146 cells across 103 perturbations will remain unmatched, which we can ignore.
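
To check the unmatched counts on your own copy of the data, something along these lines should work. Again this is only a sketch under assumptions: summarize_unmatched is a hypothetical helper, the inputs are plain dicts (cell ID to perturbation code, and perturbation code to TF name), and the toy values are made up.

```python
from collections import Counter


def summarize_unmatched(atlas_codes, code_to_tf):
    """Count atlas cells whose perturbation code has no TF name.

    Returns (number of unmatched cells, number of unmatched codes, per-code counts).
    """
    per_code = Counter(
        code for code in atlas_codes.values() if code not in code_to_tf
    )
    return sum(per_code.values()), len(per_code), per_code


# Toy example (made-up values): codes "11" and "12" have no TF name.
atlas_codes = {"c1": "10", "c2": "10", "c3": "11", "c4": "12", "c5": "12"}
code_to_tf = {"10": "TF_A"}

n_cells, n_codes, per_code = summarize_unmatched(atlas_codes, code_to_tf)
```

If the per-code counts are all very small, as they were for me, dropping those cells should have little effect.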

If anyone is confused or needs help, you can contact me directly.

@stefanpeidli Let me know if you want me to open a pull request. This data source is useful, but only as useful as how well it's maintained.