JoungZhang2023 Controls
Closed this issue · 1 comments
Hello @stefanpeidli,
The JoungZhang2023 dataset, i.e. both the atlas and combinatorial files, does not seem to have its control samples identified.
You already pointed this out in cell 15 here: ../tree/master/dataset_processing/notebooks/JoungZhang2023.ipynb
For the combinatorial case, there are samples that have the GFP ORFs identified, but still no GFP-only or mCherry-only samples.
The paper's supplementary material, specifically Table 1, sheet 1, lists the 3548 transcription factor ORF IDs and their corresponding gene and transcript names/IDs, as well as the GFP and mCherry controls: https://www.cell.com/cell/fulltext/S0092-8674(22)01470-2?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867422014702%3Fshowall%3Dtrue#supplementaryMaterial
However, in the `JoungZhang2023_atlas.h5ad` file, the `perturbation` column ranges from '0' to '3368', so some TF ORFs must be missing.
Interestingly, when I count the samples by `perturbation` in the atlas data, '3368' and '3367' seem to be the ones with the most cells.
joung_atlas.obs['perturbation'].value_counts()
Out[5]:
perturbation
3368 78274
3367 49539
591 13939
854 7559
260 7270
...
2284 1
502 1
1644 1
1642 1
499 1
This hints that these two perturbations may actually correspond to GFP and mCherry controls, and that these numbers were moved around such that it's no longer possible to map back to the TF ORFs.
I also checked the study's own GitHub page, but didn't find anything useful: https://github.com/fengzhanglab/Joung_TFAtlas_Manuscript/tree/main
Any ideas on how to get the correct perturbation names and identify controls? The `.obs.index` seems to have the cell IDs, so perhaps there's another resource, e.g. on GEO, that we can use to map the cells to the TFs identified in each, but I really don't know where to begin.
Thanks!
I think I figured this out. The subsample file at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE217460 has the correct annotations. To annotate the full atlas, I did the following:
- From the subsample file, extract the cell ID (`.obs.index`) and TF ID (`.obs['TF']`) pairs.
- Map/merge these pairs with the atlas data based on cell IDs.
- Now we have a mapping of `perturbation` in the atlas and `TF` from the pairs from step 1, so create a dict from these.
- Use the dict to propagate the labels to the rest of the samples.
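The steps above can be sketched roughly as follows. This is a minimal illustration on plain dicts, not the exact code I ran: the function names, the toy cell IDs, and the toy labels are all hypothetical, and resolving conflicts by majority vote is an assumption on my part in case one atlas code ever pairs with more than one TF label.

```python
from collections import Counter, defaultdict


def build_code_to_tf(atlas_codes, subsample_tfs):
    """Map each atlas perturbation code to a TF label.

    atlas_codes:   dict of cell ID -> perturbation code (from the atlas)
    subsample_tfs: dict of cell ID -> TF label (from the subsample file)
    Assumption: if a code is paired with several TF labels, keep the most
    frequent one (majority vote).
    """
    votes = defaultdict(Counter)
    for cell_id, tf in subsample_tfs.items():
        code = atlas_codes.get(cell_id)
        if code is not None:  # only cells present in both files can vote
            votes[code][tf] += 1
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}


def propagate_labels(atlas_codes, code_to_tf, missing=None):
    """Assign a TF label to every atlas cell; unmapped codes get `missing`."""
    return {cell_id: code_to_tf.get(code, missing)
            for cell_id, code in atlas_codes.items()}


# Toy data (hypothetical IDs and labels, not taken from the real files).
atlas_codes = {"cell1": "7", "cell2": "7", "cell3": "8", "cell4": "8", "cell5": "9"}
subsample_tfs = {"cell1": "TF_A", "cell3": "TF_B"}

code_to_tf = build_code_to_tf(atlas_codes, subsample_tfs)
labels = propagate_labels(atlas_codes, code_to_tf)

# Cells whose perturbation code never appears in the subsample stay unlabeled.
unmatched = [cell_id for cell_id, label in labels.items() if label is None]
```

With the real AnnData objects, the two input dicts could presumably be built with `atlas.obs['perturbation'].to_dict()` and `subsample.obs['TF'].to_dict()`, and the result written back as a new `.obs` column.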
With this method, only 146 cells across 103 perturbations will remain unmatched, which we can ignore.
If anyone is confused or needs help, you can contact me directly.
@stefanpeidli Let me know if you want me to open a pull request. This data source is useful, but only as useful as how well it's maintained.