scAnalyzer

The source code for scAnalyzer (single and cell Analyzer). It performs clustering and differential analyses on spatial proteomics data.

Tutorial

I. Clustering

i. Perform hierarchical clustering

Agglomerative clustering performed using the schist nested model.
The schist nested model achieves a more fine-grained clustering as compared to Leiden or schist planted model, achieved over several clustering levels--thereby giving the user higher control over cluster analyses, and how many clusters to stop the clustering algorithm at.

Running the clustering algorithm:

Open command prompt/ terminal, then run:

python3 1_compute_list_of_essential_proteins_for_clustering.py <adata_pickle_path> --user_selected_cluster_level <user_selected_cluster_level>

The positional arguments are:

[1] adata_pickle_path                   Description: Specify path to anndata pickle data; type=str

The optional arguments are:

[1] --user_selected_cluster_level       Description: Specify cluster level to get essential proteins list from; type=str [read more below]

{any_integer_value, "complete"} accepted, default is 'complete':
    - If value is integer positive: That represents the cluster level
    - If value is integer negative: That represents the cluster number in reverse order (for example: -1 represents (n-1)-th cluster)
    - If value is 'complete': That represents the last (n-th) cluster. Please note that if your input is "complete", just type it in normally without inverted commas/ quotation marks -- the program is capable of reading strings directly without any quotation symbols.
    - For all other values (i. negative integers; ii. fractional values; iii. any strings other than "complete"; iv. integer value is out of range of [1, n] where n is the total number of clusters): the default is n.

The outputs are:

List of essential proteins per cluster per patient
Dotplot of potential marker genes vs. cell type clusters (example below for patient id: 1 of the TNBC MIBI dataset)

ii. Protein importance:

In this section, we compute and visualize importances of essential proteins that drive clustering results.

Running the protein importance computation:

Open command prompt/ terminal, then run:

python3 2_visualize_importances_of_essential_proteins_for_clustering.py <adata_pickle_path> <essential_proteins_per_patient_pickle_path>

The positional arguments are:

[1] adata_pickle_path                                       Description: Specify path to anndata pickle data; type=str
[2] essential_proteins_per_patient_pickle_path              Description: Specify path to essential proteins dictionary generated from the clustering (previous step); type=str

The outputs are:

Feature importance scores calculated using permutation importance on one-vs-all classifier for patient id: 1 in the TNBC MIBI dataset:
Feature importance scores calculated using Gini index on random forest (RF) classifier for patient id: 1 in the TNBC MIBI dataset:
Feature importance scores calculated using permutation importance on random forest (RF) classifier for patient id: 1 in the TNBC MIBI dataset:

II. Cluster analyses (deciding on the optimal cluster level)

i. Maximum bipartite matching between clusterings:

The similarity scores between clusterings have been obtained for all patients, and their distributions have been plotted against cluster level in the schist nested model.

Running the protein importance computation:

Open command prompt/ terminal, then run:

python3 3_plot_of_robustness_vs_intercluster_similarity.py <adata_pickle_path> --method <method>

The positional arguments are:

[1] adata_pickle_path                   Description: Specify path to anndata pickle data; type=str

The optional arguments are:

[1] --method                            Description: Specify cluster agreement method between clusterings; type=str; options={"top-down", "bottom-up"}; default='top-down' [read more below]

Note: The schist hierarchical clustering method is agglomerative, meaning bottom-up.

The outputs are:

Plot of robustness vs. inter-cluster similarity (example below for patient id: 1 of the TNBC MIBI dataset):

(A) Top-down cluster mapping:

(B) Bottom-up cluster mapping:

You can decide on the optimal cluster level based on these plots.
For our case study on the TNBC MIBI dataset dataset, we will just set optimal cluster level $n = N-1$, where $N$ is the highest cluster level in the schist agglomerative model, where there is just one cluster.

ii. Measure no. of clusters per cluster level:

This may further help some users decide how many cluster levels to set for the schist agglomerative clustering model.

Running the computation for no. of clusters per cluster level:

Open command prompt/ terminal, then run:

python3 4_plot_no_of_cluster_levels_vs_no_of_clusters_per_cluster_level.py <adata_pickle_path> <clusterings_patientLevel_dict_path>

The positional arguments are:

[1] adata_pickle_path                                                     Description: Specify path to anndata pickle data; type=str
[2] clusterings_patientLevel_dict_path                                    Specify path to clustering combinations calculated in the previous step (saved as 'clusterings_patientLevel_dict.pkl' in step **II.i. Maximum bipartite matching between clusterings**)

The outputs are:

Plot of no. of clusters per cluster level across all samples in the TNBC MIBI dataset:

III. Differential analyses (before cell type annotations)

i. Protein correlations

Generate protein correlation matrix.
Subsequently, perform MWU-test on protein correlation values between conditions.
Retain correlations with p-values<0.05 as important protein-protein correlations.

Obtaining the essential protein correlations:

Open command prompt/ terminal, then run:

python3 5_DA_protein_correlations.py <adata_pickle_path> <dependent_variable_name>

The positional arguments are:

[1] adata_pickle_path                   Description: Specify path to anndata pickle data; type=str
[2] dependent_variable_name             Description: Specify the name of the dependent variable; type=str
    
Note: The aforementioned variable <dependent_variable_name> should be present under observation metadata (obsm) of the anndata onject containing gene/ protein expression.

The outputs are:

Bar plot of significant protein correlation p-values

ii. Check multiple protein profiles

Count all cells containing all N-combinaions of proteins present above the threshold.
Subsequently, perform MWU-test on no. of cells protein profiles expressed in, across conditions.
Retain protein profiles with p-values<0.05 as significant protein profiles.

Obtaining the essential protein correlations:

Open command prompt/ terminal, then run:

python3 6_DA_multiple_protein_profiles.py <adata_pickle_path> <dependent_variable_name> --N <N> --threshold <threshold>

The positional arguments are:

[1] adata_pickle_path                   Description: Specify path to anndata pickle data; type=str
[2] dependent_variable_name             Description: Specify the name of the dependent variable; type=str
    
Note: The aforementioned variable <dependent_variable_name> should be present under observation metadata (obsm) of the anndata onject containing gene/ protein expression.

The optional arguments are:

[1] --N                         Description: Specify number of proteins to be analyzed within protein profile; type=int, default=2
[2] --threshold                 Description: Specify minimum value over which protein expressions are considered for further analyses; type=float, default=0.5

The outputs are:

Bar plot of significant, multiple protein coexpression p-values

Nothing found for N=2, threshold=0.5, across all patients in the TNBC MIBI dataset.

bionetslab/TCL-project

scAnalyzer

Tutorial

I. Clustering

i. Perform hierarchical clustering

Running the clustering algorithm:

ii. Protein importance:

Running the protein importance computation:

II. Cluster analyses (deciding on the optimal cluster level)

i. Maximum bipartite matching between clusterings:

Running the protein importance computation:

ii. Measure no. of clusters per cluster level:

Running the computation for no. of clusters per cluster level:

III. Differential analyses (before cell type annotations)

i. Protein correlations

Obtaining the essential protein correlations:

ii. Check multiple protein profiles

Obtaining the essential protein correlations:

Under development:

IV. Differential analyses (after cell type annotations)

i. Neighborhood enrichment, MWU-test

ii. Co-occurence, MWU-test

iii. Centrality scores

iv. Ripley's statistics

v. Spatial autocorrelation