/PCASA

PCASA is a tool for predicting the best gene combination targets of surface antigens by classification of malignant and non-malignant cells using single-cell RNA-seq data.

Primary LanguageJupyter NotebookOtherNOASSERTION

PCASA

PCASA (Prioritization of Combinatorial Cancer-Associated Surface Antigens) is a tool for predicting the best gene combination targets of surface antigens by classification of malignant and non-malignant cells using single-cell RNA-seq data.

Requirements

Required packages for RF (Random forest) in R.

Required modules for CNN (Convolutional neural network) in Python.

  • Python (>= 3.6.9): https://www.python.org/downloads/source/
  • pip3 (>= 21.2.4), numpy (>= 1.19.5), matplotlib (>= 3.3.4), pandas (>= 1.1.5), scikit-learn (>= 0.24.1), keras (>= 2.6.0), tensorflow (>= 2.6.0), tf-keras-vis (>= 0.8.0)

Running codes

Overview

PCASA is a three-step program.

  • Step 1. Prioritization of the single genes well-classifying tumor and normal cells.
  • Step 2. Calculation of the expressing cell fraction (ECF) for each gene combination.
  • Step 3. Prioritization of the gene combinations well-classifying tumor and normal cells.

Input files

The first command for each step requires input data.

  • For step 1, a tab-delimited file containing cell-by-gene sparse matrix composed of log-transformed counts with a column showing the binary class (Tumor, 1; Normal, 0).
  • For steps 2 & 3, a two-column tab-delimited file containing cell-code and cell-type.
  • For steps 2 & 3, a tab-delimited file containing gene-by-cell sparse matrix composed of log-transformed counts.

Step1. RF - Cell classifier for single genes

cd code

Rscript step-1a__random_forest.R ../data/input-1__scrna_class.txt
# 'input-1__scrna_class.txt': Cell-by-Gene-Matrix-with-Class

python step-1b__random_forest.py

Step2. Expression logic evaluator for ECF

python step-2__gate_coverage_calc.py \
../data/input-2a__scrna_annotation.txt \
../data/input-2b__scrna_gc-matrix.txt
# 'input-2a__scrna_annotation.txt' : Cell-Type-Annotation
# 'input-2b__scrna_gc-matrix.txt' : Gene-by-Cell-sparse-Matrix

Step3. CNN - Cell classifier for gene combinations

python step-3a__cnn_gradcam.py \
../data/input-2a__scrna_annotation.txt \
../data/input-2b__scrna_gc-matrix.txt
# 'input-2a__scrna_annotation.txt' : Cell-Type-Annotation
# 'input-2b__scrna_gc-matrix.txt' : Gene-by-Cell-sparse-Matrix

python step-3b__cnn_gradcam.py
python step-3c__cnn_gradcam.py

Pre-process

The directory named 'preprocess' contains the codes for pre-processing.

  • Initial filtering, normalization, and dimensional reduction by Seurat
  • Reference-based cell-type annotation by SingleR
  • Aneuploid cell prediction by CopyKat
  • Volume-dependent sub-sampling by Geosketch
  • Batch correction by BBKNN
  • Geosketch and BBKNN codes are based on the module named 'scjp' that Dr.Park made manually.