/scHLApers

Code to run the scHLApers pipeline for personalized single-cell HLA quantification

Primary LanguageJupyter NotebookGNU General Public License v3.0GPL-3.0

scHLApers

Code to run the scHLApers pipeline for quantifying single-cell HLA expression using personalized reference genomes (Kang et al., Nat Genetics 2023). Overview

Requirements

R program requires (listed version or higher):

  • R=4.0.5
  • Biostrings=2.58.0
  • purrr=0.3.4
  • readr=2.1.2
  • stringi=1.7.8
  • stringr=1.4.0
  • tidyverse=1.3.1
  • rtracklayer=1.50.0

Other software:

Data:

  • Reference genome (e.g. GRCh38.primary_assembly.genome.fa): available here
  • Gene annotation file (e.g. gencode.v38.annotation.gtf): available here
  • Cell barcode whitelist: more info here

Pipeline and example data

Each step has its own directory with necessary scripts and a tutorial walking through the steps. The example_data and example_output directories contain example input and output files for 2 samples. The raw scRNA-seq data for the example was obtained from Yazar et al. Science 2022 study, publicly available on GEO (GSE196830).

Input

The inputs to scHLApers are:

  • Raw scRNA-seq data (either FASTQ or BAM format)
  • HLA allele calls (in CSV format, labeled as "SampleX_alleles.csv", see example_data/inputs/alleles for format)

See the HLA analyses tutorial from Sakaue et al. for protocol for imputing HLA alleles from genotype array data.

Step 1: Prepare HLA allelic sequence database

We provide a pre-prepared database generated from IPD-IMGT/HLA version 3.47 that can be directly used in Step 2. Alternatively, you can prepare your own database using the latest IPD-IMGT/HLA verison following the tutorial.

Step 2: Make personalized reference and annotation files

The tutorial demonstrates how to generate personalized contigs (FASTA) and annotations (GTF) files (that will be combined with the masked reference) and how to mask the reference.

Step 3: Quantify single-cell expression with STARsolo

Example scripts for how to run STARsolo for read alignment and expression quantification in single-cell data. Script will need to be modified based on the specifics of your dataset (e.g. UMI length, input format, barcode whitelist path, STAR executable). Please see the STAR manual for all options.

Outputs

The output of scHLApers is a genes by cells expression matrix, with improved classical HLA expression estimates. In the example output, we have filtered the raw STARsolo counts matrix (to remove empty droplets) using a provided list of cell barcodes (see example_data/cell_meta_example.csv).

The raw counts matrix output by the pipeline for example Sample_1006_1007 can be found here: ../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/Sample_1006_1007_scHLApers_Solo.out/GeneFull_Ex50pAS/raw/UniqueAndMult-EM.mtx

A filtered version is located here (read into R using readRDS): ../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/exp_EM.rds

Note: The classical HLA genes are named IMGT_A, IMGT_C, IMGT_B, IMGT_DRB1, IMGT_DQA1, IMGT_DQB1, IMGT_DPA1, IMGT_DPB1.

Support

For questions and assistance not answered in tutorials, you can contact Joyce Kang (joyce_kang AT hms.harvard DOT edu).

Reproducing results from the manuscript

Code to reproduce the figures and analyses from Kang et al. will become available at https://github.com/immunogenomics/hla2023.