Code to run the scHLApers pipeline for quantifying single-cell HLA expression using personalized reference genomes (Kang et al., Nat Genetics 2023).
R program requires (listed version or higher):
- R=4.0.5
- Biostrings=2.58.0
- purrr=0.3.4
- readr=2.1.2
- stringi=1.7.8
- stringr=1.4.0
- tidyverse=1.3.1
- rtracklayer=1.50.0
Other software:
- STAR=2.7.10a https://github.com/alexdobin/STAR
- samtools=1.4.1 http://www.htslib.org/download/
Data:
- Reference genome (e.g. GRCh38.primary_assembly.genome.fa): available here
- Gene annotation file (e.g. gencode.v38.annotation.gtf): available here
- Cell barcode whitelist: more info here
Each step has its own directory with necessary scripts and a tutorial walking through the steps. The example_data and example_output directories contain example input and output files for 2 samples. The raw scRNA-seq data for the example was obtained from Yazar et al. Science 2022 study, publicly available on GEO (GSE196830).
The inputs to scHLApers are:
- Raw scRNA-seq data (either FASTQ or BAM format)
- HLA allele calls (in CSV format, labeled as "SampleX_alleles.csv", see
example_data/inputs/alleles
for format)
See the HLA analyses tutorial from Sakaue et al. for protocol for imputing HLA alleles from genotype array data.
We provide a pre-prepared database generated from IPD-IMGT/HLA version 3.47 that can be directly used in Step 2. Alternatively, you can prepare your own database using the latest IPD-IMGT/HLA verison following the tutorial.
The tutorial demonstrates how to generate personalized contigs (FASTA) and annotations (GTF) files (that will be combined with the masked reference) and how to mask the reference.
Example scripts for how to run STARsolo for read alignment and expression quantification in single-cell data. Script will need to be modified based on the specifics of your dataset (e.g. UMI length, input format, barcode whitelist path, STAR executable). Please see the STAR manual for all options.
The output of scHLApers is a genes by cells expression matrix, with improved classical HLA expression estimates. In the example output, we have filtered the raw STARsolo counts matrix (to remove empty droplets) using a provided list of cell barcodes (see example_data/cell_meta_example.csv
).
The raw counts matrix output by the pipeline for example Sample_1006_1007
can be found here:
../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/Sample_1006_1007_scHLApers_Solo.out/GeneFull_Ex50pAS/raw/UniqueAndMult-EM.mtx
A filtered version is located here (read into R using readRDS
):
../example_outputs/STARsolo_results/Sample_1006_1007_scHLApers/exp_EM.rds
Note: The classical HLA genes are named IMGT_A
, IMGT_C
, IMGT_B
, IMGT_DRB1
, IMGT_DQA1
, IMGT_DQB1
, IMGT_DPA1
, IMGT_DPB1
.
For questions and assistance not answered in tutorials, you can contact Joyce Kang (joyce_kang AT hms.harvard DOT edu).
Code to reproduce the figures and analyses from Kang et al. will become available at https://github.com/immunogenomics/hla2023.