This repository aims to share the raw data processing and visualization codes used in Spatial_epigenome-transcriptome_co-sequencing project.
Next Generation Sequencing (NGS) was performed using the Illumina NovaSeq 6000 sequencer (pair-end 150 bp mode).
Read 1: contains the spatial Barcode A and Barcode B Read 2: contains the genome sequences
Raw read 1 -> New Read 1 + New Read 2
- New Read 1: contains the genome sequences
- New Read 2: contains the spatial Barcode A and Barcode B
Raw read 2 -> New Read 3
Reformatting raw data was implemented by BC_process.py in the Data_preprocessing folder.
The reformated data was processed using Cell Ranger ATAC v1.2 with following references: Mouse reference (mm10):
curl -O https://cf.10xgenomics.com/supp/cell-atac/refdata-cellranger-atac-mm10-1.2.0.tar.gz
Human reference (GRCh38):
curl -O https://cf.10xgenomics.com/supp/cell-atac/refdata-cellranger-atac-GRCh38-1.2.0.tar.gz
A preprocessing pipeline we developed using Snakemake workflow management system is in the Data_preprocessing folder. To run the pipeline, use the command:
sbatch Snakemake.sh
1. Raw Fastq data processing using ST pipeline and generate expression matrix
The DBiT-seq Raw fastq file Read 1: Contains the cDNA sequence Read 2: Contains the spatial Barcode A, Barcode B and UMIs
2.Reformat Fastq Read 2 file To reformat the Raw data, run the fastq_process.py in Rawdata_processing folder and gzip the resulted fastq file to save space:
python fastq_process.py
gzip sample_R2_processed.fastq
The reformated data was processed following ST pipeline.
3、Run ST pipeline Run st_pipeline.sh to start the ST pipeline: The input is processed_R2.fastq.gz and Raw R1.fastq.gz. It also requires a "spatial_barcodes_index.txt" to decode the spatial location information. Genome references and annotatation files were aslo needed.
#!/bin/bash
# FASTQ reads
FW=$tmp/${sample}_R2_processed.fastq
RV=$tmp/${sample}_raw_qc_R1.fastq.gz
# References for mapping, annotation and nonRNA-filtering
MAP=/Dropseq_Alignment_References/mm10/
ANN=/Dropseq_Alignment_References/mm10/mm10.gtf
CONT=/Spatial_omics_references/mouse/GRCm38_86/ncRNA/StarIndex/
# Barcodes settings
ID=/useful/spatial_barcodes.txt
# Output folder and experiment name
OUTPUT=../output/
mkdir -p $OUTPUT
TMP_ST=$OUTPUT/tmp
mkdir -p $TMP_ST
# Running the pipeline
st_pipeline_run.py \
--output-folder $OUTPUT \
--ids $ID \
--ref-map $MAP \
--ref-annotation $ANN \
--expName $sample \
--htseq-no-ambiguous \
--verbose \
--log-file $OUTPUT/${sample}_log.txt \
--demultiplexing-kmer 5 \
--threads 20 \
--temp-folder $TMP_ST \
--no-clean-up \
--umi-start-position 16 \
--umi-end-position 26 \
--demultiplexing-overhang 0 \
--min-length-qual-trimming 20 \
$FW $RV
4.Convert Ensemble to Gene Names Then, Run converttoname.sh to annotate the resulting sample_stdata.tsv.
#!/bin/bash
tsv_E=$OUTPUT/${sample}_stdata.tsv
path_to_annotation_file=/Dropseq_Alignment_References/mm10/mm10.gtf
convertEnsemblToNames.py --annotation $path_to_annotation_file --output $OUTPUT/${sample}_stdata_names.tsv $tsv_E
link: https://github.com/edicliuyang/DBiT-seq_FFPE/tree/master/Figure_Processing
The data visualization were completed with R language. The package used extensively the functions in Signac V1.6, Seurat V4.0 and ArchR v1.0.2.