epi2me-labs/wf-single-cell

Spiked hg38 ref aligns but does not report spike-in gene/transcript features

Closed this issue · 5 comments

Operating System

Ubuntu 22.04

Other Linux

No response

Workflow Version

v1.0.1

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

From config file:
{
"help": false,
"version": false,
"fastq": "400bps_full/run.fastq",
"out_dir": "output",
"sample_sheet": null,
"sample": null,
"single_cell_sample_sheet": null,
"aws_image_prefix": null,
"aws_queue": null,
"disable_ping": false,
"kit_config": null,
"max_threads": 16,
"plot_umaps": true,
"ref_genome_dir": "hg38+spike/",
"barcode_adapter1_suff_length": 10,
"barcode_min_quality": 15,
"barcode_max_ed": 2,
"barcode_min_ed_diff": 2,
"gene_assigns_minqv": 30,
"matrix_min_genes": 200,
"matrix_min_cells": 1,
"matrix_max_mito": 25,
"matrix_norm_count": 10000,
"umap_plot_genes": null,
"umap_n_repeats": 6,
"resources_mm2_max_threads": 16,
"resources_mm2_flags": "-I 16G",
"process_chunk_size": 100000,
"adapter_scan_chunk_size": 0,
"kit_name": "5prime",
"kit_version": "v1",
"expected_cells": 1000,
"merge_bam": false,
"mito_prefix": "MT-",
"stringtie_opts": "-c 1",
"monochrome_logs": false,
"validate_params": true,
"show_hidden_params": false,

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

We are using a custom hg38 reference containing spiked-in exogenous genes/transcripts, generated using 10x's mkref tool and using their custom ref instructions. Our spike-in gene/transcripts are found in the geneInfo.tab and transcriptInfo.tab STAR files within the custom 10x reference folder, indicating that the reference was created properly.

When we run the wf-single-cell pipeline using this custom reference, we observe mapping of individual reads to the spike-in reference contigs (we are able to see this in the .read_tags output file). However, the wf-single-cell pipeline fails to subsequently mark any of these reads as associated spike-in gene or transcript feature counts. Simultaneously, transcripts from the hg38 genome are being successfully annotated as features in the same run.

While we typically observe at least several UMI counts of spike-in transcript coverages per cell, as a further sanity check, we also set stringtie to c=1 and matrix_min_cells=1 to prevent any read coverage transcript filtering; however this had no effect on the feature output.

When we dig deeper into the 'work' scratch folder, we see that a feature tag file is being generated for each of the spike-in fasta contigs; however, the file is always empty. We additionally see that the ref_genes.bed file generated by the wf-single-cell pipeline does not contain any of our spike-in genes.

We cannot figure out what is going wrong here?

For reference, here is one line from the end of our gtf file.

TCR2 unknown exon 1 1370 . + . gene_id "TCR2"; transcript_id "TCR2"; gene_name "TCR2"; gene_biotype "protein_coding"; transcript_type "protein_coding"; transcript_name "TCR2";

Relevant log output

N/A

Application activity log entry

No response

Hi @Fractal10

IS there some mismatch between the reference genome sequence and your gtf. Is there a sequence in your hg38+spike fasta with the the ID TCR2?

Hi @nrhorner

There is no mismatch between the fasta reference sequence and the gtf. Yes, there is a sequence in the hg38+spike fasta with the id TCR2.

In one of the work subdirectories, I see that individual GTFs (with genes/transcripts) are being created for each of the fasta ref contigs, including the spike-ins. However, those spike-in genes are not subsequently appearing in the ref_genes.bed file in another work sub-folder.

Hi @Fractal10

Apologies for the late reply. If you are still encountering tis issue, please could you try again with the latest version

This version can be selected by adding the following to the command -r v2.0.2

Thanks

@Fractal10 Also would you be able to send be a full GTF entry for one of your spike-ins?

Closing due to lack of response