Gene expression levels and Transcript expression levels don't correlate

Question

Gene expression levels and Transcript expression levels don't correlate

Closed this issue 2 months ago · 1 comments

pratarora commented 3 months ago

Operating System

Ubuntu 22.04

Other Linux

No response

Workflow Version

Nextflow version version 23.10.1, build 5891

Workflow Execution

Command line (Local)

Other workflow execution

No response

EPI2ME Version

dev version

CLI command run

nextflow run epi2me-labs/wf-single-cell --fastq /mnt/disk2/Carla_seq/LRONT_samples/CI_1/CI_1.pass.fastq.gz --sample CI_1 --out_dir nextflow_results_CI_1_test --kit_name 3prime --kit_version v3 --ref_genome_dir /mnt/disk2/Carla_seq/reference_genomes/Ensembl/Danio_rerio/GRCz11/v111/v111_10X/ --merge_bam True --matrix_max_mito 60 -profile standard --max_threads 20 --resources_mm2_max_threads 20 --resources_mm2_flags '-I 48G' --mito_prefix mt-

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

Hi,

We have been trying the workflow and came across a problem. If a gene has only one transcript, according to us the gene and transcript levels should match (since gene levels are made up of all the transcripts that make up the gene). However, we don't see that. We saw this in a couple of individual cases (plotted as violin plot -- they are from raw counts not normalized or anything) and then checked it across all the genes (with one transcript) and saw the values don't correlate (scatter plots -- they are from raw counts not normalized or anything).
We went through the documentation on how the genes are called but couldn't understand how the genes are called:

Gene assignment
For each read, the alignment with the highest genomic MAPQ score is used for gene assignemnt if this score > gene_assigns_minqv (default 30). The gene ID is then derived from the transcript assigned in the transcript assignment step.

Is the gene count coming from minimap and transcript from stringtie? could this be where the discrepancy arises?
Could there be some other place we are missing something?

Relevant log output

no logs

Application activity log entry

No response

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

No response

Answer 1 · 2024-05-16T14:17:23.000Z

There are known wrinkles in how the transcript and gene labels are assigned. We intend to change how these are derived to make things more transparent.

The logic for how it is performed is here: https://github.com/epi2me-labs/wf-single-cell/blob/master/bin/workflow_glue/assign_features.py#L197.

The code was optimised for performance in the last release but the algorithm was not changed. The assignment of genes is done independently from transcripts, which naturally can lead to inconsistencies for margin-calls.