Problems with v2.0.0

Question

Closed this issue 3 months ago · 5 comments

ddiez commented 4 months ago

Operating System

Other Linux (please specify below)

Other Linux

Ubuntu 23.10

Workflow Version

v2.0.0

Workflow Execution

Command line (Local)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

nextflow run epi2me-labs/wf-single-cell \
    -w workspace \
    -r master \
    -profile standard \
    --fastq wf-single-cell-demo/fastq/A/chr17.fq.gz \
    --kit_name 3prime \
    --kit_version v3 \
    --expected_cells 500 \
    --ref_genome_dir ~/10x/refdata-gex/refdata-gex-GRCh38-2020-A \
    --out_dir single-cell-demo-out_latest_single \
    --umap_n_repeats 1

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

First of all, congrats on a great v2 release with so many improvements. The pipeline now runs quickly and successfully on all my datasets. Thanks for the great work. I have found a couple of problems, which I detail here:

Feature ids in features.tsv.gz are all characterized as unknown, instead of including, I assume, the ref_gene_id (ensembl):

unknown_00000   AANAT   Gene Expression
unknown_00001   AAR2    Gene Expression
unknown_00002   AARSD1  Gene Expression
unknown_00003   AASDHPPT        Gene Expression
unknown_00004   AATF    Gene Expression
unknown_00005   AATK    Gene Expression
unknown_00006   ABCA10  Gene Expression
unknown_00007   ABCA5   Gene Expression
unknown_00008   ABCA6   Gene Expression

A minor problem is that in the UMAPs showing mitochondrial percentage all values are zero. Also, in the file *.expression.mito-per-cell.tsv the value for mito_pct is 0 for all barcodes. I confirm that mitochondrial genes are found in the features.tsv.gz file:

unknown_01181   MT-ATP6 Gene Expression
unknown_01182   MT-CO1  Gene Expression
unknown_01183   MT-CO2  Gene Expression
unknown_01184   MT-CO3  Gene Expression
unknown_01185   MT-CYB  Gene Expression
unknown_01186   MT-ND1  Gene Expression
unknown_01187   MT-ND2  Gene Expression
unknown_01188   MT-ND3  Gene Expression
unknown_01189   MT-ND4  Gene Expression
unknown_01190   MT-ND4L Gene Expression
unknown_01191   MT-ND5  Gene Expression

I have found these problems in my own datasets too, both in human and mouse samples.

Relevant log output

N E X T F L O W  ~  version 23.10.1
Launching `https://github.com/epi2me-labs/wf-single-cell` [lonely_dubinsky] DSL2 - revision: 5690e2a1b7 [master]

||||||||||   _____ ____ ___ ____  __  __ _____      _       _
||||||||||  | ____|  _ \_ _|___ \|  \/  | ____|    | | __ _| |__  ___
|||||       |  _| | |_) | |  __) | |\/| |  _| _____| |/ _` | '_ \/ __|
|||||       | |___|  __/| | / __/| |  | | |__|_____| | (_| | |_) \__ \
||||||||||  |_____|_|  |___|_____|_|  |_|_____|    |_|\__,_|_.__/|___/
||||||||||  wf-single-cell v2.0.0-g5690e2a
--------------------------------------------------------------------------------
Core Nextflow options
  revision       : master
  runName        : lonely_dubinsky
  containerEngine: docker
  container      : [withLabel:singlecell:ontresearch/wf-single-cell:sha0fcdf10929fbef2d426bb985e16b81153a88c6f4, withLabel:wf_common:ontresearch/wf-common:sha91cd87900c86f05bf36d8c77b841b8fda5ecf3aa]
  launchDir      : /home/diez/tmp/ont
  workDir        : /home/diez/tmp/ont/single-cell-demo-out2/workspace
  projectDir     : /home/diez/.nextflow/assets/epi2me-labs/wf-single-cell
  userName       : diez
  profile        : standard
  configFiles    : /home/diez/.nextflow/assets/epi2me-labs/wf-single-cell/nextflow.config

Input Options
  fastq          : wf-single-cell-demo/fastq/A/chr17.fq.gz
  ref_genome_dir : /home/diez/10x/refdata-gex/refdata-gex-GRCh38-2020-A

Output Options
  out_dir        : single-cell-demo-out_latest_single

Advanced options
  umap_n_repeats : 1

!! Only displaying parameters that differ from the pipeline defaults !!
--------------------------------------------------------------------------------
If you use epi2me-labs/wf-single-cell for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x


--------------------------------------------------------------------------------
This is epi2me-labs/wf-single-cell v2.0.0-g5690e2a.
--------------------------------------------------------------------------------
Searching input for [.fastq, .fastq.gz, .fq, .fq.gz] files.
executor >  local (158)
[8e/3ce0e5] process > fastcat (1)                                       [100%] 1 of 1 ✔
[5a/099b09] process > parse_kit_metadata (1)                            [100%] 1 of 1 ✔
[a4/3ea687] process > pipeline:getVersions                              [100%] 1 of 1 ✔
[2b/9d6510] process > pipeline:getParams                                [100%] 1 of 1 ✔
[74/0af583] process > pipeline:preprocess:call_paftools                 [100%] 1 of 1 ✔
[3b/f46bc4] process > pipeline:preprocess:get_chrom_sizes               [100%] 1 of 1 ✔
[72/f31cab] process > pipeline:preprocess:build_minimap_index           [100%] 1 of 1 ✔
[c4/4cd1df] process > pipeline:preprocess:call_adapter_scan (1)         [100%] 1 of 1 ✔
[d1/0caf2b] process > pipeline:process_bams:split_gtf_by_chroms         [100%] 1 of 1 ✔
[77/c9e884] process > pipeline:process_bams:generate_whitelist (1)      [100%] 1 of 1 ✔
[7f/7dee20] process > pipeline:process_bams:assign_barcodes (1)         [100%] 1 of 1 ✔
[54/e65d95] process > pipeline:process_bams:cat_tags_by_chrom (1)       [100%] 1 of 1 ✔
[8f/90cc9c] process > pipeline:process_bams:merge_bams (1)              [100%] 1 of 1 ✔
[a6/47ac90] process > pipeline:process_bams:stringtie (24)              [100%] 40 of 40 ✔
[ad/dfed7a] process > pipeline:process_bams:align_to_transcriptome (40) [100%] 40 of 40 ✔
[2c/3b0e51] process > pipeline:process_bams:assign_features (28)        [100%] 28 of 28 ✔
[37/ddf44d] process > pipeline:process_bams:create_matrix (28)          [100%] 28 of 28 ✔
[5c/049086] process > pipeline:process_bams:process_matrix (1)          [100%] 2 of 2 ✔
[cd/a65a86] process > pipeline:process_bams:merge_transcriptome (1)     [100%] 1 of 1 ✔
[d8/da4c74] process > pipeline:process_bams:combine_final_tag_files (1) [100%] 1 of 1 ✔
[7a/9cbf40] process > pipeline:process_bams:tag_bam (1)                 [100%] 1 of 1 ✔
[ce/7f95e7] process > pipeline:process_bams:umi_gene_saturation (1)     [100%] 1 of 1 ✔
[48/cf8458] process > pipeline:process_bams:pack_images (1)             [100%] 1 of 1 ✔
[b7/71eff8] process > pipeline:prepare_report_data (1)                  [100%] 1 of 1 ✔
[37/a05a63] process > pipeline:makeReport (1)                           [100%] 1 of 1 ✔
Completed at: 10-May-2024 22:17:52
Duration    : 9m 50s
CPU hours   : 1.6
Succeeded   : 158

Application activity log entry

No response

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

No response

Answer 1 · 2024-05-10T14:29:44.000Z

> Feature ids in features.tsv.gz are all characterized as unknown, instead of including, I assume, the ref_gene_id (ensembl):

This is the expected behaviour currently. The code that writes out the MTX outputs doesn't have access to the data for that column of the features file. Rather than deviate from the 10X-style output we chose to stub out the column with the "unknown" text.

> the value for mito_pct is 0 for all barcodes.

The genes will always be listed in the features file regardless of their abundances: the file constitutes an index for the sparse matrix in the MTX file.
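To illustrate the point about the features file acting as an index, here is a minimal sketch in Python. It assumes the 10x-style convention that MTX rows are features and columns are barcodes, both 1-based; the function name `totals_per_feature` and the toy data are hypothetical and are not part of the workflow.

```python
def totals_per_feature(feature_names, mtx_lines):
    """Sum counts per feature from MatrixMarket coordinate lines.

    Assumes rows index features (1-based), as in 10x-style output.
    Features with no entries in the sparse matrix still get a total of 0.
    """
    # Skip the %%MatrixMarket header, comment lines, and blank lines.
    body = (ln for ln in mtx_lines if ln.strip() and not ln.startswith("%"))
    n_rows, _n_cols, _nnz = map(int, next(body).split())
    if n_rows != len(feature_names):
        raise ValueError("features file and matrix disagree on row count")
    totals = [0.0] * n_rows
    for ln in body:
        row, _col, value = ln.split()
        totals[int(row) - 1] += float(value)
    return dict(zip(feature_names, totals))


# Toy data: three listed features, but only MT-CO1 has any counts.
features = ["AANAT", "MT-CO1", "MT-ND3"]
mtx = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 2",   # 3 features, 2 barcodes, 2 non-zero entries
    "2 1 1",   # feature 2 (MT-CO1), barcode 1, count 1
    "2 2 2",   # feature 2 (MT-CO1), barcode 2, count 2
]
print(totals_per_feature(features, mtx))
# {'AANAT': 0.0, 'MT-CO1': 3.0, 'MT-ND3': 0.0}
```

A gene being listed in features.tsv.gz therefore says nothing about its abundance; only the entries in the matrix do.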

Answer 2 · 2024-05-11T05:25:35.000Z

> > Feature ids in features.tsv.gz are all characterized as unknown, instead of including, I assume, the ref_gene_id (ensembl):

> This is the expected behaviour currently. The code that writes out the MTX outputs doesn't have access to the data for that column of the features file. Rather than deviate from the 10X-style output we chose to stub out the column with the "unknown" text.

Ah, I apologize for the noise on this one. For some reason I thought transcript_raw_feature_bc_matrix also contained gene symbols instead of transcript IDs, which was my primary concern. I should be more careful and not submit issues when tired. Sorry about that. I guess it would be better if gene_raw_feature_bc_matrix had the Ensembl gene IDs instead of the unknown placeholders, but I also agree this is better than deviating from the 10x output.
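In the meantime, a possible workaround is to fill in the ID column after the run. The sketch below is only an illustration, assuming the annotation GTF in the reference directory carries `gene_id` and `gene_name` attributes on its `gene` records (10x reference GTFs normally do). The helper names `symbol_to_gene_id` and `relabel_features` are hypothetical, and the IDs in the example are placeholders, not real Ensembl accessions.

```python
import re

# Matches GTF attributes of the form: key "value";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def symbol_to_gene_id(gtf_lines):
    """Build a gene_name -> gene_id mapping from the 'gene' records of a GTF."""
    mapping = {}
    for ln in gtf_lines:
        if ln.startswith("#"):
            continue
        fields = ln.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(_ATTR.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            # Symbols are not guaranteed unique; keep the first ID seen.
            mapping.setdefault(attrs["gene_name"], attrs["gene_id"])
    return mapping


def relabel_features(feature_rows, mapping):
    """Swap the stub ID for the mapped gene_id, leaving unmatched rows as-is."""
    return [[mapping.get(name, fid), name, ftype]
            for fid, name, ftype in feature_rows]


# Placeholder GTF record and features row, for illustration only.
gtf = ['chr17\tsrc\tgene\t1\t100\t.\t+\t.\t'
       'gene_id "GENE_ID_PLACEHOLDER"; gene_version "1"; gene_name "AANAT";']
rows = [["unknown_00000", "AANAT", "Gene Expression"],
        ["unknown_00001", "AAR2", "Gene Expression"]]
print(relabel_features(rows, symbol_to_gene_id(gtf)))
```

Because the lookup goes through gene symbols, any symbol shared by more than one gene would be ambiguous, so the result is worth spot-checking before use downstream.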

> > the value for mito_pct is 0 for all barcodes.

> The genes will always be listed in the features file regardless of their abundances: the file constitutes an index for the sparse matrix in the MTX file.

Yes, I understand this, but I fear I did not explain the issue properly. For example, in the demo data this is a sample of the counts for mitochondrial genes in gene_raw_feature_bc_matrix (rows are genes, columns are barcodes, and a dot marks a zero):

MT-ATP6 1 . 1 . . . . 1 . 1
MT-CO1  . 1 1 2 . 1 1 . 1 1
MT-CO2  . . 1 . 1 . . . . 1
MT-CO3  1 . 1 1 . . . 1 1 1
MT-CYB  . . . 1 . . . . . .
MT-ND1  . . . 2 . . . . . .
MT-ND2  . . . . . . . . 1 1
MT-ND3  . . . . . . . . . .
MT-ND4  . . . . . . . 1 . 1
MT-ND4L . . . . . . . . . .
MT-ND5  . . . . . . . . . .

In spite of this, the file gene.expression.mito-per-cell.tsv shows a mito_pct of 0 for all barcodes, and the UMAP of mitochondrial percentage in the report (wf-single-cell-report.html) shows all zeros.

I think this is only an issue with the report and perhaps the gene.expression.mito-per-cell.tsv file, since the mito data is correctly included in the matrix file we use for analysis.
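For anyone wanting to cross-check the mito-per-cell file independently, the percentage can be recomputed from the matrix itself. This is a rough sketch, not the workflow's own method: it assumes 10x-style MTX layout (rows are features, columns are barcodes, 1-based) and that mitochondrial genes are the ones whose symbol starts with `MT-` (matched case-insensitively so mouse `mt-` symbols are caught too). The function name `mito_pct_per_barcode` is made up for this example.

```python
def mito_pct_per_barcode(gene_names, barcodes, mtx_lines, prefix="MT-"):
    """Return {barcode: percent of counts from genes starting with prefix}."""
    is_mito = [name.upper().startswith(prefix.upper()) for name in gene_names]
    total = [0.0] * len(barcodes)
    mito = [0.0] * len(barcodes)
    # Skip the %%MatrixMarket header, comment lines, and blank lines.
    body = (ln for ln in mtx_lines if ln.strip() and not ln.startswith("%"))
    next(body)  # size line: n_features n_barcodes n_entries
    for ln in body:
        row, col, value = ln.split()
        r, c, v = int(row) - 1, int(col) - 1, float(value)
        total[c] += v
        if is_mito[r]:
            mito[c] += v
    # Barcodes with no counts at all are reported as 0.0 rather than dividing by zero.
    return {bc: (100.0 * m / t if t else 0.0)
            for bc, m, t in zip(barcodes, mito, total)}


# Toy data (not taken from the demo run): two barcodes, three genes.
genes = ["AANAT", "MT-CO1", "MT-ND1"]
cells = ["BC_A", "BC_B"]
mtx = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 3",
    "1 1 3",   # AANAT in BC_A
    "2 1 1",   # MT-CO1 in BC_A
    "3 2 2",   # MT-ND1 in BC_B
]
print(mito_pct_per_barcode(genes, cells, mtx))
# {'BC_A': 25.0, 'BC_B': 100.0}
```

With real output, the gene names would come from the second column of features.tsv.gz and the matrix lines from the accompanying MTX file (read with `gzip.open(path, "rt")` if compressed). Any non-zero result for a barcode that the TSV reports as 0 would confirm the discrepancy described above.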

Answer 3 · 2024-05-11T07:07:48.000Z

Sorry, I clicked send before writing everything I meant to write.

The code for handling and transforming the counts was almost entirely rewritten (twice! A first pass to rationalise memory use, a second for performance). It's very possible we've introduced a bug there. We need to add more tests to the code to catch this stuff!

We'll take a look at this early next week. (Be aware we released a patch v2.0.1 -- this does not contain a fix for this issue).

Answer 4 · 2024-05-11T07:37:11.000Z

@cjw85 thanks for letting me know! I will keep an eye on new versions.

Answer 5 · 2024-05-14T12:54:34.000Z

v2.0.2 should make its way to GitHub this afternoon and fixes the zeroes in the gene.expression.mito-per-cell.tsv file.