nf-core/quantms

Problems with OpenMS-style experimental design input

hendrikweisser opened this issue · 23 comments

Description of the bug

Following advice from @jpfeuffer, I'm making the switch from the "nf-core/proteomicslfq" workflow to "nf-core/quantms" (dev version in both cases). I'm surprised to find that my "experimental_design.tsv" file that works for "proteomicslfq" doesn't work for "quantms". Contents of the file are included below.

If I name the file "experimental_design.tsv" and place it in the directory where I run the workflow, the file gets cleared (!) and I get the first error message (see below). The cause appears to be the sed command, which tries to change the file in place by redirecting its output to the same file it reads from; the shell truncates that file before sed gets to read it.
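To make the suspected cause concrete, here is a minimal sketch of what happens when a command's output is redirected onto its own input, followed by one common way to avoid it. It assumes GNU sed (for the `\t` escape and the `I` flag, as in the failing command); the file names are illustrative, and this is not necessarily how the pipeline was fixed.

```shell
# Scratch copy of a design file; all names here are illustrative.
workdir=$(mktemp -d)
design="$workdir/experimental_design.tsv"
printf 'sample_01.raw\t1\n' > "$design"

# Unsafe: the shell opens (and truncates) the output file before sed runs,
# so sed reads an empty input and the file stays empty.
sed 's/\.raw\t/.mzML\t/I' "$design" > "$design"
truncated_size=$(wc -c < "$design" | tr -d ' ')
echo "bytes left after redirecting onto the input: $truncated_size"

# Safer: write to a temporary file, then replace the original on success.
printf 'sample_01.raw\t1\n' > "$design"
sed 's/\.raw\t/.mzML\t/I' "$design" > "$design.tmp" && mv "$design.tmp" "$design"
fixed_content=$(cat "$design")
echo "$fixed_content"

rm -rf "$workdir"
```

Writing the result under a different name (or using `sed -i` where available) avoids the truncation, which would also explain why renaming the input file changes the error.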

If I give the file a different name (e.g. "experimental_design_test.tsv"), I get the second error message (see below).

Command used and terminal output

$ nextflow run nf-core/quantms -r dev -profile docker --input experimental_design.tsv --database ../Human_reference_proteome_TD.fasta
[...]
Error executing process > 'NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:PREPROCESS_EXPDESIGN'

Caused by:
  Process `NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:PREPROCESS_EXPDESIGN` terminated with an error exit status (1)

Command executed:

  sed 's/.raw\t/.mzML\t/I' experimental_design.tsv > experimental_design.tsv
  a=$(grep -n '^$' experimental_design.tsv | head -n1| awk -F":" '{print $1}'); sed -e ''"${a}"',$d' experimental_design.tsv > process_experimental_design.tsv

Command exit status:
  1

Command output:
  (empty)


$ nextflow run nf-core/quantms -r dev -profile docker --input experimental_design_test.tsv --database ../Human_reference_proteome_TD.fasta
[...]
Cannot invoke method contains() on null object

 -- Check script '/home/hendrik.weisser/.nextflow/assets/nf-core/quantms/./workflows/../subworkflows/local/create_input_channel.nf' at line: 137 or see '.nextflow.log' file for more details

Relevant files

experimental_design.tsv:

Fraction_Group	Fraction	Spectra_Filepath	Label	Sample
1	1	/home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220223e_JR_SST_02.mzML	1	1
2	1	/home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220228a_JR_SST_02.mzML	1	2
3	1	/home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220301a_JR_SST_03.mzML	1	3
4	1	/home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220301a_JR_SST_05.mzML	1	4

Sample	MSstats_Condition	MSstats_BioReplicate
1	before	1
2	before	2
3	after	1
4	after	2

System information

Nextflow version: 21.10.6
nf-core/quantms revision: 63dd266 [dev]
OS: Ubuntu 20.04.4 LTS

@daichengxin What is this line doing? Please comment and fix.

We also need tests for experimental designs.

We also need tests for experimental designs.

I agree - that would be very helpful.

The thing is we have limited test data. How big is your data and is it public?

Only the file section (the first table) is extracted here, because we need to parse this file with Nextflow in a later step. Nextflow has problems parsing an experimental design file that contains two tables.

ch_in_design.splitCsv(header: true, sep: '\t')

if (!file(filestr).exists()) {
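As a rough illustration of extracting only the file section of a two-table design, the sketch below copies lines up to (but not including) the first blank line. It is an alternative to the grep/sed command shown in the error log, not the pipeline's own code, and the design contents are made up for the example.

```shell
# Build a small two-table design file; contents are illustrative.
workdir=$(mktemp -d)
design="$workdir/design.tsv"
{
    printf 'Fraction_Group\tFraction\tSpectra_Filepath\tLabel\tSample\n'
    printf '1\t1\trun_01.mzML\t1\t1\n'
    printf '2\t1\trun_02.mzML\t1\t2\n'
    printf '\n'
    printf 'Sample\tMSstats_Condition\tMSstats_BioReplicate\n'
    printf '1\tbefore\t1\n'
    printf '2\tafter\t1\n'
} > "$design"

# Print lines until the first empty (or whitespace-only) line, then stop.
# If the file has no blank line, every line is kept.
awk '/^[[:space:]]*$/ { exit } { print }' "$design" > "$workdir/file_table.tsv"

file_table_rows=$(wc -l < "$workdir/file_table.tsv" | tr -d ' ')
last_row=$(tail -n 1 "$workdir/file_table.tsv")
echo "$file_table_rows lines kept"

rm -rf "$workdir"
```

One difference from the logged command: when no blank line is found, the grep-based line number is empty and sed receives the invalid expression `,$d`, whereas the awk version simply passes a one-table design through unchanged.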

Is process_experimental_design.tsv generated, @hendrikweisser? The test case for the experimental design file ran successfully, so I am not sure what is causing your problem.

I don't think an OpenMS experimental design needs to be parsed. It can only describe a single acquisition/label type.
Just take all files that you extracted as a channel and give them the same meta information from the parameters.

But now that you are mentioning it: @hendrikweisser I think in proteomicslfq "dev" we also removed support for the two-table design as pipeline input because of ease of parsing.

The thing is we have limited test data. How big is your data and is it public?

Unfortunately it's not public. Could you just use OpenMS test data (e.g. BSA)? Or use the same test data as now (from PRIDE), just replace the SRDF file with an "experimental_design.tsv"?

Is process_experimental_design.tsv generated?

Yes, in "null/preprocess" in the directory where I started the workflow. Is it expected that a directory called "null" is created, or is that already a sign that something went wrong?

No, you need to set the outdir parameter; otherwise a result directory named "null" will be created by default.

outdir = null

I think in proteomicslfq "dev" we also removed support for the two-table design as pipeline input because of ease of parsing.

No, it worked with the two-table design, but not with the one-table design (failed during "pmultiqc", see nf-core/proteomicslfq#186).

No, you need to set the outdir parameter; otherwise a result directory named "null" will be created by default.

Okay. I would suggest using e.g. "results" as the default value.

No, it worked with the two-table design, but not with the one-table design (failed during "pmultiqc", see nf-core/proteomicslfq#186).

This is weird, unless it accidentally does the right thing and only parses the first part.
https://github.com/nf-core/proteomicslfq/blob/69a0b9fc99bdc82faaae36f5e830dbfef52927b5/main.nf#L111

Okay. I would suggest using e.g. "results" as the default value.

This does not work in cloud environments. I just made it required.

I've tried it with the BSA test data from OpenMS (the test case for the ProteomicsLFQ tool) and got the same error (Cannot invoke method contains() on null object). The call:

nextflow run nf-core/quantms -r dev -profile docker --input BSA_design.tsv --database /home/hendrik.weisser/Software/OpenMS/share/OpenMS/examples/TOPPAS/data/BSA_Identification/18Protein_SoCe_Tr_detergents_trace_target_decoy.fasta --outdir results

The experimental design file is here: https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/FRACTIONS/BSA_design.tsv
(I've adjusted it to use absolute paths to the mzML files on my system.)

The mzML files are in this folder: https://github.com/OpenMS/OpenMS/tree/develop/share/OpenMS/examples/FRACTIONS

The FASTA file is here: https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/TOPPAS/data/BSA_Identification/18Protein_SoCe_Tr_detergents_trace_target_decoy.fasta

The error actually happens in this line:

if (meta.labelling_type.contains("tmt") || meta.labelling_type.contains("itraq") || meta.labelling_type.contains("label free")) {

Is meta.labelling_type not set if the input isn't in SDRF format? (I was wondering how the workflow determines whether to use LFQ or TMT or DIA, since I didn't see a parameter for that.)

If I add --labelling_type "label free sample" to the call, I now get the same error in a different line:

ch_meta_config_dia: it[0].acquisition_method.contains("dia")

Could you set acquisition_method to dda?

Could you set acquisition_method to dda?

Yes, now it's running!

So with "experimental_design.tsv", the parameters "labelling_type" and "acquisition_method" are required.
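A pre-flight check along these lines could catch the missing parameters before the workflow starts. This is only a sketch: `check_design_params` is a hypothetical helper, not part of quantms, and the accepted substrings are taken from the two conditions quoted earlier in this thread plus the "dda" value that made the run work.

```shell
# Hypothetical helper: returns 0 if both values look usable, 1 otherwise.
check_design_params() {
    # $1 = labelling_type, $2 = acquisition_method
    case "$1" in
        *tmt*|*itraq*|*"label free"*) ;;
        *) echo "labelling_type must contain 'tmt', 'itraq' or 'label free'" >&2
           return 1 ;;
    esac
    case "$2" in
        *dda*|*dia*) ;;
        *) echo "acquisition_method must contain 'dda' or 'dia'" >&2
           return 1 ;;
    esac
    return 0
}

if check_design_params "label free sample" "dda"; then
    echo "parameters look complete"
    # The run from this thread would then be started with, for example:
    #   nextflow run nf-core/quantms -r dev -profile docker \
    #       --input BSA_design.tsv --database <target_decoy.fasta> --outdir results \
    #       --labelling_type "label free sample" --acquisition_method dda
fi
```

The `<target_decoy.fasta>` placeholder stands for the database path used above; all flags in the commented command are the ones that appear in this thread.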

Yes, because it is difficult to deduce this information from the experimental design file. I will update the pipeline to enforce these parameters. Thanks for testing and helping to improve quantms.

Btw, this information does not need to be supplied separately when an SDRF file is used as input.

fixed in 4f19874