Problems with OpenMS-style experimental design input
hendrikweisser opened this issue · 23 comments
Description of the bug
Following advice from @jpfeuffer, I'm making the switch from the "nf-core/proteomicslfq" workflow to "nf-core/quantms" (dev version in both cases). I'm surprised to find that my "experimental_design.tsv" file that works for "proteomicslfq" doesn't work for "quantms". Contents of the file are included below.
If I name the file "experimental_design.tsv" and place it in the directory where I run the workflow, the file gets cleared (!) and I get the first error message (see below). The cause appears to be the `sed` command that attempts to change the file in-place.
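The clearing behaviour can be reproduced outside the workflow. The following is a minimal sketch, not the pipeline's code: the file names and contents are hypothetical, and the GNU-only `\t` and `I` parts of the original expression are replaced with a literal tab and a case-sensitive match to keep it portable. It shows that redirecting `sed` output to the file it is reading empties that file, because the shell truncates the redirect target before `sed` starts, and that writing to a temporary file first avoids this.

```shell
set -eu

workdir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical two-line design file with a .raw path followed by a tab.
printf 'Spectra_Filepath\tLabel\nrun_01.raw\t1\n' > "$workdir/design.tsv"
cp "$workdir/design.tsv" "$workdir/clobbered.tsv"

# Unsafe: the shell opens (and truncates) the output file before sed starts
# reading it, so sed sees empty input and the file stays empty.
sed "s/\.raw${tab}/.mzML${tab}/" "$workdir/clobbered.tsv" > "$workdir/clobbered.tsv"

# Safer: write to a separate file, then replace the original only if sed
# succeeded.
sed "s/\.raw${tab}/.mzML${tab}/" "$workdir/design.tsv" > "$workdir/design.tsv.tmp" \
  && mv "$workdir/design.tsv.tmp" "$workdir/design.tsv"

# Show the resulting sizes: clobbered.tsv is empty, design.tsv is not.
wc -c "$workdir/clobbered.tsv" "$workdir/design.tsv"
```

Inside a workflow process, writing the converted copy under a different name than the staged input would have the same effect as the temporary file here.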
If I give the file a different name (e.g. "experimental_design_test.tsv"), I get the second error message (see below).
Command used and terminal output
$ nextflow run nf-core/quantms -r dev -profile docker --input experimental_design.tsv --database ../Human_reference_proteome_TD.fasta
[...]
Error executing process > 'NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:PREPROCESS_EXPDESIGN'
Caused by:
Process `NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:PREPROCESS_EXPDESIGN` terminated with an error exit status (1)
Command executed:
sed 's/.raw\t/.mzML\t/I' experimental_design.tsv > experimental_design.tsv
a=$(grep -n '^$' experimental_design.tsv | head -n1| awk -F":" '{print $1}'); sed -e ''"${a}"',$d' experimental_design.tsv > process_experimental_design.tsv
Command exit status:
1
Command output:
(empty)
$ nextflow run nf-core/quantms -r dev -profile docker --input experimental_design_test.tsv --database ../Human_reference_proteome_TD.fasta
[...]
Cannot invoke method contains() on null object
-- Check script '/home/hendrik.weisser/.nextflow/assets/nf-core/quantms/./workflows/../subworkflows/local/create_input_channel.nf' at line: 137 or see '.nextflow.log' file for more details
Relevant files
experimental_design.tsv:
Fraction_Group Fraction Spectra_Filepath Label Sample
1 1 /home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220223e_JR_SST_02.mzML 1 1
2 1 /home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220228a_JR_SST_02.mzML 1 2
3 1 /home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220301a_JR_SST_03.mzML 1 3
4 1 /home/hendrik.weisser/Data/Proteomics/SST/mzML_files/20220301a_JR_SST_05.mzML 1 4
Sample MSstats_Condition MSstats_BioReplicate
1 before 1
2 before 2
3 after 1
4 after 2
System information
Nextflow version: 21.10.6
nf-core/quantms revision: 63dd266 [dev]
OS: Ubuntu 20.04.4 LTS
@daichengxin What is this line doing? Please comment and fix.
We also need tests for experimental designs.
> We also need tests for experimental designs.
I agree - that would be very helpful.
The thing is we have limited test data. How big is your data and is it public?
That line extracts only the file table (the first of the two tables), because Nextflow has to parse this file in a later step, and Nextflow has trouble parsing an experimental design file that contains two tables.
@hendrikweisser Is process_experimental_design.tsv generated? The test case for the experimental design file ran successfully, so I am not sure what is causing your problem.
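For illustration, here is a hedged sketch of one way to extract only the file table. It is not the pipeline's implementation; the helper name `first_table` and the sample files are hypothetical. The quoted `grep`/`sed` pair relies on the design containing an empty line: if there is none, `a` is empty and the expression becomes `,$d`, which GNU sed rejects. The `awk` variant below stops at the first blank line and otherwise copies the whole file, so it handles both the two-table and the one-table layout.

```shell
set -eu

workdir=$(mktemp -d)

# Hypothetical design containing only the file table.
{
  printf 'Fraction_Group\tFraction\tSpectra_Filepath\tLabel\tSample\n'
  printf '1\t1\trun_01.mzML\t1\t1\n'
} > "$workdir/one_table.tsv"

# Hypothetical two-table design: file table, blank line, sample table.
{
  cat "$workdir/one_table.tsv"
  printf '\n'
  printf 'Sample\tMSstats_Condition\tMSstats_BioReplicate\n'
  printf '1\tbefore\t1\n'
} > "$workdir/two_table.tsv"

# Print lines until the first line without any fields, then stop.
# Note: NF == 0 also treats whitespace-only lines as the separator.
first_table() {
  awk 'NF == 0 { exit } { print }' "$1"
}

first_table "$workdir/two_table.tsv" > "$workdir/from_two.tsv"
first_table "$workdir/one_table.tsv" > "$workdir/from_one.tsv"

cat "$workdir/from_two.tsv"
```

Both calls produce the same two-line file table, which is the part described above as being passed on for parsing.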
I don't think an OpenMS experimental design needs to be parsed. It can only be one acquisition/label type.
Just take all files that you extracted as a channel and give them the same meta information from the parameters.
But now that you are mentioning it: @hendrikweisser I think in proteomicslfq "dev" we also removed support for the two-table design as pipeline input because of ease of parsing.
> The thing is we have limited test data. How big is your data and is it public?
Unfortunately it's not public. Could you just use OpenMS test data (e.g. BSA)? Or use the same test data as now (from PRIDE), just replace the SDRF file with an "experimental_design.tsv"?
> Is process_experimental_design.tsv generated?
Yes, in "null/preprocess" in the directory where I started the workflow. Is it expected that a directory called "null" is created, or is that already a sign that something went wrong?
No, you need to set the `outdir` parameter, otherwise a result directory named null will be generated by default
Line 172 in 63dd266
> I think in proteomicslfq "dev" we also removed support for the two-table design as pipeline input because of ease of parsing.
No, it worked with the two-table design, but not with the one-table design (failed during "pmultiqc", see nf-core/proteomicslfq#186).
> No, you need to set the outdir parameter, otherwise a result directory named null will be generated by default
Okay. I would suggest using e.g. "results" as the default value.
> No, it worked with the two-table design, but not with the one-table design (failed during "pmultiqc", see nf-core/proteomicslfq#186).
This is weird, unless it accidentally does the right thing and only parses the first part.
https://github.com/nf-core/proteomicslfq/blob/69a0b9fc99bdc82faaae36f5e830dbfef52927b5/main.nf#L111
> Okay. I would suggest using e.g. "results" as the default value.
This does not work on clouds. I just made it required.
I've tried it with the BSA test data from OpenMS (the test case for the ProteomicsLFQ tool) and got the same error (`Cannot invoke method contains() on null object`). The call:
nextflow run nf-core/quantms -r dev -profile docker --input BSA_design.tsv --database /home/hendrik.weisser/Software/OpenMS/share/OpenMS/examples/TOPPAS/data/BSA_Identification/18Protein_SoCe_Tr_detergents_trace_target_decoy.fasta --outdir results
The experimental design file is here: https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/FRACTIONS/BSA_design.tsv
(I've adjusted it to use absolute paths to the mzML files on my system.)
The mzML files are in this folder: https://github.com/OpenMS/OpenMS/tree/develop/share/OpenMS/examples/FRACTIONS
The FASTA file is here: https://github.com/OpenMS/OpenMS/blob/develop/share/OpenMS/examples/TOPPAS/data/BSA_Identification/18Protein_SoCe_Tr_detergents_trace_target_decoy.fasta
The error actually happens in this line:
Is `meta.labelling_type` not set if the input isn't in SDRF format? (I was wondering how the workflow determines whether to use LFQ or TMT or DIA, since I didn't see a parameter for that.)
If I add `--labelling_type "label free sample"` to the call, I now get the same error in a different line:
Could you set `acquisition_method` to `dda`?
> Could you set acquisition_method to dda?
Yes, now it's running!
So with "experimental_design.tsv", the parameters "labelling_type" and "acquisition_method" are required.
Yes, because it is difficult to deduce this information from the experimental design file. I will update the pipeline to enforce these parameters. Thanks for testing; it helps improve quantms.
By the way, this information does not need to be supplied separately when the input is an SDRF file.
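Putting the thread's conclusion together, the sketch below assembles the full call for an OpenMS-style design. It only uses flags and values that appear in this thread (`--input`, `--database`, `--outdir`, `--labelling_type`, `--acquisition_method`); the paths are the reporter's and will differ elsewhere, later quantms versions may name things differently, and the `DRY_RUN` switch is a hypothetical convenience so the command can be inspected before running it.

```shell
set -eu

# Values taken from the thread; adjust the paths for your own system.
design=experimental_design_test.tsv
database=../Human_reference_proteome_TD.fasta
outdir=results
labelling_type='label free sample'
acquisition_method=dda

# Abort early if a parameter needed for experimental_design.tsv input is empty.
: "${outdir:?outdir must be set, otherwise results end up in a directory named null}"
: "${labelling_type:?labelling_type must be set for experimental_design.tsv input}"
: "${acquisition_method:?acquisition_method must be set for experimental_design.tsv input}"

# Build the argument list; each quoted value stays a single argument.
set -- nextflow run nf-core/quantms -r dev -profile docker \
  --input "$design" \
  --database "$database" \
  --outdir "$outdir" \
  --labelling_type "$labelling_type" \
  --acquisition_method "$acquisition_method"

# DRY_RUN=1 (the default here) only prints the command, without shell
# quoting; set DRY_RUN=0 to actually run it.
if [ "${DRY_RUN:-1}" = 1 ]; then
  printf '%s\n' "$*"
else
  "$@"
fi
```

With SDRF input the two extra parameters should not be needed, as noted above.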