ctmrbio/stag-mwc

Make it possible to shortcut early steps

boulund opened this issue · 5 comments

I want to make it possible to take a shortcut in the StaG dependency graph, so that users can skip the early QC steps if their data have already been through QC and had host sequences removed.

My quick and dirty idea is to just let users specify any input folder and a filename pattern and make sure they copy or symlink their files there.

The only things I think need modification in the workflow are the preprocessing steps that assume input files are located in the input folder. In addition, all preprocessing rule definitions need to be indented so they fall under the if-statement in their respective rule file, so that they are not evaluated if the user requests to skip them.

I just made a quick test and it seems to work. It might feel a bit hacky, but requires pretty much no modification of the workflow at all... Pretty neat.

This is now pushed to develop, but still requires a mention in the docs.

I didn't really think this through, and I think this introduces an annoying "feature": if the user sets qc_reads and remove_human to False in the config file, then the workflow will not find anything to run, and will exit without any message. When these settings are not True, the corresponding rules are not included in the workflow, and Snakemake can't find a DAG solution to produce the requested output files.

Perhaps we can try to work around this by explaining how it works in the docs. Both of these settings already default to True in the config file, so hopefully most people won't be affected.
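Besides documenting the behaviour, the silent exit could in principle be turned into an explicit message. The following is only a sketch of that idea, not something StaG does: the function name shortcut_warnings is hypothetical, and it assumes that qc_reads, remove_human, inputdir and input_fn_pattern are available as top-level keys of a plain dict.

```python
import re
from pathlib import Path


def shortcut_warnings(config):
    """Return warnings for a config that skips both preprocessing steps.

    Hypothetical helper: assumes a flat dict with the keys qc_reads,
    remove_human, inputdir and input_fn_pattern.
    """
    warnings = []
    skipping = not config.get("qc_reads", True) and not config.get("remove_human", True)
    if not skipping:
        return warnings

    # Turn '{sample}_{readpair}.fq.gz' into the glob '*_*.fq.gz'.
    glob_pattern = re.sub(r"\{\w+\}", "*", config["input_fn_pattern"])
    inputdir = Path(config["inputdir"])
    found = list(inputdir.glob(glob_pattern)) if inputdir.is_dir() else []
    if not found:
        warnings.append(
            f"qc_reads and remove_human are both False, but no files matching "
            f"'{glob_pattern}' were found in '{inputdir}'; nothing can be run."
        )
    return warnings
```

Calling such a check before the rules are evaluated and printing whatever it returns would give the user a hint instead of an empty run.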

It would greatly simplify "short-cutting" the dependency graph if the output filenames from the fastp and host_removal steps were identical to the input filenames. I have previously held the firm position that the preprocessed files should have different filenames than the raw input files, to avoid the risk of confusing the two.

Currently, assuming the input files are called sample1_R{1,2}.fq.gz, then the intermediary files would be sample1_R{1,2}.qc.fq.gz for the fastp output files in the fastp output folder and sample1_R{1,2}.host_removal.fq.gz for the host_removal output files in the host_removal output folder.

The alternative is to just trust that the output folder names provide enough separation that it would be OK to call the intermediary files e.g. output_folder/fastp/sample1_R{1,2}.fq.gz and output_folder/host_removal/sample1_R{1,2}.fq.gz.

The goal of this would be to enable short-cutting the dependency graph, and thus make it possible for users (i.e. ourselves) to run StaG with a flexible starting point: either raw reads, or fastp-processed reads, or host_removed reads. This makes it easier for ourselves to start runs using data from already preprocessed flowcells, but also for other users that might have already run preprocessing and host removal beforehand.

What do you think about this @huyue87 @luhugerth @jwdebelius @StefPN?

@luhugerth says:

Is the proposal that all 3 files be called sample1_R{1,2}.fq.gz, but that they are in different folders? I can live with that. There is some potential for misunderstanding if people start symlinking to those files, but an ls -l is sufficient to clear up the confusion, and the goal of enabling a shortcut into StaG to separate primary QC from project analysis is well worth this risk.

@StefPN says:

I am, like you, hesitant about identical file names. Is it not possible for StaG to accept wildcards in the input file name, e.g. sample1_R{1,2}*fq.gz? If not, I guess we need to use identical names in separate folders and trust that people know what they are doing.

We pretty much agreed that having the same filenames in the different folders is not such a big problem.

The next step is renaming intermediary preprocessing output files.

Take the following input files as an example, matching the input file pattern {sample}_{readpair}.fq.gz:

input/sample1_1.fq.gz
input/sample1_2.fq.gz

The output files from preprocessing steps will be:

fastp/sample1_1.fq.gz
fastp/sample1_2.fq.gz

and

host_removal/sample1_1.fq.gz
host_removal/sample1_2.fq.gz

Note that these output filenames will be the same regardless of which input file pattern is used. So for input files matching the following pattern: {sample}_R{readpair}.fastq.gz (more Illumina-like), the fastp and host_removal output files will still have the same name as shown above.
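The renaming described above can be sketched in plain Python. This is an illustration under assumptions, not StaG's actual code: the helper names pattern_to_regex and standardized_name are hypothetical, and each {wildcard} is assumed to match one or more arbitrary characters.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern such as '{sample}_R{readpair}.fastq.gz' into a regex."""
    # Splitting on '{name}' with a capture group alternates literal text and
    # wildcard names: ['', 'sample', '_R', 'readpair', '.fastq.gz'].
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # named wildcard
    return re.compile(regex)


def standardized_name(step, filename, input_fn_pattern):
    """Map an input filename to the fixed '{sample}_{readpair}.fq.gz' name in a step folder."""
    match = pattern_to_regex(input_fn_pattern).fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match {input_fn_pattern!r}")
    return f"{step}/{match['sample']}_{match['readpair']}.fq.gz"


for name, pattern in [
    ("sample1_1.fq.gz", "{sample}_{readpair}.fq.gz"),
    ("sample1_R1.fastq.gz", "{sample}_R{readpair}.fastq.gz"),
]:
    print(standardized_name("fastp", name, pattern))
```

Both input patterns end up producing the same intermediary name, fastp/sample1_1.fq.gz, which is what makes it possible to enter the workflow at a later step.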

To shortcut the DAG (and start StaG from a different starting point), users need to do the following. First, put symlinks to the input files they want to use in the output_dir/fastp/ or output_dir/host_removal/ directory, depending on where they want to start StaG from. Second, make sure the symlink filenames conform to the hardcoded pattern {sample}_{readpair}.fq.gz, and put this pattern in the input_fn_pattern variable in the configuration file (this is the default pattern). Finally, set the inputdir variable in the configuration file to the directory containing the symlinks, so that StaG can find all the sample names by looking at the input file names.
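Creating the symlinks could be scripted along the following lines. This is a minimal sketch and not part of StaG: the function link_preprocessed and the example paths are hypothetical, and the caller is assumed to already know the sample name and read pair number of every file.

```python
from pathlib import Path


def link_preprocessed(sources, start_dir):
    """Symlink preprocessed files into start_dir as '{sample}_{readpair}.fq.gz'.

    sources maps (sample, readpair) tuples to existing files; start_dir is
    e.g. output_dir/fastp or output_dir/host_removal.
    """
    start_dir = Path(start_dir)
    start_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for (sample, readpair), src in sorted(sources.items()):
        link = start_dir / f"{sample}_{readpair}.fq.gz"
        # Leave any existing file or symlink with that name untouched.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(Path(src).resolve())
        links.append(link)
    return links


# Example call (hypothetical paths):
# link_preprocessed(
#     {("sample1", "1"): "/data/clean/sample1.R1.fastq.gz",
#      ("sample1", "2"): "/data/clean/sample1.R2.fastq.gz"},
#     "output_dir/host_removal",
# )
```

After linking, inputdir in the configuration file would point at the same directory that was passed as start_dir, and input_fn_pattern would stay at its default.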