Comments on first draft of pipeline implementation

Question

Closed this issue 4 years ago · 1 comments

See this patch for my edits. Some comments:

For text before pipeline code

In the markdown documentations, please try use richer Markdown syntax such as proper bullet points - (not --), and triple "`" symbol for chunks of codes. I have changed some of those
I suggest you try rephrase the "Overview" section, to practice your writing skills.
Please use hyperlinks to external references, documents or codes you discuss in your notebook. For example a link to file should be provided and embedded to the text where you mentioned that we use a customized version of R script. Also for example links to PLINK and GEMMA websites. Because you have downloaded them anyways you must have visited those sites. Keeping track of them is good habit.
The input and output sections should be improved. We don't introduce command options in this section. Command options are always best introduced in the pipeline as comment strings that will display with -h. I think you did that already -- I have also made some edits to your comment strings for the parameters. What we need to discuss for input and output are notes on file formats, for people who want to prepare their own input data, and try to interpret the output data. I suggest you rework those sections.
Working example: please use relative paths, not absolute path. Also there is no need to add quote to strings.

Global section:

We should simply require and ensure that some software are available. There is no need to make them parameters. What you have for now is not 100% useful anyways because which plink will be empty for those who dont have the software. Instead please see depends: executable('plink') etc that I added to the pipeline. This is the way to make sure they exist.
Notice i changed input format quite a bit for genotype data

Step 1:

As I have commented in person earlier, let's name variables to something with some meanings. I therefore renamed your STEP_* workflow step names.
There is no need to cd . You can use workdir = option for the script to run -- I have implemented that.
Notice I used stderr and stdout. This will collect all screen outputs and save to those files. Thsi is good practice for workflows -- you dont want to jam your screen, particularly when there are many parallel jobs each printing something to the terminal.
There is no need to echo outputs, if you just want ot check whether your pipeline reads correctly -- you can always use sos dryrun to display the code and not run them.

Step 2:

Notice i used cache folder. This is because at the end of the day, everything in cache folder can be safely deleted to save storage space if desired. This step generates intermediate results that we can put in cache.
Don't echo anything. If you want to check whether your script is written fine, please use sos dryrun which will print the script for you to check.
Don't use cd in script. Instead, use workdir = ... in SoS

There are more comments I will not list here. Let's discuss them in person.

Answer 1 · 2020-09-14T14:11:22.000Z

@hsun3163 just a reminder that please fix up comments and documentation type of text when you get a chance. Otherwise, we are all set!