Comments on first draft of pipeline implementation
Closed this issue · 1 comments
gaow commented
See this patch for my edits. Some comments:
For text before pipeline code
- In the Markdown documentation, please try to use richer Markdown syntax, such as proper bullet points with `-` (not `--`) and triple "`" symbols for chunks of code. I have changed some of those. I suggest you try rephrasing the "Overview" section, to practice your writing skills.
- Please add hyperlinks to the external references, documents, or code you discuss in your notebook. For example, where you mention that we use a customized version of an R script, a link to the file should be embedded in the text. Likewise, link to the PLINK and GEMMA websites. Since you downloaded the software, you must have visited those sites. Keeping track of them is a good habit.
- The input and output sections should be improved. We don't introduce command options in these sections. Command options are always best introduced in the pipeline as comment strings that are displayed with `-h`. I think you did that already; I have also made some edits to your comment strings for the parameters. What we need to discuss for input and output are notes on file formats, for people who want to prepare their own input data and interpret the output data. I suggest you rework those sections.
- Working example: please use relative paths, not absolute paths. Also, there is no need to add quotes to strings.
For pipeline implementation
Global section:
- We should simply require and ensure that the necessary software is available. There is no need to make them parameters. What you have now is not fully useful anyway, because `which plink` will be empty for those who don't have the software. Instead, please see `depends: executable('plink')` etc. that I added to the pipeline. This is the way to make sure they exist.
- Notice I changed the input format quite a bit for genotype data.
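
To illustrate why a hard requirement is more useful than a parameter filled in from `which plink`, here is a minimal plain-Python sketch of the same fail-early idea. It is only an analogue, not how SoS implements `depends: executable(...)`. The helper name `require_executables` is hypothetical, and `gemma` as the command name for GEMMA is an assumption.

```python
import shutil


def require_executables(names):
    """Raise early if any required command-line tool is not found on PATH."""
    # shutil.which returns None when a command cannot be found, which is the
    # Python counterpart of `which plink` printing nothing.
    missing = [name for name in names if shutil.which(name) is None]
    if missing:
        raise RuntimeError(
            "Required software not found on PATH: " + ", ".join(missing)
        )


# Hypothetical usage: call once, before any analysis step runs.
# require_executables(["plink", "gemma"])
```

With a check like this, a missing tool stops the run with a clear message instead of passing an empty path on to later steps.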
Step 1:
- As I commented in person earlier, let's give variables meaningful names. I therefore renamed your `STEP_*` workflow step names.
- There is no need to `cd`. You can use the `workdir =` option for the script to run; I have implemented that.
- Notice I used `stderr` and `stdout`. This will collect all screen output and save it to those files. This is good practice for workflows: you don't want to jam your screen, particularly when there are many parallel jobs each printing something to the terminal.
- There is no need to `echo` outputs if you just want to check whether your pipeline reads correctly. You can always use `sos dryrun` to display the code without running it.
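
The `workdir`, `stderr`, and `stdout` points above can be sketched in plain Python as follows. This is only an analogue of the SoS options, not their implementation. The function `run_logged`, the log file naming, and the stand-in command are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(command, workdir, log_prefix):
    """Run `command` inside `workdir`, saving its stdout and stderr to files."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    stdout_path = workdir / f"{log_prefix}.stdout"
    stderr_path = workdir / f"{log_prefix}.stderr"
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        # cwd= takes the place of an explicit `cd`; the command's output goes
        # to the two log files rather than to the terminal.
        result = subprocess.run(command, cwd=workdir, stdout=out, stderr=err)
    return result.returncode, stdout_path, stderr_path


if __name__ == "__main__":
    # A stand-in command, so the sketch runs without PLINK installed.
    with tempfile.TemporaryDirectory() as tmp:
        code, out_file, err_file = run_logged(
            [sys.executable, "-c", "print('step finished')"],
            Path(tmp) / "output",
            "step1",
        )
        print(code, out_file.name, err_file.name)
```

Each job writes to its own pair of log files, so many parallel jobs can run without interleaving their messages on the screen.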
Step 2:
- Notice I used a `cache` folder. This is because, at the end of the day, everything in the cache folder can be safely deleted to save storage space if desired. This step generates intermediate results that we can put in the cache.
- Don't `echo` anything. If you want to check whether your script is written correctly, please use `sos dryrun`, which will print the script for you to check.
- Don't use `cd` in the script. Instead, use `workdir = ...` in SoS.
There are more comments I will not list here. Let's discuss them in person.