List parameters and their annotations in `global` section with `-h` option
gaow opened this issue · 34 comments
So suppose I have this workflow:
$ sos run workflows/fastqtl_to_mash.ipynb -h -v3
Available workflows
export
default
convert
Sections
convert_0: Generate utility functions
convert_1: Convert summary stats gzip format to HDF5
convert_2: Merge single study data to multivariate data
default_3: Extract data to fit MASH model
default_4: Compute for MASH EZ model and save to RDS
export: Export notebook to HTML file
it neatly lists all steps in the workflow. But under `[global]` I have something like this:
parameter: cwd = path('./gtex6_workflow_output')
parameter: data_list = path("data/eQTL_summary_files.txt")
parameter: gene_list = path()
parameter: msg = "GTEx eQTL summary statistics"
# maximum number of groups per HDF5 file
parameter: maxsize = 1000
parameter: cols = [8, 9, 7]
parameter: keep_ensg_version = 0
# number of null samples per gene to train MASH mixture
parameter: null_per_gene = 9
# a list of effect names (SNP names) to exclude from mash analysis
parameter: exclude_mash_effects_list = 'NULL'
# a list of condition names (tissue names) to include in mash output.
# Default includes all conditions.
parameter: include_mash_condition_list = 'NULL'
# size of MASH train set
parameter: mash_train_size = 20000
Correct me if I am wrong, but I thought we could list parameters with `-v`? I do not think it works, though. It would be great if we could list parameters along with what they are, e.g.:
--exclude_mash_effects_list: a list of effect names (SNP names) to exclude from mash analysis
Not sure if `#` is the best identifier for parameter comments to show, but we could also do `##` or `#'` (roxygen convention). I'm just saying that showing parameter options could be useful when I make quick software-type workflows for others to run without looking into the details.
We had something like this when all parameters were defined in a [parameters]
section. It is more difficult to do now because parameters are scattered in the script, and are parsed only when the steps are called. In the extreme case, parameters are only invoked with dynamically created subworkflows. That is to say, we cannot determine in advance which parameters belong to which workflow, etc.
So the best that can be done, and perhaps already good enough, is to list all parameter definitions (and their comments) in the script, regardless of the workflow. Then there can be weird things such as parameters with the same names but different types/meanings that were designed for different workflows.
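A best-effort collector along these lines could look like the following Python sketch (a hypothetical helper, not SoS's actual parser): it walks the script and pairs each `parameter:` definition with the comment lines immediately above it, regardless of which workflow the parameter belongs to.

```python
import re

# Hypothetical sketch (not SoS's actual implementation): collect every
# `parameter:` definition in a script together with the comment lines
# immediately above it, regardless of which workflow it belongs to.
PARAM_RE = re.compile(r'^parameter:\s*(\w+)\s*=\s*(.*)$')

def collect_parameters(script_text):
    params, pending = [], []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            pending.append(stripped.lstrip('#').strip())
            continue
        m = PARAM_RE.match(stripped)
        if m:
            params.append((m.group(1), m.group(2), ' '.join(pending)))
        pending = []          # any non-comment line resets the block
    return params

script = '''
# maximum number of groups per HDF5 file
parameter: maxsize = 1000
parameter: cols = [8, 9, 7]
'''
for name, default, doc in collect_parameters(script):
    print(f'--{name:20} {default:12} {doc}')
```

As the comment above notes, this deliberately ignores section boundaries, which is why same-named parameters from different workflows would all show up in one list.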
I see, that is why I thought we had that option before.
Then there can be weird things such as parameters with the same names but different type/meaning that were designed for different workflows.
I'm a bit confused here: for this workflow for example:
[a]
parameter: t=1
[b]
parameter: t=path('/tmp')
Then on the command line, will `--parameter 1` work? I assume it will for `a` but not `b`? Or, I am assuming `a` and `b` never get executed in one workflow, that is, `sos_run('a+b')` should quit with some error due to parameter conflicts? I'm just wondering what the behaviors will be and whether they are intuitive.
The behavior is that `sos run a` or `sos run b` will include relevant sections and create workflows, so hopefully only one version of the parameter will be used. `sos_run('a+b')` will lead to undefined behavior; the current behavior is that whichever `parameter: t=` statement comes first will consume the command line argument, and the next `parameter: t=` will be ignored.
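The first-consumer behavior described here can be modeled roughly as follows (a toy sketch, not SoS internals):

```python
# Toy model of the described behavior (not SoS internals): the first
# `parameter: t=` statement encountered consumes the command-line
# value; later definitions of the same name are silently ignored.
def bind_parameters(definitions, cli_args):
    """definitions: list of (name, default) in parse order;
    cli_args: mapping of command-line overrides."""
    bound = {}
    for name, default in definitions:
        if name in bound:
            continue          # duplicate definition: ignored
        bound[name] = cli_args.get(name, default)
    return bound

# workflows a and b combined, as in sos_run('a+b')
defs = [('t', 1), ('t', '/tmp')]
print(bind_parameters(defs, {'t': '5'}))  # first definition wins: {'t': '5'}
print(bind_parameters(defs, {}))          # no override: {'t': 1}
```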
I see, okay I think this is fair enough. Banning it entirely may be a bit restrictive.
Let me know how this works.
Note that in early versions of SoS, when all parameters were defined in a separate section, we included the comments before a parameter definition as comments for the parameter. This was deprecated because it can be confused with the section comment, which we still honor. E.g.
[10]
# this steps does this
parameter: a = 1
We could enforce something like
[10]
# this is section comment
parameter: a = 1 # for whatever purpose
if it is helpful.
Just noticed that a comment after `parameter:` will be displayed in the help message due to the way the parameter value is saved during the parsing stage.
parameter: a = 1 # for whatever purpose
This is great! Here is what I have for a workflow I'm on right now, for example:
$ sos run gtex6_mash_analysis.ipynb -h
Usage:
sos-runner gtex6_mash_analysis.ipynb [workflow_name] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
options: Single-hyphen sos parameters (see "sos-runner -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sfa_download
default
flash
mash-paper
mash_scripts_download
mash-fast
sfa
export
Global Parameters:
cwd path('./gtex6_workflow_output')
data path("data/MatrixEQTLSumStats.Portable.Z.rds")
mash_src file_target("/opt/mash-paper/main.R")
sfa_exe file_target("/opt/sfa/bin/sfa_linux")
algorithm 'paper'
vhat 1
alpha 1
Sections
sfa_download: Download / install SFA (no need if running from docker `gaow/mash-paper`)
sfa: Perform SFA analysis (time estimate: <1min)
Parameters:
K 5
mash_scripts_download: Download / install MASH scripts (no need if running from docker `gaow/mash-paper`)
mash-paper_1: Compute data-driven prior matrices (time estimate: 40min to 4hrs depending on the
machine power)
mash-paper_2: Add in canonical configurations and single rank SFA priors (time estimate: <1min)
mash-paper_3: Fit MASH mixture model (time estimate: ~2.5hr)
mash-paper_4: Posterior inference on the "top" set of gene-snp pairs (time estimate: ~5hr on single
thread)
flash: Perform FLASH analysis (time estimate: 20min)
mash-fast_1: Compute data-driven / canonical prior matrices (time estimate: ~2h)
mash-fast_2: Fit MASH mixture model (time estimate: <5min)
mash-fast_3: Posterior inference on the "top" set of gene-snp pairs (time estimate: <3min)
export: Export notebook to HTML file
default: Run all analysis in this notebook
Now some suggestions:
- Replace `sos-runner` with `sos run`, because most people would use the `sos run` command?
- How are Workflows ordered? They are not ordered the same way as they appear in the notebook, which is a bit less readable ...
- For parameters, global or section-specific: can we display `--` so that users will know how to change them? E.g.:
Global Parameters:
--cwd path('./gtex6_workflow_output')
--data path("data/MatrixEQTLSumStats.Portable.Z.rds")
--mash_src file_target("/opt/mash-paper/main.R")
--sfa_exe file_target("/opt/sfa/bin/sfa_linux")
--algorithm 'paper'
--vhat 1
--alpha 1
Sections
sfa_download: Download / install SFA (no need if running from docker `gaow/mash-paper`)
sfa: Perform SFA analysis (time estimate: <1min)
Parameters:
--K 5
which makes it very clear that those are parameters?
Other than that, this is mostly perfect, making SoS now very much like an executable, particularly when dockerized.
Try again.
Cool, although when I tried to add comments:
Global Parameters:
--cwd path('./gtex6_workflow_output')
--data_list path("data/eQTL_summary_files.txt")
--gene_list path()
--msg "GTEx eQTL summary statistics"
--maxsize 1000 # maximum number of groups per HDF5 file
--cols [8, 9, 7]
--keep_ensg_version 0
--random_per_gene 9 # number of random SNPs to draw per gene for fitting MASH mixture
--random_snp_size 20000 # size of mash random SNPs. This specifies the size of `random` SNP set and the rest of random SNPs go into `random_test` SNP set.
--effects_list path('NULL') # a list of effect names (SNP names) to include in mash analysis.
--conditions_list path('NULL') # a list of condition names (tissue names) to include in mash analysis
This is fine I guess, because I figure it might be hard to properly add line-wrapping etc. unless we do some serious parsing ...
We have not decided how to add comments to parameters yet. `# comment` right now works as a side effect of having `#` as part of the default value (before parsing/evaluation). If we agree on this convention, I can display it properly with line wrapping etc.
Well, I would like this more:
# number of random SNPs to draw per gene for fitting MASH mixture
parameter: random_per_gene = 9
if possible :)
But we have the ambiguity of
[10]
# step or parameter comment?
parameter: a = 1
Unless we can introduce some convention such as
[10]
# step comment
## parameter comment
parameter: a = 1
Indeed. I was originally proposing the roxygen convention `#'`; not sure how you feel about it.
My gut feeling is that sos should try to "understand what users write" instead of "force users to write in sos way". I mean, instead of introducing some complex rules such as
[10]
#! step comment
# free comment
#' parameter comment
parameter: a = 1
I would rather do
[10]
# step comment
# free comment
# parameter comment
parameter: a = 1
The convention
[10]
# step comment
# free comment
# parameter comment
parameter: a = 1
is implemented. Let me know how it works.
Okay, you are saying that comments immediately before a parameter will be the parameter comment? That is reasonable to me. And the way to distinguish is to use line breaks?
I checked the current behavior. It is good!
Yes. Consecutive comments immediately after a section header will be the section comment; consecutive comments immediately before a parameter definition will be the parameter comment.
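As a rough sketch of that rule (hypothetical code, not the actual SoS implementation), the blank line is what separates the section-comment block from a later parameter-comment block:

```python
def classify(lines):
    """Toy model of the stated rule: a comment block touching the
    section header documents the section; a comment block touching a
    `parameter:` line documents that parameter."""
    section_comment, param_comments = [], {}
    pending, touches_header = [], False
    for raw in list(lines) + ['']:        # sentinel flushes the last block
        s = raw.strip()
        if s.startswith('#'):
            pending.append(s.lstrip('#').strip())
            continue
        if s.startswith('parameter:') and pending and not touches_header:
            name = s.split(':', 1)[1].split('=')[0].strip()
            param_comments[name] = ' '.join(pending)
        elif touches_header and pending:
            section_comment = pending
        touches_header = s.startswith('[') and s.endswith(']')
        pending = []
    return ' '.join(section_comment), param_comments

example = """[10]
# step comment

# parameter comment
parameter: a = 1""".splitlines()
print(classify(example))   # ('step comment', {'a': 'parameter comment'})
```

Without the blank line, the comment block touches the header and is treated as the section comment, which is the ambiguity discussed above.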
Okay, maybe another final minor point, sorry for being a nuisance: in the help message, those parameters are termed `workflow_options: Double-hyphen workflow-specific parameters`. So I think maybe we should also change `Global Parameters` to `Global workflow options`, and `Parameters` under `Sections` to `workflow options`?
BTW this looks nice to me:
usage: sos run fastqtl_to_mash.ipynb
[workflow_name] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
convert
default
export
Global Parameters:
--cwd path('./gtex6_workflow_output')
--data_list path("data/eQTL_summary_files.txt")
A text file containing data-set names
--gene_list path() Optionally, a list of gene names.
--msg "eQTL mapping summary statistics"
Meta-info tag for HDF5 database
--cols [8, 9, 7] Columns for betahat, se(betahat) and p-value (1-based indexing)
--keep_ensg_version 0
Sections
convert_0: Generate utility functions
convert_1: Convert summary stats gzip format to HDF5
convert_2: Merge single study data to multivariate data
default_3: Extract data to fit MASH model
Parameters:
--random_per_gene 9
Number of random SNPs to draw per gene for fitting MASH mixture
default_4: Compute for MASH EZ model and save to RDS
Parameters:
--random_snp_size 20000
Size of mash random SNPs. This specifies the size of `random` SNP set and the rest of
random SNPs go into `random_test` SNP set.
--effects_list path('NULL')
A list of effect names (SNP names) to include in mash analysis.
--conditions_list path('NULL')
A list of condition names (tissue names) to include in mash analysis
export: Export notebook to HTML file
Need to go. I also tried to change
--data_list path("data/eQTL_summary_files.txt")
to
--data_list data/eQTL_summary_files.txt (as path)
You can change this or other formats at the end of `utils.py`.
Great! I find the 24-character rule does not always work well, so I changed it to a new line for all comments:
usage: sos run fastqtl_to_mash.ipynb
[workflow_name] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
convert
default
document_it
Global Workflow Options:
--cwd gtex6_workflow_output (as path)
--data-list data/eQTL_summary_files.txt (as path)
A text file containing data-set names
--gene-list . (as path)
Optionally, a list of gene names.
--msg 'eQTL mapping summary statistics'
Meta-info tag for HDF5 database
--cols 8 9 7 (as list)
Columns for betahat, se(betahat) and p-value (1-based
indexing)
--keep-ensg-version 0 (as int)
Sections
convert_0: Generate utility functions
convert_1: Convert summary stats gzip format to HDF5
convert_2: Merge single study data to multivariate data
default_3: Extract data to fit MASH model
Workflow Options:
--random-per-gene 9 (as int)
Number of random SNPs to draw per gene for fitting MASH
mixture
default_4: Compute for MASH EZ model and save to RDS
Workflow Options:
--random-snp-size 20000 (as int)
Size of mash random SNPs. This specifies the size of
`random` SNP set and the rest of random SNPs go into
`random_test` SNP set.
--effects-list NULL (as path)
A list of effect names (SNP names) to include in mash
analysis.
--conditions-list NULL (as path)
A list of condition names (tissue names) to include in
mash analysis
document_it: Export workflow to HTML document
I'm quite happy with the outcome.
The output is actually not accurate, because workflows can also be triggered by targets. These are more difficult to collect and display, though.
This update lists the targets the script provides, and another way to start the script (`-t targets`). Please close the ticket if you are OK with the end result.
After more thought, I think having the section comment before the section makes more sense and is more consistent. I mean
# step comment
[10]
# free comment
# parameter comment
parameter: a = 1
That is to say, anything immediately before a section header or parameter will be its comment.
One additional rule could be that the second comment block is the script description:
#!/usr/bin/env sos-runner
# this is script comment
#
# this is script comment
# this is private stuff
# global parameter
parameter: a = 1
# this is first section
[10]
# this is implementation details
# parameter
parameter: b = 2
Documented here. The advantage of the new convention is that you can do
#global par
parameter: gp = 1
[10]
# section par
parameter: sp = 2
without worrying about conflicts with script and section comments.
I want to demonstrate this feature with one of the SoS examples but I noticed this issue:
sos-docs/src/examples$ sos run Process_Oriented.sos -h
Workflow Options:
--N 40 200 (as tuple)
training and testing samples
--rstd 3 (as int)
training and testing samples
--replicate 1 2 3 4 5 (as list)
training and testing samples
The script has
# training and testing samples
parameter: N = ..
parameter: rstd = ..
...
Apparently, because I did not have a comment for `rstd` etc., those parameters incorrectly use the comment from the first parameter `N` and display it.
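The reported behavior looks like a "sticky buffer" bug; a minimal illustration (hypothetical code, not the actual fix in SoS) is that the pending comment block must be cleared once the first parameter consumes it:

```python
# Minimal illustration of the reported bug: if the pending comment
# buffer is not cleared after a parameter consumes it, the same comment
# leaks into every following undocumented parameter.
def attach_comments(lines, clear_after_use=True):
    docs, pending = {}, []
    for raw in lines:
        s = raw.strip()
        if s.startswith('#'):
            pending.append(s.lstrip('#').strip())
        elif s.startswith('parameter:'):
            name = s.split(':', 1)[1].split('=')[0].strip()
            docs[name] = ' '.join(pending)
            if clear_after_use:
                pending = []      # the fix: comment is used up
        else:
            pending = []
    return docs

script = [
    '# training and testing samples',
    'parameter: N = (40, 200)',
    'parameter: rstd = 3',
]
print(attach_comments(script, clear_after_use=False))  # buggy: rstd inherits N's comment
print(attach_comments(script))                         # fixed: only N is documented
```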
Try again.
In my RNA-seq DE example, I noticed that for comment strings starting at the beginning of a cell, the content will not be displayed with `-h`. For example:
$ sos run RNASeqDE.ipynb -h
usage: sos run RNASeqDE.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sra
hg19_reference
star
align
Global Workflow Options:
--cwd /home/gaow/GIT/LargeFiles/RNAseqDE (as path)
--samples SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521 (as list)
--ncpu 4 (as int)
Sections
sra:
hg19_reference_1:
hg19_reference_2: Use `twoBitToFa` to extract `hg19.fa` from `hg19.2bit`
hg19_reference_3: Download `Homo_sapiens.GRCh38.91.gtf.gz` from Ensembl
star: Prepare STAR reference data
align_1:
align_2:
you see that `sra`, `hg19_reference_1`, `align_1` and `align_2` have no documentation, but they in fact do have it in the notebook! I guess it is because they appear at the first line of a cell and somehow got ignored?
Yes, I changed the comment style for plain sos, but forgot to change sos notebook.
Perhaps you have not noticed, but the SoS comment style has been changed to "before content". That is to say, comments immediately before the section header (and parameter definition) will be the comment for the header (and parameter). This style is more consistent than, for example, having comments after the section header but before the parameter, and avoids confusion in a common scenario:
[section]
# comment
parameter: a = 1
When it comes to SoS Notebook: currently sos extracts only lines after section headers, so that we do not extract magics in cases such as
%run
[section]
# comment
This worked before, since a `# comment` after `[section]` counted, but not now. Therefore, the code needs to be changed to include comments immediately before `[section]`.
We still have a problem with the script comment, though. Currently the plain format takes the second comment block as the script comment, but there is no such concept in the notebook format.
Perhaps you have not noticed, the sos comment style has been changed to "before content"
Yes, and this is how it is currently implemented in SoS Notebook. I think, with the exception of the first comment in the cell, everything works as expected, even though you said you did not change it for notebooks. That is, in SoS Notebook lines before the section header are actually extracted, but not when they are the first lines of a cell.
Please check if everything works ok now (extract from sos, not sos-notebook, which is still broken), and if not, change the sample notebook and test case to demonstrate the problem.
It works now for my example:
usage: sos run RNASeqDE.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sra
hg19_reference
star
align
Global Workflow Options:
--cwd /home/gaow/GIT/LargeFiles/RNAseqDE (as path)
--samples SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521 (as list)
--ncpu 4 (as int)
Sections
sra: decrypt SRA files
hg19_reference_1: Download `hg19.2bit` and `twoBitToFa` from UCSC
hg19_reference_2: Use `twoBitToFa` to extract `hg19.fa` from `hg19.2bit`
hg19_reference_3: Download `Homo_sapiens.GRCh38.91.gtf.gz` from Ensembl
star: Prepare STAR reference data
align_1: Align to SAM file
align_2: Convert SAM to BAM file & zap the input
and this is extracted from SoS Notebook (am I missing something again?)! So I'm not sure if I should close the ticket ...
The extraction was implemented in `sos` so that it does not have to depend on `sos-notebook`. It is re-implemented as a converter in sos-notebook, and again in JS for sos-notebook for magics to work...
I see what you meant now! Okay, this ticket is about `-h`, and since it also seems you've fixed the ipynb-related JS code, I guess we can close the ticket now and use new tickets for future issues.